MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Twitter Changes Its API Rules, Makes Sure Monetization Strategy Works Out

Clarified that restrictions only apply to “core clients.” (March 16, 2011, 11:44)

You might have heard that Twitter recently updated their Twitter API Rules. In case you don’t know (or care), Twitter provides an extensive web API which allows you to perform all the critical functions from third party applications over the internet.

Together with the changes came a lengthy post explaining in particular the restrictions on applications to provide a “consistent user experience”, whatever that is supposed to mean.

In essence, and the post is pretty clear about it, Twitter does not want third party clients for Twitter. The new rules forbid providing any of the core Twitter functionalities like tweeting, retweeting, and in particular trends in a way which differs too much from how the official Twitter clients do it. TechCrunch has a nice post summarizing Twitter’s position.

While I agree that it makes sense to have core functionalities consistently named and designed, the changed API rules also specifically mention user suggestions and trending topics. I think that Twitter is missing out on a big chance to make their service more valuable by discouraging people from putting more advanced trending and recommender algorithms into clients.

Discoverability of interesting content has always been one of Twitter’s weak points. The currently available user suggestions and trending topics are certainly a first step in the right direction, but there is still a lot of room for improvement. For example, I think that trending topics need to be much more customized to the individual user than the global topics available right now. The sheer volume of tweets on Twitter means that there are literally thousands of topics being discussed at any given time, and not all of those topics are interesting to everyone. Global trends are dominated by the masses, but not everyone shares the masses’ interests.

Strictly speaking, these restrictions only apply to “core clients” which provide an essential part of the Twitter functionality. Twitter does encourage people to build publisher tools, curation, or real-time analysis tools. But still, what good are these tools if they aren’t integrated into your Twitter client? Wouldn’t you want to easily access neat stuff like individualized trending topics from your client instead of having to switch apps or go to a different web-site?

So why has Twitter chosen to deliberately force clients to conform with their standards and use their suggestions and trending topics? You only need to look at their current plans to make money to understand why. You can pay Twitter to have your tweets, users or topics show up in people’s timelines, user suggestions, or trending topics. I’m sure Twitter would be having a hard time charging for this service if people could just switch their client to make all that noise disappear.

I might be wrong, but to me it looks a lot like “consistent user experience” primarily means “make sure the users will consistently see the ads people have paid for.”

From the Cluetrain Manifesto to Social Media

I recently finished reading the “Cluetrain Manifesto” my friend Leo (a.k.a. thinkberg) pointed me to. Originally published in 1999 during the dot-com bubble, it was really fun reading it now, more than ten years later. (Although one could have probably condensed the text into about a third of its original length.)

For those of you who were as ignorant as me (or who were just too young in 1999 to care), the book is all about how the Internet could transform the current state of business to get back to a place where it’s more about people having actual conversations as opposed to mass produced, mass marketed goods being shoved down customers’ throats.

“Markets are Conversations” is one of the main mantras of the book, the other two possibly being “Rediscover your Human Voice” and “Removing the Firewalls that Separate Customers from Employees”.

The authors argue that originally (back when we were still living in small huts in the forest), markets were all about conversations. People would meet at markets, exchange news, talk to one another and have real conversations with the people they were doing business with. The people who built something usually also were the ones who knew everything about their business.

According to the authors, everything went downhill starting with the industrial revolution. Production processes were streamlined such that workers didn’t need to know much about their craft. Just as workers became dehumanized and exchangeable, so did customers. No longer were you talking to individual customers; instead, you talked to focus groups through one-way ad campaigns.

Back in 1999, the authors of the Cluetrain Manifesto saw the possibility to change all this with the Internet. They said that the ease with which people can connect and communicate on the Internet allows companies to re-engage in actual conversations with their customers, but also to have their employees themselves reconnect to one another. The hyperlink itself defies hierarchy as it can connect different parts without requiring getting permission from anyone.

Their vision was a world where there is no place for companies with a rigid hierarchy, built like a fortress to keep the employees and customers apart.

For me, the most interesting part of reading the book was that it actually predates the whole social media movement of the last years. The Internet of the book mostly consists of static web sites, email, mailing lists, and usenet news groups. (Does anyone still know what that is? It was/is kind of a decentralized feed forward discussion forum, made more or less obsolete by faster Internet connections. I think Thunderbird can still connect to a usenet server if you want to. Just go File > New > Other Account and select “Newsgroup account”.) Blogs weren’t mainstream yet either: Blogger was launched on August 23, 1999, LiveJournal on April 15, 1999.

They also didn’t have Twitter, Facebook, Flickr, or YouTube, and no Wikipedia.

So where are we now? I think part of what they have said actually came true. Customers have definitely become more connected. For every product that is launched, you can find thousands of blog posts, forum discussions, etc. to get first hand information about how that product really is. Some companies certainly have become more open and are active on Twitter, trying to engage in actual conversations with their customers.

On the other hand, for many companies, it’s still business as usual. You seldom see some guy working for a company participate in those discussions to give unfiltered first-hand information. Companies which come closest to the scenario described in the Cluetrain Manifesto are probably companies whose products are developed under an Open Source scheme. There, it is pretty common to be very active on bug trackers, mailing lists, etc.

Also concerning the removal of the wall between employees and customers, it seems that most companies are still clinging to that strict separation. Sun has been quite open about allowing customers to see inside the company, but right now we’re seeing how all these pages are overhauled to fit the Oracle corporate identity. Also, in all the research institutions I have worked at, there was definite pressure to move from whatever web page you have to one which conforms to the corporate identity. That is of course not a bad thing in itself, only that this usually also means that you have to use the official CMS, and depending on how lucky you get (at one institute, this was some overpriced more or less hand-written little thing whose web interface only worked with Internet Explorer), this can seriously stifle your creativity. Since we’ve switched to TYPO3 at TU Berlin, most people have given up on maintaining their page because the whole process seems too complex (which is not just a problem with TYPO3, but also with what’s involved in getting an account for it, and so on).

So are we done with the Cluetrain Manifesto? Certainly not. I think Twitter and Facebook have shown us new ways to have conversations on the web, to find and share information in real time, but I think there is still room for improvement, and the search for the best metaphor for open virtual conversations goes on.

Twitter has shown us the value of having explorable and discoverable conversations in real time (having tweets containing your user name show up automatically as well as being able to have RSS feeds on search terms is a great thing), but Twitter as a whole is still a bit too unstructured (ever tried following a conversation?). Facebook has more structure but is currently forcing its users to move from a private to a more public model of conversations with all the friction to be expected. I’m also not convinced that a monolithic closed platform is the right way to go. After all, the “Internet way” is decentralized systems built on open standards.

I think the Cluetrain Manifesto is still spot on at its core: It’s about finding a human voice and having real conversations between real people.

jblas 1.2.0: A look behind the scenes

I’ve just released jblas 1.2.0. The main additions are generalized eigenvalues and some support for 64 bit Windows in the form of pure FORTRAN (i.e. non-ATLAS) libraries:

  • jblas now has routines for generalized eigenvalues for symmetric matrices. The code has been provided by Nicolas Oury. See org.jblas.Eigen.

  • Finally, jblas comes with prebuilt libraries for Windows 64 bit. The bundled libraries are not ATLAS, though, which still cannot be compiled using cygwin, but lapack-lite. They aren’t terribly fast (matrix-matrix multiplication is about 50% faster than what I managed to do in pure Java), but at least you have the full functionality under Windows with 64 bit.

To celebrate this event, I thought I’d let you in on some of the internals behind jblas, if only to make sure that you never want to do this yourself ;)

The short version of what jblas does is that it builds a matrix library on high performance BLAS and LAPACK implementations like ATLAS.

As usual, the long version is a bit more involved. Here are a few of the things which need to be done to achieve this:

  • Compile ATLAS or another implementation of BLAS or LAPACK.

  • Create JNI stubs for each FORTRAN routine you want to package. Note that JNI is for C, so actually you have to bridge between C and FORTRAN as well in the stubs by translating C to FORTRAN calling conventions.

  • Create Java classes with lots of “native” methods so that Java knows about your functions.

  • Finally, write the matrix classes which use the native code.

  • For ease of use, package the shared libraries into the jar file and have them extracted and loaded automatically for the right operating system, platform, and possibly processor type.

Automating Stub Generation

Since writing the JNI stubs is highly repetitive, I actually wrote a bit of Ruby which parses the FORTRAN source code for BLAS and LAPACK, extracts the signatures of the FORTRAN functions, and automatically generates the JNI stubs. This is the code you find in the scripts subdirectory.

jblas actually does a bit more than just parsing out the type signatures. BLAS and LAPACK both use highly standardized comment sections which also identify which of the variables are input and which are output (FORTRAN always passes by reference, so a routine can write to any of the arguments passed to it). I use this information to be more selective when freeing arrays in the stubs. In JNI, when you release an array, you can indicate whether you want to copy back the changes or not (passing JNI_ABORT vs. passing zero as the release mode). Since this copying back and forth is an expensive operation, I try to identify when it is not necessary and do not copy the data back in those cases.
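The release decision described above can be sketched roughly as follows. This is only an illustration in Java, not the C code the Ruby generator actually emits; the Intent enum and the releaseMode helper are hypothetical names, while the numeric values mirror the release modes defined in jni.h.

```java
// Hypothetical sketch: pick the JNI release mode from an argument's documented intent.
final class ReleaseModes {
    // Values as defined in jni.h: mode 0 copies back and frees the buffer,
    // JNI_ABORT frees the buffer without copying anything back.
    static final int COPY_BACK_AND_FREE = 0;
    static final int JNI_ABORT = 2;

    // Intent as it could be read from the comment header of a BLAS/LAPACK routine.
    enum Intent { INPUT, OUTPUT, INPUT_OUTPUT }

    static int releaseMode(Intent intent) {
        // A pure input is never modified by the routine, so copying it back is wasted work.
        return intent == Intent.INPUT ? JNI_ABORT : COPY_BACK_AND_FREE;
    }
}
```

Only arguments marked as pure input skip the copy; anything the routine may write to is still copied back.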

The generated stub code also checks whether arrays are used in more than one place (when you pass the same array to a function in different arguments), in order to further minimize the number of copy operations. For some operations like copying data within one array, this alias detection is also strictly necessary, because if you copied the array twice, whether the changes are copied back would depend on the order in which you release the arrays.
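The idea behind the alias check can be illustrated like this. It is a minimal Java sketch, assuming that comparing object identity is enough to detect aliasing; the real stubs do this in generated C code, and the class and method names here are made up for the example.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch: find out which arguments refer to the very same array object.
final class AliasDetection {
    // For each argument, returns the index of the first argument holding the same array.
    // An argument that is not an alias of an earlier one maps to its own index.
    static int[] firstOccurrence(Object... arrays) {
        Map<Object, Integer> seen = new IdentityHashMap<>(); // compares by identity, not contents
        int[] owner = new int[arrays.length];
        for (int i = 0; i < arrays.length; i++) {
            Integer first = seen.putIfAbsent(arrays[i], i);
            owner[i] = (first == null) ? i : first;
        }
        return owner;
    }
}
```

A stub would then fetch and release the underlying data only once per distinct array, and reuse that pointer for every aliased argument.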

Another issue with LAPACK is the automatic computation of workspace sizes. Many of the routines require additional work space, and they have a special way of querying the amount of space required (usually by calling with a specific flag). Again this type of code is highly repetitive, so I also added code to detect workspace variables (usually ending in WORK) and also generate that code on the Java side.
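The two-call pattern looks roughly like this. The sketch assumes the usual LAPACK convention that calling a routine with LWORK = -1 performs a workspace query and returns the optimal size in the first element of WORK; the Routine interface is a hypothetical stand-in for a native method, not part of jblas.

```java
// Hypothetical sketch of the LAPACK workspace query pattern.
final class WorkspaceQuery {
    // Stand-in for a native LAPACK routine that takes a WORK array and its length LWORK
    // and returns an INFO-style status code.
    interface Routine {
        int call(double[] work, int lwork);
    }

    static int runWithWorkspace(Routine routine) {
        double[] query = new double[1];
        routine.call(query, -1);           // first call: only ask for the optimal workspace size
        int lwork = (int) query[0];
        double[] work = new double[lwork]; // allocate exactly what the routine asked for
        return routine.call(work, lwork);  // second call: the actual computation
    }
}
```

Generating this wrapper once per routine keeps the workspace handling out of the hand-written matrix classes.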

Finally, depending on whether you use f2c or gfortran, there are different calling conventions for passing back complex numbers.

More Code Generation

Another area where I resorted to code generation was the float versions of all routines. Since Java generics don’t cover primitive types, you basically have to write a float version of every double version by hand. I’ve automated this process, again with some Ruby scripts (one which generates, for example, FloatMatrix from DoubleMatrix, and one which duplicates each function with a float version, for example in classes like Eigen).

These Ruby scripts are run as part of the build process.
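To give an idea of what such a transformation involves, here is a deliberately naive sketch. It is written in Java rather than Ruby and is not the actual script; the class name and the two replacement rules are assumptions made for illustration, and a real generator also has to deal with things like numeric literals.

```java
// Hypothetical sketch: derive float source code from double source code by text substitution.
final class FloatVariant {
    static String toFloatSource(String doubleSource) {
        return doubleSource
                .replace("DoubleMatrix", "FloatMatrix") // class names
                .replaceAll("\\bdouble\\b", "float");   // the primitive type, whole words only
    }
}
```

For example, the line `public DoubleMatrix add(double v)` becomes `public FloatMatrix add(float v)`.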

Multi-platform Jars

The jar file contains the shared libraries for each operating system and processor subtype (where applicable). In order to determine the operating system, jblas uses the os.name and os.arch system properties. For distinguishing between SSE2 and SSE3, a bit more magic is necessary. In the class org.jblas.util.ArchFlavor, I again use some native code to invoke the CPUID instruction to determine the processor’s capabilities.
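Mapping the system properties to a location inside the jar could look something like this. The directory layout and the normalized OS names are assumptions for the sake of the example and do not necessarily match what jblas uses.

```java
import java.util.Locale;

// Hypothetical sketch: turn os.name and os.arch into a resource directory inside the jar.
final class PlatformPath {
    static String resourceDir(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String family;
        if (os.startsWith("windows")) {
            family = "windows";      // os.name is e.g. "Windows 7"
        } else if (os.startsWith("mac")) {
            family = "mac";          // os.name is e.g. "Mac OS X"
        } else if (os.startsWith("linux")) {
            family = "linux";
        } else {
            throw new UnsupportedOperationException("Unsupported OS: " + osName);
        }
        return "/lib/" + family + "/" + osArch + "/";
    }
}
```

In practice you would call it as `PlatformPath.resourceDir(System.getProperty("os.name"), System.getProperty("os.arch"))` and append the processor flavor where one applies.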

Once jblas has identified the right shared library, it is extracted from the jar file with getResourceAsStream and copied to a temp directory, from where the shared library is loaded with System.load().
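The extract-and-load step can be sketched as follows. This is a simplified version under my own assumptions (class name, temp directory prefix, error handling), not the code in jblas itself; only getResourceAsStream and System.load are taken from the description above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: copy a shared library out of the jar and load it.
final class NativeLoader {
    // Copies the stream into a fresh temp directory and returns the path of the new file.
    static Path extract(InputStream in, String fileName) throws IOException {
        Path dir = Files.createTempDirectory("native-libs");
        Path target = dir.resolve(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        target.toFile().deleteOnExit(); // best effort cleanup when the JVM exits
        return target;
    }

    // Looks up the library as a classpath resource, extracts it, and loads it.
    static void load(String resourcePath, String fileName) throws IOException {
        try (InputStream in = NativeLoader.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourcePath);
            }
            // System.load needs an absolute path to the file on disk.
            System.load(extract(in, fileName).toAbsolutePath().toString());
        }
    }
}
```

The detour through the file system is needed because System.load only accepts a path on disk, so the library cannot be loaded straight from inside the jar.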

The jblas Build Process

The build process is divided into a native part which generates the JNI stubs, and a Java part which regenerates the float versions and compiles the Java classes. This means that in the ideal case where you are just adding more functionality on the Java side, you don’t have to go through the native process at all, but can just work with all the shared libraries which are contained in src/main/resources.

The configure script is actually something homebrewed in Ruby. At the time, it seemed to me that given the mix of C and Java, and quite specific operations like finding out which LAPACK library contains all the required functions, I’d be happier writing something myself instead of trying to make autotools do it. Actually, the configure script is structured like a Rake file in terms of interdependent configure variables which are then automatically invoked in the right order, but that is another story…

The only time you need to touch the shared libraries is when you add new FORTRAN routines. Unfortunately, this also means you have to regenerate the code for all platforms, which is the reason why such releases take me a few days to finish, as I don’t have all computers available in one place.

In summary…

In summary, there is a lot going on behind the scenes to give you just that: a jar file which you can just put into your classpath and which provides you with really high-performance matrix routines.