MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Year's end thoughts: When products stop making sense

So I’ve got the upgrade blues. Last week, the Android 4.3 update (from 4.1, both named Jelly Bean, incidentally) became available, and although I had read reports about much reduced battery life, I thought, “well, they’ve certainly weeded out those bugs by now, right?” But as you can imagine, the battery drain is exactly what I got. The first day I told myself, “ok, I’ve been playing around looking for new features all day long”, but by now it’s pretty clear that something is wrong. Not only that, but the whole phone feels less responsive. I should have known better. I always tell people not to take future upgrades into account when buying a new phone (maybe apart from the Nexuses, where upgrades are sort of their main point). Updates are always late and somehow never achieve the same amount of polish the original release had.

I really should have known better because I already lived through all of this with my first Android phone, the HTC Desire. It was tragically underpowered in terms of app storage space, which just kept getting smaller and smaller with each update. The phone originally shipped with 2.1 and the update to 2.2 came, but the update to Gingerbread (2.3) took forever. Finally they released it as a “developer upgrade” which you had to install yourself over USB, wiping all data on the phone and leaving you with less than 100MB for apps.

And the worst thing is that there is really no feature I was holding my breath for, nor any bug I needed resolved. Instead, Android just keeps getting more and more bloated. NFC? I don’t even use Bluetooth. Apparently there’s now an option to use mobile Internet when it is faster than the available Wifi, which is nice, but it’s yet another layer of complexity. Going through the various system tools to find out which app is using so much battery, you see system internals which remind you much more of Linux than of a mobile device targeted at end users.

So why is it that Android just keeps getting worse? It’s as if Google has never heard of the KISS principle: Keep It Simple, Stupid. Which is funny because I normally criticize Google for overapplying engineering principles to product design. Instead of keeping Android simple, they are just pushing more cruft on devices. It only works because hardware vendors are at the same time ramping up the specs on the new devices. Just think about it: I’m carrying around a computer with a quad-core CPU clocked at more than 1 GHz with 2GB of RAM and a few GB of permanent storage. I remember how impressed I was when the first Pentium processors with more than 1 GHz came out, and the required cooling on those was absolutely insane.

So if you’ve made it this far in this post, you’d probably agree with me that many features of Android have just stopped making sense. At some point, the actual benefit for the user was replaced by a marketing need to provide just one more feature than the previous generation of devices, or to push the envelope such that people need faster phones to get basically the same level of responsiveness.

Don’t get me wrong, I’m a big fan of smartphones in general, and there’s a lot I find very useful. At the same time, there are things like the abysmal battery life which I just try to ignore, because it would annoy me massively if I thought about it for too long (do you remember when you used to have a full day to find a charger after the phone indicated it was about to run out of juice?).

We’ve seen tremendous technological progress in the last 100 years or so, and there were some great products which differentiated themselves from the status quo by giving new functionality to large numbers of people. For example, take transatlantic phone lines. The first transatlantic telephone cable had only 36 channels, meaning that only a small number of people could use the lines at a time, and connections had to be set up manually through operators. The first transcontinental call from New York City to San Francisco took 23 minutes to set up. Nowadays we can just dial any number anywhere in the world and get connected. That’s what I call progress and a great product. Before, only corporations and a small number of privileged people could make long-distance calls, but now virtually anyone can stay connected with family and friends abroad.

Now take the Google Chromebook.

I can partly understand why Google built the Chromebook. Because they could. Because all those Internet companies somehow believe that users will just trust them with all their data. Because they wanted to achieve a much higher level of lock-in for their customers. Maybe because they actually believed that the end user would benefit from the vast processing power of Google’s army of servers.

But let’s face it, the most peculiar thing about the Google Chromebook is that it is inferior to a normal notebook in every aspect except the price. It’s basically saying, “OK, you get a computer for 100 bucks less than a real computer, but it just works worse than a real one. Oh, and we get to keep all your data. Unless we change our minds, in which case you have 3 months to download your stuff before we throw it away.” Don’t say Google wouldn’t dare to do that: that is exactly what happened with Google Reader and Google Latitude.

Price alone is not a good differentiator for all products; it works mostly for those where the product itself is already very standardized (like a litre of milk. Oh sorry, a gallon. No, that doesn’t sound right either. What’s the usual quantity for milk in the US?).

In a way, we programmers are also always product builders, and much too easily we just heap feature after feature onto our projects, just to differentiate ourselves from the competition, or because we can. But I think the real test is whether we build something which enables a sufficiently large set of people to do something they couldn’t do before.

This is what I’ll try to keep in mind for 2014.

Mark Zuckerberg at NIPS - my brain melts.

So this actually happened: Mark Zuckerberg, CEO of Facebook, attended the annual NIPS conference in Lake Tahoe, Nevada. I didn’t attend, but from what I gathered from the social networks, it really happened. He participated in a panel discussion at the deep learning workshop and was present at a Facebook party, where it was announced that Yann LeCun will head the new AI lab of Facebook.

I think Mark Zuckerberg attending is pretty remarkable. The hype around deep learning has increased this year, with stories like Google image search being mostly powered by deep learning, or Google’s acquisition of the Toronto-based deep learning startup DNNresearch.

The media has picked up on Deep Learning just the way it did with Big Data and Data Science, occasionally leading to articles like this one claiming that the methods are too smart even for engineers to understand. If you have even a little experience with machine learning, you know that this is not the exception, it’s actually the rule. While you will hopefully have understood how the learning algorithm works, the details of how an SVM or a deep network arrives at its predictions are hidden behind thousands of elementary computations.
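To make that point concrete, here is a minimal sketch of my own (using scikit-learn, which is not tied to anything discussed above): even a single prediction of a kernel SVM unfolds into a weighted sum of kernel evaluations over all support vectors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Train a small RBF-kernel SVM on synthetic data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

# A single prediction is a sum over all support vectors:
# f(x) = sum_i dual_coef_i * k(sv_i, x) + intercept
x = X[0]
k = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
decision = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

print(clf.support_vectors_.shape[0], "support vectors contribute to this one decision")
print(np.isclose(decision, clf.decision_function(x.reshape(1, -1))[0]))  # same value
```

Even for this toy model, "understanding" one prediction means walking through hundreds of kernel evaluations; for a deep network it is orders of magnitude more.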

Don’t get me wrong, I think that any kind of publicity which informs the general public of the importance of machine learning and data analysis is good, even if it is sometimes misinformed.

Still, it’s interesting to see how NIPS has transformed over the years. Originally founded in 1987, it began as a very neuro-centric conference (as was the fashion back then). The name spelled out, Neural Information Processing Systems, still hints at that fact. For the first 15 years or so, biological neural processing systems were still a focus, with papers reporting on actual biological findings. The conference has always been single track (as opposed to ICML, for example), which meant that scoring a talk would guarantee quite an impact for your work. NIPS has always been very competitive, with acceptance rates well below 20%, and its extensive poster sessions are legendary, running well past midnight with frustrated hotel staff turning off the air conditioning in an attempt to drive the attendees out.

The first time I attended NIPS was in 2001. It really was a different time back then: there was no wifi, and I didn’t even have a laptop. NIPS was always pretty huge, but it was also always very academic, with mathematically challenging presentations, always trying to push the envelope further. From year to year the fashion changed, from kernel learning to non-parametric Bayes, and now back to deep learning, coming full circle in a way.

It’s probably inaccurate to say that NIPS was only academic, because there were always strong research labs within bigger companies, for example the AT&T Research Lab (where Yann LeCun used to work, too), or sites like Microsoft Research in Cambridge. There are also many who worked in industry but were integral parts of the community, like (and this is really just a very short list of people who come to mind) Corinna Cortes and Samy Bengio (Google), John Langford (formerly Yahoo), Alex Smola (first Yahoo, then Google, now back to being a professor at CMU), or people like Ralf Herbrich who moved from Microsoft Research to Facebook and now to Amazon.

At the same time, I always had the feeling that the Big Data and Data Science hype certainly did not originate at NIPS but was more like an external event to which the NIPS community first had to adapt. Part of the reason is that Big Data is a lot about infrastructure and databases, whereas NIPS always focused more on algorithms. So while we’ve certainly been doing and thinking about large-scale learning, we did it not by massively scaling out, but by first finding algorithms which were able to deal with more data.

Actually, if I recall correctly, back in 2006 when the paper “Map-Reduce for Machine Learning on Multicore” was presented, halfway through the talk none other than the Yann LeCun who will now head the AI lab at Facebook took the microphone to say that he didn’t consider their implementation of neural network learning properly scaled out, because they were “just doing microbatches”. Of course, he was right.

Still, the relevance of what the NIPS community was doing was undeniable, in particular because many of its members became those who engineered complex data analysis systems based on Big Data infrastructure. Just to give an example, Samy Bengio was responsible for an overhaul of Google Image search long before deep learning became mainstream again.

The last year has seen quite some investment in machine learning. I already mentioned the deep learning related acquisitions. Then there is the new Amazon machine learning lab in Berlin headed by Ralf Herbrich, and now the new AI lab at Facebook. Then there are smaller companies like the Berlin-based Zalando (a fashion retailer), who have practically hired a year’s worth of fresh Ph.D.s from our lab alone.

So Mark Zuckerberg attending is in part a logical consequence of what has happened over the last few years, but it also crosses a threshold, because he is no longer “one of those ML guys in the back” but one of the most influential people in Silicon Valley.

What this means, I don’t know, but something is happening. My brain melts.

Edit Dec 11, 2013: Added small paragraph mentioning new Amazon and Facebook labs.

How Python became the language of choice for data science

Nowadays, Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on small and medium-sized data sets. And rightly so, I think, given the large number of available tools (just look at the list at the top of this article).

However, it wasn’t always like this. In fact, when I started working on my Ph.D. back in 2000, virtually everyone was using Matlab for this. And again, rightly so. Matlab was very well suited to quickly prototyping linear algebra and matrix stuff, came with a nice set of visualizations, and even let you do some text mining and file parsing if you really needed it to.

The problem was, however, that Matlab was and is very expensive. A single license costs a few thousand Euros, and each toolbox costs another few thousand Euros. However, Matlab was always very cheap for universities, which made perfect sense: that way, students could be trained in Matlab so that they would already know how to use it to solve problems, and companies would then be willing to pay for the licenses.

All of this changed significantly in 2005 or so. At that time I was working at the Fraunhofer Institute FIRST, which belongs to a group of German publicly funded research institutes focused on applied research. Originally, Fraunhofer institutes could get the same cheap licenses, but then MathWorks changed their policies such that you could only get the university rate if you were an institution which hands out degrees.

This did not hold for most publicly funded research institutes all over the world, like the Max Planck Institutes (such as the one in Tübingen where Bernhard Schölkopf is), or NICTA in Australia, where Alex Smola and others were working at the time. So we decided something had to change, and we started looking for alternatives.

Python was clearly one of the possible choices, but at the time other options seemed possible as well. For example, Octave had been around for a long time, and people wondered whether one shouldn’t just help make Octave as good as Matlab and fix all remaining compatibility issues. Together with Stefan Harmeling, I started fantasizing about a new programming language dubbed rhabarber, which would even allow the syntax to be extended dynamically, so that you could have true matrix literals (or other things). Later I would play around with JRuby as a basis, because it allowed better integration with Java to write high-performance code where necessary (instead of doing painful low-level stuff with C and SWIG).

If I remember correctly, the general consensus even back then was that Python would be the language of choice. I think early versions of numpy already existed, as well as early versions of matplotlib. Shogun, which had been developed and used extensively in our lab, had already begun to provide Python bindings, and so on.

I personally always felt (and still feel) that there are things where Matlab is still superior to Python. Matlab was always a quite dynamic environment, because you could edit files and it would reload them automatically. Python is also somewhat restrictive with what you can say on a single line. In Matlab you would often load some data, start editing the functions, and build up your data analysis step by step, while in Python you tend to have files which you start from the command line (or at least that’s how I tend to do it).

In any case, early on there was also the understanding that we should focus our efforts on a single project and not have the work scattered over several independent projects, so we planned a workshop on this at NIPS 2005, but unfortunately the workshop was rejected. However, engagement was so high that we just rented a seminar room in the same hotel where NIPS was going to be held, notified all the people we thought would be relevant, and held the Machine Learning Tools Satellite Workshop on the Sunday before the conference.

The hot contender back then was the Elefant toolbox designed by Alex Smola and collaborators, which was a pretty ambitious project. The idea was to use PETSc as the numerical back end. PETSc was developed in the area of large-scale numerical simulation and had a number of pretty advanced features, like distributed matrices and similar things. I think, ultimately, it might have been a bit too advanced: simple things like creating a matrix were already quite complicated.
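To give a feel for what I mean, here is a small sketch contrasting the two styles. It uses petsc4py for the PETSc side and is only meant to illustrate the flavor of the API, not what Elefant actually looked like:

```python
import numpy as np
from petsc4py import PETSc

# numpy: a small matrix is a one-liner
A = np.zeros((3, 3))
A[0, 0] = 1.0

# PETSc (via petsc4py): the same thing takes a small ceremony of calls,
# because the library is built for distributed, large-scale problems.
B = PETSc.Mat().create()
B.setSizes((3, 3))
B.setType("aij")       # sparse AIJ storage format
B.setUp()
B.setValue(0, 0, 1.0)
B.assemblyBegin()
B.assemblyEnd()
```

That generality is exactly what makes PETSc powerful for huge simulations, but it is a lot of machinery for someone who just wants to prototype a learning algorithm.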

I also gave a talk together with Stefan on rhabarber, but most people were skeptical whether a new language was really the right way to go, as Python seemed good enough. In any case, things really started to get going around that time, and people were starting to build stuff based on Python. Humans are always hungry for social proof, and having that one-day meeting with a bunch of people from the same community gave everyone the confidence that they wouldn’t be left alone with Python.

A year later, we finally had our first Machine Learning Open Source Software workshop, which eventually led to the creation of the MLOSS track over at JMLR, in an attempt to give scientists a better incentive to publish their software. We had several iterations of the workshop, had Travis Oliphant give an intro to numpy, invited John Hunter, the main author of matplotlib, who sadly passed away last year, as well as John W. Eaton, the main author of Octave, and there is also a new workshop at this year’s NIPS (although without me). Somehow, the big, open, interoperable framework didn’t emerge, but we’re still trying. Instead, there exist many frameworks which wrap the same basic algorithms and tools again and again.

Eventually, Elefant didn’t make the race, but other toolboxes like scikit-learn became commonplace, and nowadays we luckily have a large body of powerful tools to work with data without having to pay horrendous licensing fees. Other tools like Pandas were created in other communities, and everything came together nicely. I think it’s quite a success story, and having been a minor part of it is nice, although I didn’t directly contribute in terms of software.

Interestingly, I never became that much of a Python enthusiast. I wrote my own stuff in JRuby, which led to the development of jblas, but at some point I started working on real-time analysis, where I just needed better control over my data structures. So nowadays I’m doing most of my work in Scala and Java. Visualization is one area where there are really few alternatives besides Python and probably R. Sure, there’s D3.js, but it’s fairly low-level. I still have dreams of a plotting library where the type of visualization is decoupled from the data (such that you can say “plot this matrix as a scatter plot, or as an image”). Maybe I’ll find the time at some point.
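Just to sketch the kind of API I have in mind (plot_as and its kind argument are made up for illustration; plain matplotlib does the rendering underneath):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_as(matrix, kind="image"):
    """Render the same 2D array either as an image or as a scatter plot."""
    if kind == "image":
        plt.imshow(matrix, interpolation="nearest")
        plt.colorbar()
    elif kind == "scatter":
        rows, cols = np.indices(matrix.shape)
        plt.scatter(cols.ravel(), rows.ravel(), s=50 * np.abs(matrix).ravel())
        plt.gca().invert_yaxis()
    else:
        raise ValueError("unknown kind: %s" % kind)
    plt.show()

M = np.random.randn(10, 10)
plot_as(M, kind="image")    # the same data ...
plot_as(M, kind="scatter")  # ... as a different visualization
```

The point is that the data stays the same and only the name of the visualization changes, instead of every plot type coming with its own way of preparing the data.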

So if you have stories to share (or corrections) on the “early years of Data Science”, I’d love to hear from you.