Wednesday, November 20, 2013
How Python became the language of choice for data science
Nowadays Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on small and medium-sized data sets. And rightly so, I think, given the large number of available tools (just look at the list at the top of this article).
However, it wasn’t always like this. In fact, when I started working on my Ph.D. back in 2000, virtually everyone was using Matlab for this. And again, rightly so. Matlab was very well suited to quickly prototyping linear algebra and matrix stuff, came with a nice set of visualizations, and even allowed you to do some text mining and file parsing if you really needed to.
The problem was, however, that Matlab was and is actually very expensive. A single license costs a few thousand Euros, and each toolbox costs another few thousand Euros. However, Matlab was always very cheap for universities, which made perfect sense: that way, students could be trained in Matlab so that they already knew how to use it to solve problems, and companies would then be willing to pay for the licenses.
All of this changed significantly in 2005 or so. At that time I was working at the Fraunhofer Institute FIRST, which belongs to a group of German publicly funded research institutes focused on applied research. Originally, Fraunhofer institutes could get the same cheap licenses, but then MathWorks changed their policies so that you could only get the university rate if you were an institution that hands out degrees.
This did not hold for most publicly funded research institutes all over the world, like the Max Planck Institutes (such as the one in Tübingen where Bernhard Schölkopf is), or NICTA in Australia, where Alex Smola and others were working at the time. So we decided something had to change and we started looking for alternatives.
Python was clearly one of the possible choices, but at the time other options seemed viable as well. For example, Octave had been around for a long time, and people wondered whether one should not just help its developers make Octave as good as Matlab and fix all remaining compatibility issues. Together with Stefan Harmeling I started fantasizing about a new programming language dubbed rhabarber, which would allow you to extend even the syntax dynamically in order to have true matrix literals (or even other things). Later I would play around with JRuby as a basis because it allowed better integration with Java to write high-performance code where necessary (instead of doing painful low-level stuff with C and SWIG).
If I remember correctly, the general consensus already back then was that Python would be the language of choice. I think early versions of numpy already existed, as well as early versions of matplotlib. Shogun, which had been developed and used extensively in our lab, had already begun to provide Python bindings, and so on.
I personally always felt (and still feel) that there are things where Matlab is still superior to Python. Matlab was always quite a dynamic environment because you could edit files and it would reload them automatically. Python is also somewhat restrictive with what you can say on a single line. In Matlab you would often load some data, start editing the functions, and build your data analysis step by step, while in Python you tend to have files which you start from the command line (or at least that’s how I tend to do it).
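As an aside on the reloading point: Python does not pick up edited files on its own, but a module can be reloaded on request with `importlib.reload` from the standard library. The following is only a minimal sketch of that workflow. The module name `analysis_step` and its `scale` function are hypothetical, and the "edit" is simulated by rewriting a temporary file.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Import a throwaway module, edit its source, and reload it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical module standing in for a data analysis helper file.
        source = Path(tmp) / "analysis_step.py"
        source.write_text("def scale(x):\n    return 2 * x\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("analysis_step")
            before = module.scale(3)

            # Simulate editing the file in an editor. The new source has a
            # different length, so the cached bytecode is not reused even if
            # both writes happen within the same second.
            source.write_text("def scale(x):\n    return 10 * x  # edited\n")

            # Python needs an explicit request to pick up the change.
            module = importlib.reload(module)
            after = module.scale(3)
        finally:
            # Clean up so the temporary module does not linger.
            sys.path.remove(tmp)
            sys.modules.pop("analysis_step", None)
    return before, after


if __name__ == "__main__":
    print(demo_reload())
```

In an interactive session one would call `importlib.reload` by hand after each edit. Note that names imported earlier with `from module import name` keep pointing at the old objects, which is part of why this feels less seamless than the Matlab workflow.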
In any case, early on there was also the understanding that we should focus our efforts on a single project and not have the work scattered over several independent projects, so we planned a workshop at NIPS 2005 on this. Unfortunately, the workshop was rejected. However, engagement was so high that we just rented a seminar room in the hotel where NIPS was going to be held, notified all the people we thought would be relevant, and held the Machine Learning Tools Satellite Workshop on the Sunday before the conference.
The hot contender back then was the Elefant toolbox designed by Alex Smola and collaborators, which was a pretty ambitious project. The idea was to use PETSc as the numerical back end. PETSc was developed in the area of large-scale numerical simulations and had a number of pretty advanced features like distributed matrices and similar things. I think ultimately it might have been a bit too advanced. Simple things like creating a matrix were already quite complicated.
I also gave a talk together with Stefan on rhabarber, but most people were skeptical about whether a new language was really the right way to go, as Python seemed good enough. In any case, things really started to get going around that time and people were starting to build stuff based on Python. Humans are always hungry for social proof, and having that one-day meeting with a bunch of people from the same community gave everyone the confidence that they wouldn’t be left alone with Python.
A year later, we finally had our first Machine Learning Open Source Workshop, which eventually led to the creation of the MLOSS track over at JMLR in an attempt to give scientists a better incentive to publish their software. We had several iterations of our workshop, had Travis Oliphant give an intro to numpy, and invited John Hunter, the main author of matplotlib, who sadly passed away last year, as well as John W. Eaton, main author of Octave. There is also a new workshop at this year’s NIPS (although without me). Somehow, the big, open, interoperable framework didn’t emerge, but we’re still trying. Instead there exist many frameworks which wrap the same basic algorithms and tools again and again.
Eventually, Elefant didn’t make it, but other toolboxes like scikit-learn became commonplace, and nowadays we luckily have a large body of powerful tools to work with data, without having to pay horrendous licensing fees. Other tools like Pandas were created in other communities and everything came together nicely. I think it’s quite a success story, and having been a minor part of it is nice, although I didn’t directly contribute in terms of software.
Interestingly, I never became that much of a Python enthusiast. I wrote my own stuff in JRuby, which led to the development of jblas, but at some point started working on real-time analysis stuff where I just needed better control over my data structures. So nowadays I’m doing most of my work in Scala and Java. Visualization is one area where there are really few alternatives besides Python and probably R. Sure, there’s D3.js, but it’s fairly low-level. I still have dreams of a plotting library where the type of visualization is decoupled from the data (such that you can say “plot this matrix as a scatter plot, or as an image”). Maybe I’ll find the time at some point.
So if you have stories to share (or corrections) on the “early years of Data Science”, I’d love to hear from you.
Posted by Mikio L. Braun at 2013-11-20 17:20:00 +0100
Some Books related to Python and Data Analysis
Here are a few handpicked books you might find interesting. Learning Python is a pretty comprehensive book on Python, probably too much if you're just interested in data analysis. Python for Data Analysis deals with all the main libraries, including Pandas and matplotlib. The IPython book is by the creators of IPython themselves, so buying it will hopefully give them some support, too. Finally, Visualize This is a general book on different kinds of data visualization, and deals not only with Python but with other tools as well.