How Python became the language of choice for data science
Nowadays Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on small and medium-sized data sets. And rightly so, I think, given the large number of available tools (just look at the list at the top of this article).
However, it wasn't always like this. In fact, when I started working on my Ph.D. back in 2000, virtually everyone was using Matlab for this. And again, rightly so: Matlab was very well suited to quickly prototyping linear algebra and matrix code, came with a nice set of visualizations, and even let you do some text mining and file parsing if you really needed it to.
The problem, however, was that Matlab was, and still is, very expensive. A single license costs a few thousand Euros, and each toolbox costs another few thousand. On the other hand, Matlab was always very cheap for universities, which made perfect sense: students would be trained in Matlab, would already know how to solve problems with it when they entered industry, and companies would then be willing to pay for the licenses.
All of this changed significantly around 2005. At that time I was working at the Fraunhofer Institute FIRST, which belongs to a group of German publicly funded research institutes focused on applied research. Originally, Fraunhofer institutes could get the same cheap licenses, but then Mathworks changed its policies such that you could only get the university rate if you were an institution that hands out degrees.
This did not hold for most publicly funded research institutes around the world, such as the Max Planck Institutes (for example the one in Tübingen where Bernhard Schölkopf is) or NICTA in Australia, where Alex Smola and others were working at the time. So we decided something had to change, and we started looking for alternatives.
Python was clearly one of the possible choices, but at the time other options seemed viable as well. For example, Octave had been around for a long time, and people wondered whether one shouldn't just help make Octave as good as Matlab and fix the remaining compatibility issues. Together with Stefan Harmeling I started fantasizing about a new programming language dubbed rhabarber, which would even allow the syntax to be extended dynamically, so that you could have true matrix literals (or other things). Later I would play around with JRuby as a basis, because it allowed better integration with Java for writing high-performance code where necessary (instead of doing painful low-level stuff with C and SWIG).
If I remember correctly, the general consensus even back then was that Python would be the language of choice. I think early versions of numpy already existed, as well as early versions of matplotlib. Shogun, which had been developed and used extensively in our lab, had already begun to provide Python bindings, and so on.
I personally always felt (and still feel) that there are things where Matlab is still superior to Python. Matlab was always a quite dynamic environment, because you could edit files and it would reload them automatically. Python is also somewhat restrictive in what you can say on a single line. In Matlab you would often load some data, start editing the functions, and build your data analysis step by step, while in Python you tend to have files which you start from the command line (or at least that's how I tend to do it).
In any case, early on there was also the understanding that we should focus our efforts on a single project and not have the work scattered over several independent ones, so we planned a workshop at NIPS 2005 on this. Unfortunately, the workshop was rejected. However, engagement was so high that we just rented a seminar room in the same hotel where NIPS was going to be held, notified all the people we thought would be relevant, and held the Machine Learning Tools Satellite Workshop on the Sunday before the conference.
The hot contender back then was the Elefant toolbox designed by Alex Smola and collaborators, which was a pretty ambitious project. The idea was to use PETSc as the numerical back end. PETSc was developed in the area of large-scale numerical simulation and had a number of quite advanced features, like distributed matrices and similar things. Ultimately, I think it might have been a bit too advanced: even simple things like creating a matrix were already quite complicated.
I also gave a talk together with Stefan on rhabarber, but most people were skeptical whether a new language was really the right way to go, as Python seemed good enough. In any case, things really started to get going around that time, and people began building stuff on top of Python. Humans are always hungry for social proof, and that one-day meeting with a bunch of people from the same community gave everyone the confidence that they wouldn't be left alone with Python.
A year later, we finally had our first Machine Learning Open Source Software workshop, which eventually led to the creation of the MLOSS track over at JMLR, in an attempt to give scientists a better incentive to publish their software. We had several iterations of the workshop: we had Travis Oliphant give an intro to numpy, invited John Hunter, the main author of matplotlib, who sadly passed away last year, as well as John W. Eaton, the main author of Octave, and there is a new workshop at this year's NIPS (although without me). Somehow the big, open, interoperable framework didn't emerge, but we're still trying. Instead, there exist many frameworks which wrap the same basic algorithms and tools again and again.
Eventually, Elefant didn't win the race, but other toolboxes like scikit-learn became commonplace, and nowadays we luckily have a large body of powerful tools to work with data, without having to pay horrendous licensing fees. Other tools like pandas were created in other communities, and everything came together nicely. I think it's quite a success story, and it is nice to have been a minor part of it, although I didn't directly contribute in terms of software.
Interestingly, I never became that much of a Python enthusiast. I wrote my own stuff in JRuby, which led to the development of jblas, but at some point I started working on real-time analysis, where I just needed better control over my data structures. So nowadays I do most of my work in Scala and Java. Visualization is one area where there are really few alternatives besides Python and probably R. Sure, there's D3.js, but it's fairly low-level. I still have dreams of a plotting library where the type of visualization is decoupled from the data (such that you can say "plot this matrix as a scatter plot, or as an image"). Maybe I'll find the time at some point.
So if you have stories to share (or corrections) on the "early years of Data Science", I'd love to hear from you.
Comments (59)
What I like about Python in contrast to Matlab is that Python is a general-purpose programming language, while Matlab only tries very hard to be(come) one.
This is a huge plus if you have to write something more than isolated functions to calculate X. While other main points like performance (with SciPy) and the learning curve are comparable, Python is still free and open source software, obviously designed by more skilled engineers, with a better standard library, external libraries, yada yada. Most importantly, it is a *free* alternative to Matlab, and it is about time to stop wasting people's tax money on a company whose business model seems to rely on squeezing educational institutions as hard as possible for money.
Fully agree with what you said, although there are a number of programming languages out there which are good general purpose languages and open source. I think what really distinguishes Python today is the very rich set of available tools around data analysis.
As I said, the one thing Matlab was good at was working with data interactively. When loading your data already takes a minute or so, it's very nice to have the data in memory and fiddle with the code without having to reload the data each time to see whether it works. Of course, this is only possible because Matlab is a very simple programming language (mostly functions). I think that technically, reloading is possible in a number of languages (in Ruby I know it is, because you can extend classes afterwards); I don't know how it would work in Python, but then it gets complicated because of the complexity of a real OO language...
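For what it's worth, reloading does work in plain Python via the standard library. A minimal self-contained sketch (the module name `analysis` and its `run()` function are made up for illustration; the temp-directory dance is only there to make the example runnable end to end):

```python
# Sketch of "edit and reload" in plain Python, standard library only.
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module file, as a stand-in for code you are editing.
tmp = tempfile.mkdtemp()
mod_path = pathlib.Path(tmp) / "analysis.py"
mod_path.write_text("def run():\n    return 'first version'\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()
import analysis
print(analysis.run())  # first version

# Simulate editing the file on disk, then reload to pick up the change
# without restarting the interpreter (your loaded data stays in memory).
mod_path.write_text("def run():\n    return 'second, edited version'\n")
analysis = importlib.reload(analysis)
print(analysis.run())  # second, edited version
```

The catch the author alludes to is real, though: objects created from the old class definitions are not migrated, so reloading gets hairy once real OO structure is involved.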
As I said elsewhere, I really don't regret that Python became the language of choice, just wanted to tell the story how it was in the beginning ;)
I work with Python every day to interactively load and process data. I use functionality inside Eclipse or PyCharm to interact with IPython. It retains all data in memory, allows you to delete variables if needed, re-run scripts, reload modules you just re-coded, and do all this in one interactive session. The only issue I have had is memory consumption: after a while, RAM usage seems to grow even if you do not allocate more memory, and trying to force a gc does not often help, as you can only ask nicely. But if I understand you correctly: use Python inside a REPL such as IPython and you will have what you need. It will even let you run other script files and open an editor to write some code that you can then dump into the current session.
Looks interesting, thanks for the pointer. IPython notebooks didn't exist back then; it looks like they fill this important (IMHO) hole in the feature set.
Not only that, but the rendering of IPython Notebook files (*.ipynb) makes it amazing to share analyses. For example, here is the rendered Seaborn statistics graphing example: http://nbviewer.ipython.org...
And even more powerful: download that ipynb file, then run `ipython notebook --pylab inline` in the same directory as the downloaded file, and you can directly run their code, in order, and view their results re-rendered.
Re-creatable / sharable data analysis.
Curious, but doesn't IPython Notebook do what you want w/ loading the data once (everything runs in a single kernel) and then working with it? This should work as long as you don't restart the kernel.
You can redefine anything in your notebooks, of course, but you can also use an extension to autoreload modules (or manually delete the module from sys.modules, call reload(), and then reimport). Or are you trying to do something else?
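For reference, the autoreload extension mentioned here is typically enabled like this in an IPython session (a sketch; `mymodule` is a placeholder for your own module):

```
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import mymodule   # edits to mymodule.py are now picked up automatically
```

With `%autoreload 2`, IPython reloads all imported modules before executing each line, which is about as close to the Matlab edit-and-rerun workflow as Python gets.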
The package you're hoping for that does "plot this matrix as a scatter plot" exists. It's called ggplot2 -- and unfortunately, it's a package for R.
It's actually pretty easy to integrate R with Python though (using IPython), and in my opinion it's definitely worth it.
There was recently a ggplot2 port for Python: https://github.com/yhat/ggp...
Looks very promising.
Thanks for letting me know. Some of my co-workers are big fans of ggplot2; I will definitely have a look. Still, I'm hoping more for an interactive data visualization tool. The command line is probably the wrong metaphor here; I was thinking more of something like Google Maps for data ;)
I think IPython (especially in notebook form) is what you need here
Sounds like a great idea for a product. "Interactive data visualization and analysis tool" from the brilliant Braun!
For an interactive data visualization tool, I've been impressed with Plot.ly: https://plot.ly/
Sanketh already mentioned ggplot.py, but there's also Seaborn: https://github.com/mwaskom/...
We are also very close to having exactly what the author is looking for, directly from Python, in Bokeh: http://bokeh.pydata.org
You can check Bokeh which closes a visualization gap for Python: http://bokeh.pydata.org/
You mentioned Scala. What data science toolkit would you use with scala?
Mostly my own handwritten stuff. I don't know of any existing toolkit you could use out of the box. There are a few, like MLbase, ScalaLab, and Breeze (https://github.com/dlwh/bre... ), but I don't have experience with those.
I would use Saddle, made by one of the authors of the Pandas library. Very good, very fast.
I like Python and R
"At some point, I started working on real-time analysis stuff where I just
needed better control over my data structures. So nowadays I’m doing
most my work in Scala and Java."
I am interested to know what makes you reject Python for that, while you already seem to use it for data analysis?
Python doesn't give you very fine-grained control over the way you store your data in advanced data structures like trees, heaps, hash maps, etc. It basically has only lists (which are really more like arrays) and hash maps, plus numerical arrays through numpy. Implementing more advanced stuff, even linked lists, won't give you the performance you need, and I didn't want to implement it in C just so I could keep on using Python. Java comes with a very mature set of collection classes which let you do all that and is fast enough for my needs.
Via Cython you have access to the C++ STL as Python/Cython classes. The C++ STL has many of these convenient data structures, and you can extend it if you need to (for instance, the C++ STL did not have a hash map, but somebody already implemented it).
I think the great power of Python and R is their ability to integrate easily with any other kind of software, whatever the language. The strength of R is its similarity to S, and thus its community. However, Python is by far a much better language to code in than R, and it will prevail as the standard choice in the long run. As somebody already wrote above, Python is not the best at anything, but it is good at almost everything, and that is a huge strength! For instance, try to do symbolic calculation with Java or R! Python is already beating Wolfram Mathematica in many areas of symbolic calculation.
You'll find an echo to your own blog post here: http://www.talyarkoni.org/b...
Two great reads in the same week.
Actually, I know that article and also link to it in the first paragraph. It probably had something to do with my writing up this article ;)
Page was a tad slow - ah the (death)cuddle of Reddit
Luckily this seldom happens ;) On days like this I'm glad that the blog is just a bunch of static files. ;)
You missed julia: http://julialang.org/
Not only Julia; as others pointed out, there's also Clojure, and even Haskell. Exciting times. But if the story around Python has shown us anything, it's that the infrastructure and the set of available toolkits are at least as important as the language.
I think Julia has it all... except for the toolkits. Indeed, it is the toolkits, even more than the infrastructure, that make a language attractive to data science practitioners.
Thanks for sharing your story. This is an exciting time for the Python & Data universe.
You're welcome!
Python is an all-around B+ language. That's not a knock, because that's hard to do. This is a field where even a 'C' grade would refer to something that took a lot of work by some really competent people. Python's not great at anything, but it's good at everything, and that has been a huge win for it. That has also enabled it to build a massive and impressive community.
Where I see a lot of potential, looking forward, is in Clojure. It has the interactivity that a data scientist demands, the easy interoperability with the Java libraries, and the full power of Lisp (because it is a Lisp).
One of the things that I saw recently being said about Python is that "Python is generally the second-best language for everything". Of course, I think it's the best language for some things. :-)
I feel as though IPython Notebook is increasingly lowering the bar for entry into data science with Python. As simple as it is for some of us to use Python in REPL or scripts, the concept of "executable paper" with seamless notes and code integration is really going to help take it even more mainstream. Just point your web browser to a URL and start working. (No SSH, no command line, etc.) Mix up your code and notes as needed, and then trivially export and share with others. It's a simple idea, but a great one.
Great work on jblas. Making Java play well with BLAS is not trivial.
Thanks, Dave, yeah, some things you only want to do once in your lifetime.
Hahaha! I did my time on that one!
All good, but where are the publications?
Which publications?
If you want some notion of the hurdles involved, there is a (well out of date) jlapack paper downloadable from academia.edu.
Hi! I really liked this post; I'm struggling with this topic. I like the Ruby language more than Python, or anything else. I've searched the internet and found that there are two languages with a mature ecosystem for scientific research: Java and Python. So I would give JRuby a try. What do you think about that? The only thing I think I could miss, compared to Python, is the speed gained through Cython. Am I wrong? Could JRuby be faster, or just FAST ENOUGH? Or can I mix JRuby with C? Anyway, if speed were the first priority, why aren't there good enough C or C++ libs? Thanks for your answer.
If you went with JRuby, you wouldn't want to mix it with C, but just use Java for stuff which needs to be fast, and this would definitely be FAST ENOUGH ;)
Here https://github.com/mikiobra... you can still find some wrappers for jblas from Ruby. Not sure whether they still work, but it's definitely worth a try!
Thanks for your post. Very interesting! I still use Matlab almost exclusively. My attempts with Python a couple of years ago were good, but often not good enough. I still like Matlab's simplicity of having functions in files inside folders as the namespace that I can directly edit. This defines the state of the system very clearly (together with 'whos'). Nonetheless, I like Mathematica's notebook idea for interaction, so I should try the IPython notebook. And of course keep on fantasizing about new languages ;) I guess in rhabarber we wanted to put in all the cool stuff we could dream of. Next time we should try to come up with the simplest possible language with only a few features...
Hi Stefan! Nice to have you here ;)
Yeah, we were much younger back then, weren't we? ;)
1. Python is free and is being improved continuously. 2. Matlab is far too expensive and stopped improving many years ago.
Thank you for writing this. I enjoyed reading it and following your links. You are good at writing your thoughts out in an interesting way.
> The IPython book is by the creators of IPython themselves, so buying it will hopefully give them some support, too.
Any more details on this book to locate it for purchase?
Actually, there are a few Amazon links in that box. You might have to whitelist my site on AdBlock to see them ;) (Which is safe, no banner ads so far.)
How do you guys live with the non-strongly typed languages like Python and Ruby? I am doing all my data analysis in Java...
Why? Because when working on a large problem, you build a large system, and then strong-typing and refactoring becomes very important for being able to keep the overall project manageable... The compiler can do so much to ensure there are no mistakes.
I remember as soon as my project grew to more than a few files in Matlab, for example, it was such a pain to continue improving it from the programming point of view... How do you manage to keep large projects together with Python etc?
I personally also very much favor building larger systems (or systems which require high performance) in typed languages. Some people find the mandatory type information annoying, but I usually find that it also helps to clarify the structure of the data in one's mind. Besides it's also useful documentation, in particular compared to passing around hand-built maps and arrays in dynamic languages.
That aside, I think for interactive exploration and prototyping, dynamic languages are pretty perfect, in particular if there's a strong community which provides all kinds of visualization toolsets, etc.
But I think as soon as you move to production you should switch to a compiled, typed programming language.
I agree with you. So it looks like the whole Python infrastructure is basically aiming at creating an open source version of Matlab. Almost there actually. However, I was hoping that Python would actually allow moving research code into production much smoother, so that you wouldn't have to re-implement things...
I realize this is an old thread, but just wanted to say I agree with this. If the final system is going to be implemented in a compiled language anyways, it seems like extra overhead to first write everything in Python, then rewrite it in e.g. Java and have to check all the functionality is working as expected from scratch.
I think you are mixing something up: Python is strongly typed. Even more so (in contrast to Java): when you have a container (e.g. a list) of arbitrary objects, you know precisely what each element's type is -- in Java you have to guess around.
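A quick sketch of what "strongly typed" means here (purely illustrative):

```python
# Python is strongly typed: values carry their types at runtime, and
# implicit cross-type operations fail loudly instead of silently coercing.
try:
    result = "1" + 1          # str + int raises; it is neither "11" nor 2
except TypeError as exc:
    print("caught:", exc)

# In a heterogeneous container, each element knows its precise type:
items = [1, "two", 3.0]
print([type(x).__name__ for x in items])  # ['int', 'str', 'float']
```

What Python lacks is *static* typing, i.e. checks before the program runs, which is the distinction the following replies pick up on.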
The key technique you are probably missing is automated testing. E.g. there is a good reason why all larger Python libs on GitHub have a ".travis.yml" file :-)
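As a minimal sketch of that kind of automated testing, using only the standard library's unittest module (the `mean` function is a made-up stand-in for real analysis code):

```python
import unittest

def mean(xs):
    """Arithmetic mean; raises ZeroDivisionError on an empty sequence."""
    return sum(xs) / len(xs)

class TestMean(unittest.TestCase):
    def test_simple_average(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_raises(self):
        with self.assertRaises(ZeroDivisionError):
            mean([])

# Run the suite programmatically; normally you would just run
# `python -m unittest` from the command line (or let CI do it).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())  # all tests passed: True
```

A CI service like Travis then simply runs such a suite on every push, which is what catches the kind of mistakes a static compiler would in typed languages.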
Hi Harald,
I tend to disagree with you on the "having to guess around" part in Java. Just like in Python, you can get the type of each object at runtime.
The main difference between a language like Python and Java is that Java is statically typed, so there is some amount of type checking done at compile time which allows you to catch some errors. Also, type annotations on method parameters gives you some extra documentation.
What readrz was talking about was that this extra information helps IDEs refactor your code or spot some errors early on.
Now, it's not as clear-cut as some people would like it to be, of course. Even with static typing you can get runtime errors, and just because you are not using static typing does not mean everything will deteriorate into a mess. And yes, unit tests are a huge step towards ensuring that.
On the other hand, when writing large systems, having type annotations helps, just as unit testing does.
-M
I've been using SPSS Statistics / Modeler for data prep, analysis, and modelling. This tool is awesome for data science and saves a lot of time. However, I feel that for people learning data analysis, using R and Python (pandas, IPython, numpy, etc.) is better, because it makes you think more about what's happening under the hood.
Python has an extremely rich and healthy ecosystem of data science tools.
I visualize all my data with Creately. It's an online diagramming and collaboration app.
Python is cool
I am working in human resources and want to move into data science. Can you please suggest what the very first step should be? I also need to start learning Python; please suggest books and material.
I am a beginner.
Thanks in advance.