Command Line Interactive Machine Learning on the JVM. Part 1: Why?
Just to get this out of the way first, no, I'm not going to talk about Weka, and also not about Mahout. I'm talking about an environment similar to matlab or scipy where you get a command line to do some data analysis. The basic idea is to build something similar using one of the available scripting languages for the JVM.
Ever since the Machine Learning Tools Satellite Workshop in 2005 (website seems to be gone for good), I've been interested in an open source alternative to matlab that is based on a modern scripting language, so that you can structure your code better. For the last few years, I've been interested in building something like this based on the Java Virtual Machine. Here is what I've learned so far. Unfortunately, there is no clear solution yet. As this is probably going to be quite long, I'll split the article over several posts. In this article, I'll talk about why I think one should try to build something like that on the JVM.
But why?
Okay, why would you want to do this? After all, scipy has been around for quite a while and comes with an excellent plotting library called matplotlib.
First of all, there is the issue of speed. As long as you work with the already existing matrix libraries, everything is fine, but when you are starting to work with a data type for which a native library isn't available, performance will be a huge issue.
Of course, you can always go ahead and write your own native extension to, for example, python. But even if you use a tool like swig to take care of the tedious job of generating wrapper code, this process is still quite complicated, as you have to think about garbage collection, mapping python types to C types, and so on. For machine learning people who are not computer scientists by training, this step might be too hard.
On the other hand, for most scripting languages for the JVM like JRuby, or Groovy, accessing Java code is straightforward. This still means that you have to master two languages, but the interface between the two languages is direct, and you don't have to worry about garbage collection, or mapping types, or wrapper code.
And for those who still believe that "Java is always slower than C by a factor of two": this no longer holds as a general rule. When it comes to for-loops that just perform some numerical computations, Java is more or less on par with C (see my post on some benchmarks for matrix operations), even though array access is bounds-checked in Java (ok, unless you start to do some loop unrolling by hand, and so on). In addition, the JVM probably comes with one of the best garbage collectors in existence, and it can literally run for weeks without running into memory leaks or other kinds of problems.
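To make the "for-loops just performing some numerical computations" case concrete, here is a minimal sketch of such a kernel in plain Java. The class and method names are made up for illustration. It makes no claim about actual timings, it only shows the kind of code meant here: a tight loop over primitive arrays with no wrapper code and no manual memory management.

```java
// Hypothetical helper class; the name and method are illustrative only.
final class DotProduct {
    // A plain for-loop over primitive arrays. Every array access is
    // bounds-checked by the JVM, so an out-of-range index raises an
    // exception instead of reading arbitrary memory.
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(dot(x, y)); // 1*4 + 2*5 + 3*6, prints 32.0
    }
}
```

A class like this is also all a JVM scripting language needs in order to call the method directly, which is the "easier extensions" point made above.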
Next, there is the issue of multicore. Most processors sold today come with more than one core, opening up the possibility of gaining better performance by parallelizing code. Most scripting languages, including python, have only limited support for true multicore. Python has native threads, but also the so-called Global Interpreter Lock, meaning that only a single thread can run in the interpreter at any given time. The Java virtual machine, on the other hand, fully supports multicore programming. As a consequence, newer paradigms for multicore programming like the actor model or software transactional memory are not available in python (or matlab, for that matter), but they are on the JVM (see for example akka, or clojure's refs).
Finally, there is an incredible number of high-quality software projects for the JVM, in particular in enterprise-related areas like databases, networking, web frameworks, or infrastructure. So if you eventually need to interface with an existing enterprise environment, chances are very high that it will be running on the JVM.
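As a small illustration of threads running truly in parallel on the JVM, here is a sketch that spreads a computation over the available cores. It assumes Java 8 or later and uses parallel streams, which is just one of several options (plain threads, executors, or the actor and STM libraries mentioned above would do as well). The class name is hypothetical, and whether this actually gets faster depends on the size of the workload.

```java
import java.util.stream.LongStream;

// Hypothetical example class; assumes Java 8+ for the streams API.
final class ParallelSumOfSquares {
    // Sums i*i for i = 1..n. The parallel() call lets the JVM split the
    // range across worker threads that can run on several cores at once.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(i -> i * i)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(Runtime.getRuntime().availableProcessors() + " cores available");
        // Closed form n(n+1)(2n+1)/6 gives 333833500 for n = 1000.
        System.out.println(sumOfSquares(1000));
    }
}
```

Integer arithmetic is used on purpose: the result does not depend on the order in which the partial sums are combined, which would not be guaranteed for floating point.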
There are probably more reasons, but for me, the main ones are better scripting languages (compared to matlab), easier extensions in Java, multicore programming, performance and stability, and existing libraries.
In the next post I'll discuss the basic ingredients necessary.
Comments (8)
Nice post. I have been experimenting with Jython for using Java libraries with scripting capabilities of Python. Coupled with IPython as a frontend, the experience is similar to Matlab, though with all mentioned advantages. Just my two cents. Konrad
I didn't know ipython also worked with Jython... But I suppose plotting doesn't work, or does it?
Hi Mikio, we've been doing this in Clojure for over a year. Initially here: http://github.com/liebke/in... ... now here: http://github.com/bradford/.... I recently did some jvm matrix benchmarking and noticed you do jblas - nice work. http://measuringmeasures.co...
Hello Bradford!
Thanks for the link. I didn't know about your project, but I planned to say a few words about clojure as a scripting option.
If you want, you can register your project at http://mloss.org to make it more known to other machine learning people. It is an open directory of machine learning related open source software projects.
-M
Dr Braun, apologies for asking a question about this now when your thoughts on this have probably evolved considerably. I would tend to agree with your goals and justifications, but here are some counterclaims I might expect from the passionate and productive Python/scipy community. 1) Ctypes makes it easy to interface with native libraries (BLAS/LAPACK, FFTW, ARPACK for eigenproblems, GSL, MPFR), just as much as JNAerator for JVM languages. 2) The multiprocessing module makes it so easy to handle both multi-core and multi-node scenarios. Finally, 3) many people understand how Cython speeds up their code (the Sage folks are big proponents of this, justifiably so) and are more willing to trust it than JVM JIT black magic for obtaining nominally-native performance.
Hi Ahmed,
first off, I'm totally fine with everyone sticking to the technology they are most comfortable with. Also, the whole python infrastructure has evolved a lot, in particular in the direction of ease of adding native code, libraries for scientific computing, visualization, etc., so I consider it to be a good choice.
I've been working for quite some time now with Scala, doing real-time analysis of social media streams. So it's almost no linear algebra, but a lot of specialized data structures, which are often combinations of ordinary maps, trees, and doubly linked lists. Now it is true that you can probably also do that in python and C, but it's so much nicer when you don't have to drop a few language levels just because you want to make it fast.
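To give an idea of what such a combined structure can look like, here is a minimal sketch in Java (not the actual code from that project, and all names are hypothetical). It counts occurrences of keys but keeps only a bounded number of them, evicting the least recently seen key. It relies on java.util.LinkedHashMap, which is itself a hash map combined with a doubly linked list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example: a bounded counter that forgets the least
// recently seen key once the capacity is exceeded.
final class RecentCounts {
    private final LinkedHashMap<String, Integer> counts;

    RecentCounts(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most
        // recently accessed; removeEldestEntry evicts on overflow.
        this.counts = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > capacity;
            }
        };
    }

    // Increments the count for key and marks it as most recently seen.
    void observe(String key) {
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    // Returns the current count, or 0 if the key is unknown or was evicted.
    // Note that in access-order mode this lookup also counts as an access.
    int count(String key) {
        return counts.getOrDefault(key, 0);
    }

    int size() {
        return counts.size();
    }
}
```

With a capacity of 2, observing "a", "b", "a", "c" in that order evicts "b", because "a" was seen again more recently.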
Also, you may think what you want about the JVM ("JIT black magic" - hehe), but in the kind of server application we run, there is just so much existing infrastructure to deploy, and monitor your system that I would always choose the JVM over Java.
Finally, you will probably disagree with me, but I've found static typing pretty helpful in getting to a clearer picture of what you want. Duck typing is a neat idea, but actually spelling out the implicit interface can be an important step in getting a better understanding of what you're doing.
In any case, stick with what you feel comfortable with, but don't believe all those people telling you JVM is evil ;)
-M
Thank you for your comments! Your insights and experience with large and heterogeneous software are invaluable to me, as a Matlab/Python/R refugee seeking a new home. You're one among many thoughtful, multi-paradigm programmers who've vouched for the JVM's quality, and I believe you! I'm seeking a corner of the JVM world where the overhead and infrastructure of the enterprise can be brought to bear when needed and completely avoided when unneeded.
(Julia looks very interesting!)
Hey Ahmed,
Glad to help. ;)
There are a few concepts and tools you need to understand, for example the notion of classpath (which is something I've never seen in another environment), or maven. It's kinda bad and wonky, but once you know how to use it (or have found the right XML snippets to do what you want), it works well and everyone else is using it, too. You should also acquaint yourself with an IDE. We've been using IntelliJ IDEA for some time. They have a community version which comes for free and also supports Scala (in case you want to tinker around with that ;)), but netbeans and eclipse are the other canonical choices. I wasn't a huge friend of IDEs (and still believe you should know how to do everything on the command line, at least in principle), but IDEA integrates quite nicely with maven, and even helps you set the right classpath in many circumstances, such that you get the best of both worlds. But you should stay away from IDE-specific build things (for example, pure eclipse projects you can only build in eclipse).
There is also this question on Quora you might find helpful: http://www.quora.com/Scala/...
Good luck!
-M