Just to get this out of the way first, no, I’m not going to talk about Weka, and also not about Mahout. I’m talking about an environment similar to matlab or scipy where you get a command line to do some data analysis. The basic idea is to built something similar using one of the available scripting languages for the JVM.
Ever since the Machine Learning Tools Satellite Workshop in 2005 (website seems to be gone for good), I’ve been interesting in an open source alternative to matlab, which is based on a modern scripting language such that you can structure your code better. For the last few years, I’ve been interested in building something like this based on the Java Virtual Machine. Here is what I’ve learned so far. Unfortunately, there is no clear solution yet. As this is probably going to be quite long, I’ll split the article over several posts. In this article, I’ll talk about why I think one should try to build something like that on the JVM.
Okay, why would you want to do this? After all, scipy has been around for quite some while and comes with an excellent plotting library called matplotlib.
First of all, there is the issue of speed. As long as you work with the already existing matrix libraries, everything is fine, but when you are starting to work with a data type for which a native library isn’t available, performance will be a huge issue.
Of course, you can always go ahead and write your own native extension to, for example, python. But even if you use a tool like swig to take care of the tedious job of generating wrapper code, this process is still quite complicated, as you have to think about garbage collection, mapping python types to C types, and so on. For machine learning people who are not computer scientists by training, this step might be too hard.
On the other hand, for most scripting languages for the JVM like JRuby, or Groovy, accessing Java code is straightforward. This still means that you have to master two languages, but the interface between the two languages is direct, and you don’t have to worry about garbage collection, or mapping types, or wrapper code.
And for those who still believe that “Java is always slower than C by a factor of two”, be told that it does not hold as a general rule anymore. When it comes to for-loops just performing some numerical computations, Java is more or less on par with C (see my post on some benchmark for matrix operations), although array access is safe in Java (ok, unless you start to do some loop unrolling by hand, and so on). On the other hand, the JVM probably comes with one of the best garbage collectors in existence which can literally run for weeks without experiencing memory leaks, or other kind of problems.
Next, there is the issue of multicore. Most processors sold today come with more than one core, opening up the possibility of gaining better performance by parallelizing code. Most scripting languages, including python have only limited support for true multicore. Python has native threads, but also the so-called Global Interpreter Lock, meaning that only a single thread can run in the interpreter at any given time. On the other hand, the Java virtual machine fully supports multicore programming. As a consequence, newer paradigms for multicore programming like the actor model, or software transactional memory, are not available in python (or matlab, for the matter), but on the JVM (see for example, akka, or clojure’s refs).
Finally, there is really an incredible amount of high-class software projects for the JVM in particular in enterprise related areas like databases, networking, web frameworks, or infrastructure. So if you eventually need to interface to any existing enterprise environment, chances are very high that it will be running on the JVM.
There are probably more reasons, but for me, the main ones are better scripting languages (compared to matlab), easier extensions in Java, multicore programming, performance and stability, and existing libraries.
In the next post I’ll discuss the basic ingredients necessary.
Posted by MIkio L. Braun at 2010-04-08 15:03:00 +0000