Marginally Interesting: Command Line Interactive Machine Learning on the JVM. Part 2: JRuby and Scala

This is Part 2 of a series. Previous post is here.

In order to build an interactive environment like SciPy on the JVM, we need the following basic ingredients:

1. an interactive shell and a scripting language simple enough for rapid prototyping
2. a fast numeric and matrix library
3. a plotting tool

Concerning matrices, a number of libraries provide the required functionality, both in pure Java and based on native code. See the pointers collected by the java-matrix-benchmark project, as well as my own jblas library.

In this post, I’d like to concentrate on point 1 and discuss two alternatives which fit the requirements quite well. Which one you choose probably depends a lot on what you actually want to do, what your background is, and whether there are some libraries you are particularly interested in.

JRuby - dynamic scripting

JRuby is a reimplementation of the Ruby language in Java. Ruby is probably best known for the Ruby on Rails web framework, which completely changed expectations for web frameworks through its simplicity. Ruby is a dynamically typed, object-oriented scripting language similar to Python: you do not specify the types of your variables, and whether an operation works or not is only checked when the program is actually run.
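To make the "only checked at run time" point concrete, here is a minimal sketch in plain Ruby. The method name `double` is made up for illustration; nothing here is specific to JRuby.

```ruby
# No parameter or return types are declared anywhere.
def double(x)
  x * 2
end

puts double(21)     # works: Integer defines *
puts double("ab")   # works too: String defines * as repetition

begin
  double(nil)       # nil has no * method, which is only noticed when this line runs
rescue NoMethodError => e
  puts "failed at runtime: #{e.class}"
end
```

The same method accepts anything that responds to `*`, and the mistake with `nil` surfaces only when that call is executed.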

I prefer Ruby to Python on the JVM for two reasons. First, JRuby is a much more active project. Its main programmer, Charles Nutter, is incredibly productive. Jython, on the other hand, also seems pretty mature, but somehow lacks that extra drive. Second, Ruby is more expressive, in particular on the command line. Not every Python expression can be reformatted on a single line, which is probably good for readability of source code, but becomes a problem when you’re trying things out at the command line.
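As a rough illustration of what fits on one line in Ruby, the sketch below defines a method and a loop with a conditional body, each as a single line that could be typed at an interactive prompt. The names `rms` and `signed_squares` are hypothetical and chosen only for this example.

```ruby
# A complete method definition on one line: root mean square of a list.
def rms(xs); Math.sqrt(xs.inject(0.0) { |s, x| s + x * x } / xs.size); end

# A loop with an if/else body, still on one line.
def signed_squares(n); (1..n).map { |i| if i.even? then i * i else -i end }; end

puts rms([3, 4])
puts signed_squares(5).inspect
```

In Python, the equivalent `def` with a nested `if`/`else` block would normally need several indented lines.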

One nice aspect of JRuby is that you can add methods to Java classes. That can be used to make Java classes more Ruby-like, or to add operators as syntactic sugar.
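The mechanism behind this is Ruby’s open classes. The sketch below runs on plain Ruby and uses a small stand-in class, `Vec`, in place of a real Java class; the class and its Java-style method names (`add`, `getLength`) are hypothetical. On JRuby, I would expect the same reopening to work for an imported Java class, but treat that as an assumption and check it against the JRuby documentation.

```ruby
# Stand-in for a class defined elsewhere, with Java-style method names.
# (Vec, add and getLength are hypothetical names used only for this sketch.)
class Vec
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def add(other)
    Vec.new(data.zip(other.data).map { |a, b| a + b })
  end

  def getLength
    data.length
  end
end

# Later, reopen the class and add Ruby-style sugar without touching the original.
class Vec
  # Operator as syntactic sugar for the existing add method.
  def +(other)
    add(other)
  end

  # Ruby-like name for the Java-style getter.
  alias_method :size, :getLength
end

v = Vec.new([1, 2, 3]) + Vec.new([10, 20, 30])
puts v.data.inspect
puts v.size
```

The second `class Vec` block does not replace the first; it only adds methods, which is what makes this a convenient way to wrap an existing library.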

Finally, another very interesting aspect of JRuby is that it is closely tied to the Ruby community, which has its own peculiarities, giving hardcore Java programmers a fresh perspective, for example, on how to design APIs. Ruby projects generally tend to have simpler interfaces (and, unfortunately, insufficient documentation, too), while Java APIs often force you to remember a few dozen classes to do even simple things.

I’ve been playing around with my own little project to build a shell around JRuby called Marge, which I hope to release at some point. It is based on my JRuby wrapper for jblas, and also supports other interesting features, like automatic reloading of files which have changed (the fact that Ruby is interpreted adds some extra flexibility here).

The downside of JRuby is that it might not be fast enough when it comes to number crunching. Although JRuby is eventually compiled to Java bytecode, the code is still more or less dynamically typed, meaning that it won’t get as fast as pure Java code. In addition, compiling JRuby code to real Java classes (taking “normal” Java arguments) is a feature still under development, so you cannot easily reuse components written in JRuby from Java.

This brings us to the next language, Scala.

Scala - one to rule them all?

Scala is something like an Uber-Java which at the same time tries to be more script-like. It adds a bit of type inference, so that you don’t have the tedious repetition of type information found in Java (Map<String,String> m = new HashMap<String,String>()), while adding a lot more flexibility. For example, Scala has mix-ins (interfaces with partial implementations), operator overloading, and closures (anonymous functions). In addition, Scala supports a more functional programming style emphasizing immutable data, which makes concurrent programming much easier in some cases. Scala also provides implicit conversions, even for the receiver of a method call, which basically gives you the ability to add methods to existing classes, just as in Ruby.

An important design choice of Scala is that it comes with both an interactive command-line mode (with incredibly slow startup times) and a script mode, which is a special way to parse code with relaxed rules. This allows you to quickly play around with some ideas without having to set up a multi-file project as you would in Java.

For me, the most interesting aspect of Scala as the basis for a machine learning environment is that one language might be all you need, from the high-level scripting down to the actual number crunching. Scala tries to use primitive types wherever possible to support this. Although some work might still be necessary, this is a very promising prospect.

There already exists a project called Scalala which provides some matrix capabilities. As far as I can tell, it relies on pure Scala, and won’t be as fast as, for example, my own jblas.

Other options

There exist many more scripting languages for the JVM, each with their own community and specialities.

Groovy is a language which is quite similar to Ruby. Actually, I feel that it is probably a bit too similar to Ruby to consider now that I’ve started with JRuby. One nice feature of Groovy is that a lot of work has gone into cleaning up existing Java classes, for example, by providing consistent ways to get the size of an object (instead of choosing between length, size(), getLength(), etc.). We’re using Groovy’s Grails web framework for our twimpact site.
Clojure is another language for the JVM. It is closely related to Lisp, including the syntax and support for macros. It has some nice ideas, in particular providing the software transactional memory abstraction for mutable data in concurrent programming (meaning that you basically alter data in isolated transactions). However, I was never really able to get used to the notation when programming mainly math. Still, an interesting project. Some people are using Clojure for ML as well; see this post by Mark Reid, or Bradford Cross’s infer.
Project Fortress looks quite interesting in terms of numeric computation for Java. However, as far as I can tell, it is still at quite an early stage, with different features being tried out. Certainly something to keep an eye on, though.

There exist more languages, for example, JavaScript with Rhino. At a minimum, I think you would want good interoperability with Java, and probably support for overloaded operators.

So I think there are several very nice and strong choices for scripting machine learning on the JVM. Unfortunately, some things are also still missing, which I’ll cover in the next post.

Posted by Mikio L. Braun at 2010-04-12 00:00:00 +0000