This is Part 3 of a series. Part 1, Part 2.
In Part 2 I discussed different options for a scripting language for a command-line environment to do machine learning data analysis. In this final part, I want to mention the two areas where I currently see the most need for improvement.
You need some minimal editing capabilities on the command line to be productive. The most well-known project seems to be JLine. It is used by practically all JVM scripting languages for their shells, for example, JRuby, Groovy, and Scala. There exists an interface to readline from Java, but GNU readline is deliberately distributed under the GPL, which is quite incompatible with less restrictive licenses like BSD.
However, in its current form, JLine is quite buggy. Most importantly, it lacks the convenient “Search Backward in History” feature which I use a lot to find lines in the history. I and many others have forked JLine to clean up the code base and add features. For example, I’ve added the search facility (it works ok on Linux, but try at your own risk ;) ), while Jason Dillon has cleaned up the code base significantly.
Still, JLine is actually quite a hack. It uses the stty command to control the terminal, meaning that it integrates quite poorly with changes of the terminal window size, or signals. On Windows, it has the annoying bug that you cannot see the cursor as you move it around.
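To make the stty point a bit more concrete, here is a minimal sketch of what controlling the terminal through an external command looks like from Java. This is not JLine's actual code: the class and method names are made up for illustration, and I'm assuming a Unix-like system where `sh`, `/dev/tty`, and `stty size` (which prints "rows columns" with the common GNU and BSD versions) are available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical sketch, not taken from JLine.
class SttySketch {
    // Parses the "rows columns" line printed by `stty size`, e.g. "24 80".
    static int[] parseSize(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected stty output: " + line);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // Asks the terminal for its size by running stty against /dev/tty (Unix only).
    static int[] queryTerminalSize() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sh", "-c", "stty size < /dev/tty").start();
        try (BufferedReader out =
                 new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = out.readLine();
            if (p.waitFor() != 0 || line == null) {
                throw new IOException("stty failed (no terminal?)");
            }
            return parseSize(line);
        }
    }

    public static void main(String[] args) {
        try {
            int[] size = queryTerminalSize();
            System.out.println(size[0] + " rows, " + size[1] + " columns");
        } catch (IOException | InterruptedException e) {
            System.out.println("could not query terminal: " + e.getMessage());
        }
    }
}
```

The answer is only valid at the moment of the call. If the window is resized afterwards, a program written this way does not notice unless it asks again, which is one way to see why this approach sits awkwardly with window size changes.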
Some work should be put into cleaning up the code base, adding sensible terminal control and more features, but as it sort of works, nobody (including me, of course) feels the urge or has the time to really do something about this.
Concerning the plotting library, probably the most well-known is JFreeChart, but I’m not really satisfied with that library for a number of reasons: Although it is open source, you have to buy a book to get some decent documentation (javadocs are available, though). JFreeChart produces some nice plots, but I think they are closer to what you get in Excel than what matlab provides. JFreeChart also comes with its own classes for handling the data which means that you have to copy your data into those structures to display them. There are some more options, but none of them seems as feature rich as JFreeChart.
One other problem is that printing is more or less broken under Linux when you’re relying on CUPS. On my Debian box, I invariably get a “No Printing Services found” error every time I try to print from any Java program. There are also some bugs which haven’t been fixed in years. The bottom line is that you cannot really rely on the built-in printing capabilities of Java to generate plots for your paper, which is really a shame.
Other options are probably to use an SVG library like Batik, or to switch to pure JavaScript graphics libraries like Raphaël or processing.js and do the plotting inside a web browser.
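To give an idea of why the SVG route is attractive, here is a small sketch that writes a line plot as an SVG string using nothing but the standard library. It does not use Batik or any other library mentioned here; the class name, method name, and the choice of a single unlabelled polyline are my own assumptions for illustration, and a real plotting layer would need axes, ticks, and labels on top.

```java
import java.util.Locale;

// Hypothetical sketch: renders y-values as one SVG polyline, x is the sample index.
class SvgPlotSketch {
    static String toSvg(double[] ys, int width, int height) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double y : ys) {
            min = Math.min(min, y);
            max = Math.max(max, y);
        }
        double range = (max > min) ? (max - min) : 1.0;  // avoid dividing by zero for flat data

        StringBuilder points = new StringBuilder();
        for (int i = 0; i < ys.length; i++) {
            // Scale linearly into the pixel box; the SVG y axis points downwards.
            double px = ys.length > 1 ? i * (double) width / (ys.length - 1) : 0.0;
            double py = height - (ys[i] - min) / range * height;
            if (i > 0) {
                points.append(' ');
            }
            points.append(String.format(Locale.ROOT, "%.1f,%.1f", px, py));
        }
        return "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"" + width
             + "\" height=\"" + height + "\">"
             + "<polyline fill=\"none\" stroke=\"black\" points=\"" + points + "\"/>"
             + "</svg>";
    }

    public static void main(String[] args) {
        // Save the output to a .svg file and open it in a browser to view or print it.
        System.out.println(toSvg(new double[] {0, 1, 4, 9, 16}, 300, 200));
    }
}
```

Since the result is just text, it sidesteps Java's printing stack entirely: the browser or any SVG viewer takes care of display and printing.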
So in summary, there are two main missing features: A feature rich, stable readline replacement, and a flexible plotting solution which also prints.
I haven’t talked about this at all until now, but of course there are already several machine learning toolboxes in Java or other JVM-related languages. So far, these projects are more or less ignorant of one another, so more work would be required to write some common interfaces. Here is just a short list to get you started; also look at mloss.org.
Don’t hesitate to post more links in the comments!
This is Part 2 of a series. Previous post is here.
In order to build an interactive environment like scipy on the JVM, we need the following basic ingredients:
Concerning the matrix types, there exist a number of matrix libraries, both in pure Java and based on native code which provide the required functionality. See the pointers of the java-matrix-benchmark project, as well as my own jblas library.
In this post, I’d like to concentrate on point 1. I’d like to discuss two alternatives which fit the requirements quite well. Which one you choose probably depends a lot on what you actually want to do, what your background is, and whether there are some libraries you are particularly interested in.
JRuby is the reimplementation of the Ruby language in Java. Ruby is probably best known for the Ruby on Rails web framework, which completely changed the expectation for web frameworks through its simplicity. Ruby is a dynamically typed object-oriented scripting language similar to Python, meaning that you do not specify the types of your variables and whether an operation works or not is only checked when the program is actually run.
I prefer Ruby to Python on the JVM for two reasons: JRuby is a much more active project. Its main programmer, Charles Nutter, is incredibly productive. Jython on the other hand also seems pretty mature, but somehow lacks that extra drive. The other reason is that Ruby is more expressive, in particular on the command line. Not every Python expression can be reformatted on a single line, which is probably good for readability of source code, but becomes a problem when you’re trying things out at the command line.
One nice aspect of JRuby is that you can add methods to Java classes. That can be used to make Java classes more Ruby-like, or to add operators as syntactic sugar.
Finally, another very interesting aspect of JRuby is that it is closely tied to the Ruby community, which has its own peculiarities, giving hardcore Java programmers a fresh new perspective, for example, on how to design APIs. Ruby projects generally tend to have simpler interfaces (and unfortunately insufficient documentation, too), while Java APIs often force you to remember a few dozen classes to do even simple things.
I’ve been playing around with my own little project to build a shell around JRuby called Marge which I hope to release at some point. It is based on my JRuby wrapper to jblas, and also supports other interesting features, like automatic reloading of files which have changed (the fact that Ruby is interpreted adds some extra flexibility here).
The downside of JRuby is that it might still not be fast enough when it comes to number crunching. Although JRuby is eventually compiled to Java bytecode, the code is still more or less dynamically typed, meaning that you won’t get as fast as pure Java code. In addition, compiling JRuby code to real Java classes (taking “normal” Java arguments) is a feature still under development, so you cannot easily reuse components written in JRuby from Java.
This brings us to the next language, Scala.
Scala is something like an Uber-Java which at the same time tries to be more script-like. It adds a bit of type inference, such that you don’t have that tedious repetition of type information as in Java (Map<String,String> m = new HashMap<String,String>()), while adding a lot more flexibility. For example, Scala has mix-ins (interfaces with partial implementations), operator overloading, and closures (anonymous functions). In addition, Scala also supports a more functional programming style emphasizing immutable data, which makes concurrent programming much easier in some cases. Scala also provides implicit conversions, even for the receiver of a method call, which basically gives you the ability to add methods to existing classes, just as in Ruby.
An important design choice of Scala is that it comes both with a command-line mode (with incredibly slow startup times) and a script mode, which is a special way to parse code with relaxed rules. This allows you to quickly play around with some ideas without having to set up a multi-file project as you would have to in Java.
For me, the most interesting aspect of Scala as the basis for a machine learning environment is that one language might be all you need, from the high-level scripting down to the actual number crunching. Scala tries to use primitive types wherever possible to support this. Although some work might still be necessary, this is a very interesting aspect.
There already exists a project called Scalala which provides some matrix capabilities. As far as I can tell, they rely on pure Scala, and won’t be as fast as, for example, my own jblas.
There exist many more scripting languages for the JVM, each with their own community and specialities.
Groovy is a language which is quite similar to Ruby. Actually, I feel that it is probably a bit too similar to Ruby to take into account now that I’ve started with JRuby. One nice feature of Groovy is that they have put a lot of work into cleaning up existing Java classes, for example, by providing consistent ways to get the size of an object (instead of choosing between length, size(), getLength(), etc.). We’re using Groovy’s Grails web framework for our twimpact site.
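The size inconsistency is easy to show in plain Java. The sketch below first asks three standard types for their size in three different ways, and then routes everything through one helper. The helper `sizeOf` is hypothetical and written only for this illustration; it is not Groovy's implementation, just the general idea of putting one consistent name in front of the existing classes.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;

class SizeSketch {
    // Hypothetical helper (not part of Java or Groovy): one entry point for "how big is this?".
    static int sizeOf(Object o) {
        if (o instanceof CharSequence) {
            return ((CharSequence) o).length();
        }
        if (o instanceof Collection) {
            return ((Collection<?>) o).size();
        }
        if (o instanceof Map) {
            return ((Map<?, ?>) o).size();
        }
        if (o != null && o.getClass().isArray()) {
            return Array.getLength(o);  // works for primitive and object arrays
        }
        throw new IllegalArgumentException("don't know how to size " + o);
    }

    public static void main(String[] args) {
        int[] array = {1, 2, 3};
        String text = "abcd";
        List<Integer> list = Arrays.asList(1, 2);

        // Plain Java: a field, a length() method, and a size() method.
        System.out.println(array.length + " " + text.length() + " " + list.size());

        // With the helper: one spelling for all three.
        System.out.println(sizeOf(array) + " " + sizeOf(text) + " " + sizeOf(list));
    }
}
```

Both lines print `3 4 2`. In a scripting language that can add methods to existing classes, the same idea can be attached to the classes themselves instead of living in a static helper.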
Clojure is another language for the JVM. It is closely related to LISP, including the syntax and support for macros. It has some nice ideas, in particular providing the software transactional memory abstraction for mutable data in concurrent programming (meaning that you basically alter data in isolated transactions). However, I was never really able to get used to the notation when programming mainly math. Still, an interesting project. Some people are using Clojure for ML as well, see this post by Mark Reid, or Bradford Cross’s infer.
Project Fortress looks quite interesting in terms of numeric computation for Java. However, as far as I can tell, the project is still at quite an early stage, trying out different features and so on. Certainly something to keep an eye on, though.
There exist more languages, for example, JavaScript with Rhino. I think that at the very least you would want good interoperability with Java, and probably support for overloaded operators.
So I think there are several very nice and strong choices for scripting machine learning on the JVM. Unfortunately, some things are also still missing which I’ll cover in the next post.
Just to get this out of the way first, no, I’m not going to talk about Weka, and also not about Mahout. I’m talking about an environment similar to matlab or scipy where you get a command line to do some data analysis. The basic idea is to build something similar using one of the available scripting languages for the JVM.
Ever since the Machine Learning Tools Satellite Workshop in 2005 (website seems to be gone for good), I’ve been interested in an open source alternative to matlab which is based on a modern scripting language, such that you can structure your code better. For the last few years, I’ve been interested in building something like this based on the Java Virtual Machine. Here is what I’ve learned so far. Unfortunately, there is no clear solution yet. As this is probably going to be quite long, I’ll split the article over several posts. In this article, I’ll talk about why I think one should try to build something like that on the JVM.
Okay, why would you want to do this? After all, scipy has been around for quite some while and comes with an excellent plotting library called matplotlib.
First of all, there is the issue of speed. As long as you work with the already existing matrix libraries, everything is fine, but when you are starting to work with a data type for which a native library isn’t available, performance will be a huge issue.
Of course, you can always go ahead and write your own native extension to, for example, python. But even if you use a tool like swig to take care of the tedious job of generating wrapper code, this process is still quite complicated, as you have to think about garbage collection, mapping python types to C types, and so on. For machine learning people who are not computer scientists by training, this step might be too hard.
On the other hand, for most scripting languages for the JVM like JRuby, or Groovy, accessing Java code is straightforward. This still means that you have to master two languages, but the interface between the two languages is direct, and you don’t have to worry about garbage collection, or mapping types, or wrapper code.
And for those who still believe that “Java is always slower than C by a factor of two”, be told that this no longer holds as a general rule. When it comes to for-loops just performing some numerical computations, Java is more or less on par with C (see my post on some benchmarks for matrix operations), even though array access in Java is bounds-checked (ok, unless you start to do some loop unrolling by hand, and so on). On the other hand, the JVM probably comes with one of the best garbage collectors in existence, and can literally run for weeks without experiencing memory leaks or other kinds of problems.
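For reference, this is the kind of loop I have in mind: a plain matrix-vector product on primitive arrays. It is only an illustrative sketch (the class and method names are made up, and it is not the benchmark code from the post mentioned above), but it shows that such code looks almost exactly like its C counterpart.

```java
import java.util.Arrays;

class MatVecSketch {
    // Computes y = A * x for a matrix stored row by row as double[rows][cols].
    static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double[] row = a[i];
            if (row.length != x.length) {
                throw new IllegalArgumentException("row " + i + " has the wrong length");
            }
            double sum = 0.0;
            for (int j = 0; j < row.length; j++) {
                sum += row[j] * x[j];  // array accesses are bounds-checked in Java
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] a = { {1.0, 2.0}, {3.0, 4.0} };
        double[] x = {1.0, 1.0};
        System.out.println(Arrays.toString(multiply(a, x)));  // prints [3.0, 7.0]
    }
}
```

Note that the data lives in primitive `double` arrays. Boxed types like `Double` in collections would be a very different story performance-wise.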
Next, there is the issue of multicore. Most processors sold today come with more than one core, opening up the possibility of gaining better performance by parallelizing code. Most scripting languages, including python, have only limited support for true multicore. Python has native threads, but also the so-called Global Interpreter Lock, meaning that only a single thread can run in the interpreter at any given time. On the other hand, the Java virtual machine fully supports multicore programming. As a consequence, newer paradigms for multicore programming like the actor model or software transactional memory are not available in python (or matlab, for that matter), but they are available on the JVM (see, for example, akka, or Clojure’s refs).
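As a small illustration of what multicore support means in practice, here is a sketch that splits a sum of squares over several threads using the standard `java.util.concurrent` classes. The class name and the chunking scheme are my own choices for this example. It uses plain threads from a pool, not actors or software transactional memory, which are separate libraries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSumSketch {
    // Sums the squares of data[] by giving each worker thread one contiguous chunk.
    static double sumOfSquares(double[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads;  // ceiling division
            List<Future<Double>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Double> task = () -> {
                    double s = 0.0;
                    for (int i = from; i < to; i++) {
                        s += data[i] * data[i];
                    }
                    return s;
                };
                parts.add(pool.submit(task));
            }
            double total = 0.0;
            for (Future<Double> part : parts) {
                total += part.get();  // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 2.0);
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(sumOfSquares(data, cores));
    }
}
```

The worker threads only read the shared array and each writes to its own local variable, so no locking is needed. All of them can run at the same time on different cores, since the JVM has no equivalent of the Global Interpreter Lock.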
Finally, there is really an incredible amount of high-class software projects for the JVM in particular in enterprise related areas like databases, networking, web frameworks, or infrastructure. So if you eventually need to interface to any existing enterprise environment, chances are very high that it will be running on the JVM.
There are probably more reasons, but for me, the main ones are better scripting languages (compared to matlab), easier extensions in Java, multicore programming, performance and stability, and existing libraries.
In the next post I’ll discuss the basic ingredients necessary.