Command Line Interactive Machine Learning on the JVM. Part 3: Missing Parts

Monday, April 19, 2010

This is Part 3 of a series. Part 1, Part 2.

In Part 2, I discussed different scripting options for a command line environment for machine learning data analysis. In this final part, I want to mention the two areas where I currently see the most need for improvement.

No good readline for Java

You need some minimal editing capabilities on the command line to be productive. The most well-known project seems to be JLine. It is used by the shells of practically all JVM scripting languages, for example JRuby, Groovy, and Scala. There is also an interface to GNU readline from Java, but GNU readline is deliberately distributed under the GPL, which is quite incompatible with less restrictive licenses like BSD.

However, in its current form, JLine is quite buggy. Most importantly, it lacks the convenient “Search Backward in History” feature, which I use a lot to find earlier lines in the history. I and many others have forked JLine to clean up the code base and add features. For example, I’ve added the search facility (works OK on Linux, try at your own risk ;), while Jason Dillon has cleaned up the code base significantly.
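To make the “Search Backward in History” idea concrete, here is a minimal sketch in plain Java. It is not taken from JLine or from any of the forks; the class HistorySearch and the method searchBackward are hypothetical names, and a real line editor would also handle incremental typing and redrawing the prompt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JLine: a minimal reverse history search.
class HistorySearch {
    // Returns the index of the most recent entry at or before 'from'
    // that contains 'query', or -1 if there is no such entry.
    static int searchBackward(List<String> history, String query, int from) {
        for (int i = Math.min(from, history.size() - 1); i >= 0; i--) {
            if (history.get(i).contains(query)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> history = new ArrayList<>();
        history.add("x = load(\"data.csv\")");
        history.add("plot(x)");
        history.add("y = load(\"labels.csv\")");

        // First search starts at the newest entry.
        int hit = searchBackward(history, "load", history.size() - 1);
        System.out.println(history.get(hit));

        // Pressing the search key again continues just before the last hit.
        int next = searchBackward(history, "load", hit - 1);
        System.out.println(history.get(next));
    }
}
```

Repeating the search from one position before the previous hit is what lets you step further back through older matches.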

Still, JLine is actually quite a hack. It uses the stty command to control the terminal, which means that it copes quite poorly with changes of the terminal window size and with signals. On Windows, it has the annoying bug that you cannot see the cursor as you move it around.
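The following sketch shows roughly what controlling the terminal through stty looks like from Java. It is an illustration under assumptions, not JLine's actual code: the class SttySketch and its methods are hypothetical, and it assumes a Unix-like system with sh, stty, and /dev/tty available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical sketch of stty-based terminal access; not JLine's actual code.
class SttySketch {
    // Runs stty against the controlling terminal and returns the first
    // line of its output, or null if there is no output (e.g. no terminal).
    static String stty(String args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sh", "-c", "stty " + args + " < /dev/tty").start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();
            return line;
        }
    }

    // Parses the "rows columns" output of `stty size`, e.g. "24 80".
    static int[] parseSize(String output) {
        String[] parts = output.trim().split("\\s+");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) throws Exception {
        String out = stty("size");
        if (out == null) {
            System.out.println("no terminal available");
            return;
        }
        int[] size = parseSize(out);
        System.out.println(size[0] + " rows, " + size[1] + " columns");
    }
}
```

In this sketch every query spawns a separate process, and nothing tells the program when the window is resized, so the size it holds goes stale until it asks again.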

Some work should be put into cleaning up the code base and adding sensible terminal control and more features, but since it sort of works, nobody (including me, of course) feels the urge or has the time to really do something about this.

No flexible plotting for Java

Concerning the plotting library, probably the most well-known is JFreeChart, but I’m not really satisfied with that library for a number of reasons. Although it is open source, you have to buy a book to get some decent documentation (javadocs are available, though). JFreeChart produces some nice plots, but I think they are closer to what you get in Excel than to what MATLAB provides. JFreeChart also comes with its own classes for handling the data, which means that you have to copy your data into those structures to display it. There are some more options, but none of them seems as feature-rich as JFreeChart.

One other problem is that printing is more or less broken under Linux when you’re relying on CUPS. On my Debian box, I invariably get a “No Printing Services found” error every time I try to print from any Java program. There are also some bugs which haven’t been fixed in years. The bottom line is that you cannot really rely on the built-in printing capabilities of Java to generate plots for your paper - which is really a shame.
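A quick way to check what Java itself can see is to ask the standard javax.print API for the available print services. The sketch below is only a diagnostic; the class name PrintCheck is hypothetical, and an empty list does not tell you why the services are missing.

```java
import javax.print.PrintService;
import javax.print.PrintServiceLookup;

// Hypothetical diagnostic: lists the print services visible to the JVM.
class PrintCheck {
    // Returns the names of all print services, regardless of document
    // flavor or attributes (both filters are passed as null).
    static String[] printerNames() {
        PrintService[] services = PrintServiceLookup.lookupPrintServices(null, null);
        String[] names = new String[services.length];
        for (int i = 0; i < services.length; i++) {
            names[i] = services[i].getName();
        }
        return names;
    }

    public static void main(String[] args) {
        String[] names = printerNames();
        if (names.length == 0) {
            System.out.println("Java sees no print services");
        }
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```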

Other options are probably to use an SVG library like Batik, or to switch to pure JavaScript graphics libraries like Raphaël or processing.js and do the plotting inside a web browser.
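To give an idea of how little is needed for a basic SVG line plot, here is a sketch that writes the SVG markup by hand. It does not use Batik or any other library; SvgSketch, polyline, and scale are hypothetical names, the input arrays are assumed to be non-empty and of equal length, and axes, ticks, and labels are left out entirely.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: a bare-bones line plot as a hand-written SVG string.
class SvgSketch {
    // Maps v from [lo, hi] to [0, 1]; a constant series is centered.
    static double scale(double v, double lo, double hi) {
        return hi == lo ? 0.5 : (v - lo) / (hi - lo);
    }

    // Returns an SVG document containing one polyline through the data.
    // Assumes xs and ys are non-empty and have the same length.
    static String polyline(double[] xs, double[] ys, int width, int height) {
        double minX = Arrays.stream(xs).min().getAsDouble();
        double maxX = Arrays.stream(xs).max().getAsDouble();
        double minY = Arrays.stream(ys).min().getAsDouble();
        double maxY = Arrays.stream(ys).max().getAsDouble();

        StringBuilder points = new StringBuilder();
        for (int i = 0; i < xs.length; i++) {
            double px = scale(xs[i], minX, maxX) * width;
            // The SVG y axis points down, so flip the data's y direction.
            double py = (1 - scale(ys[i], minY, maxY)) * height;
            points.append(String.format(Locale.ROOT, "%.1f,%.1f ", px, py));
        }
        return "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"" + width
                + "\" height=\"" + height + "\">"
                + "<polyline fill=\"none\" stroke=\"black\" points=\""
                + points.toString().trim() + "\"/></svg>";
    }

    public static void main(String[] args) {
        double[] xs = new double[50];
        double[] ys = new double[50];
        for (int i = 0; i < xs.length; i++) {
            xs[i] = i * 0.2;
            ys[i] = Math.sin(xs[i]);
        }
        System.out.println(polyline(xs, ys, 400, 200));
    }
}
```

Redirecting the output into a .svg file gives something a browser can display and that can be converted to other formats with external tools.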

So in summary, there are two main missing features: a feature-rich, stable readline replacement, and a flexible plotting solution which also prints.

Some pointers

I haven’t talked about this at all until now, but of course there are already several machine learning toolboxes in Java or other JVM languages. So far, these projects are more or less ignorant of one another, so more work would be required to write some common interfaces. Here is just a short list to get you started; also look at mloss.org.

  • Weka is quite mature and comes with a GUI to do experiments.
  • JavaML is a collection of many common machine learning algorithms.
  • Apache Mahout is a library for doing map-reduce-style machine learning on a Hadoop cluster.
  • Finally, there are also several more specialized projects, for example RL Glue and Codecs for reinforcement learning, or factorie for graphical models.

Don’t hesitate to post more links in the comments!

Posted by Mikio L. Braun at 2010-04-19 12:55:00 +0000
