MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

All Shiny and New

I’ve finally found the time, and reached the level of frustration with blogger, to move my blog to a new platform. Actually, it’s not so much a platform as a little script called Jekyll, which generates my blog as a set of static pages.

Blogger was nice to begin with, but the edit window was always too small, and having to write your posts in HTML felt so 1990s. Jekyll, on the other hand, lets you use one of a number of different wiki-style mark-ups, which just feels so much better.

As a little extra, I installed jsMath, which allows me to typeset real LaTeX like this $f(x) = \sum_{n=1}^\infty x^n$.

I couldn’t migrate the comments, but given that there was only a small amount of discussion anyway, that’s probably not much of a problem. The old blog can still be found over at blogspot, just in case.

The Open Source Process and Research

I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we’ve mostly focused on making the software side more similar to publishing scientific papers, for example by creating a special open source software track at JMLR.

However, two aspects of that process deserve more attention:

  • “Release early, release often”: Open source software is not only about making your software available for others to reuse; it is also about getting in touch with potential users as early and as closely as possible.

Contrast this with the typical publication process in science, where months pass between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.

  • Self-organized collaboration: One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is really interesting given that you don’t know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want to have users for their own project.

Again, contrast this with how you usually work in science, where it’s much more common to collaborate only with people from your group or within the same project. Even if someone were working on something which would be immensely useful for you, you wouldn’t know until months later, when their work is finally published. The effect is that there is a lot of duplicated work, research results from different groups don’t usually interact easily, and much potential for collaboration and synergy is wasted.

While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while most people probably aren’t willing to switch to open notebook science, I think there are a few things which you can try out now:

  • Communicate with people through your blog, or via Twitter or Facebook, and let them know what you’re working on, even before you have polished and published it. And if you don’t feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.

  • Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many ways to go about this: you can use a site like github, sourceforge, kenai, launchpad, or savannah, or set up a private repository, for example using trac, or just a bare subversion repository. It doesn’t have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, so that others can find it and stay up to date on releases.

  • Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and also reusing existing software yourself. It doesn’t have to be large to be useful. Have a look at mloss for a huge list of existing machine learning related software projects.

jblas release 1.0

I’ve just released jblas 1.0. For those of you who don’t know it yet, jblas is a matrix library for Java which is based on the BLAS and LAPACK routines, currently using ATLAS and lapack-lite, for maximum performance.

I’m using jblas myself for most of my research together with a wrapper in jruby which I plan to release pretty soon as well.

In any case, here are the most important new features of release 1.0:

  • The default jar file now contains all the libraries for Windows, Linux, and Mac OS X. Both 32 and 64 bit are supported (although right now the 64 bit version for Windows is still missing; I’ll take care of that after the holidays). The libraries are built for the Core2 platform, which in my experience gives the best combined performance on Intel and AMD processors (the Hammer or AMDk10h settings give slightly better performance on AMD, but inferior performance on Intel).

  • New LAPACK functions GEEV (eigenvalues of general matrices), GETRF (LU factorization), POTRF (Cholesky factorization) in NativeBlas.

  • Decompose class providing high-level access to LU and Cholesky factorization.

  • Improved support for standard Java framework: Matrix classes are serializable, matrix classes provide read-only AbstractList views for elements, rows, and vectors.

  • Permutation class for random permutations and subsets.
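
For readers who haven’t met the Cholesky factorization mentioned above, here is a minimal plain-Java sketch of what it computes: for a symmetric positive definite matrix A, a lower-triangular L with A = L * L^T. This is only an illustration. The class and method names (CholeskySketch, cholesky) are made up for this example, it does not call jblas or LAPACK, and it says nothing about how POTRF or the Decompose class are implemented or which triangle they return.

```java
import java.util.Arrays;

/** Plain-Java illustration of a Cholesky factorization A = L * L^T (hypothetical class, not jblas code). */
class CholeskySketch {

    /**
     * Returns the lower-triangular factor L with A = L * L^T.
     * Assumes a is square and symmetric; throws if it is not positive definite.
     */
    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int j = 0; j < n; j++) {
            // Diagonal entry: l[j][j] = sqrt(a[j][j] - sum over k < j of l[j][k]^2)
            double d = a[j][j];
            for (int k = 0; k < j; k++) {
                d -= l[j][k] * l[j][k];
            }
            if (d <= 0.0) {
                throw new IllegalArgumentException("matrix is not positive definite");
            }
            l[j][j] = Math.sqrt(d);

            // Entries below the diagonal in column j
            for (int i = j + 1; i < n; i++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) {
                    s -= l[i][k] * l[j][k];
                }
                l[i][j] = s / l[j][j];
            }
        }
        return l;
    }

    public static void main(String[] args) {
        double[][] a = {
            {4, 12, -16},
            {12, 37, -43},
            {-16, -43, 98}
        };
        // Print the factor row by row
        for (double[] row : cholesky(a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For the 3 x 3 matrix in main, the factor has rows (2, 0, 0), (6, 1, 0), and (-8, 5, 3). A tuned native routine does the same arithmetic in blocked form, which is where the speed comes from.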

The new jar file also comes with a little command line tool for checking the installation and running a small benchmark. If I run java -server -jar jblas-1.0.jar on my machine (Linux, 32 bit, Core 2 Duo @ 2 GHz), I get:


Simple benchmark for jblas

Running sanity benchmarks.

checking vector addition... ok
-- org.jblas CONFIG BLAS native library not found in path. Copying native library from the archive. Consider installing the library somewhere in the path (for Windows: PATH, for Linux: LD_LIBRARY_PATH).
-- org.jblas CONFIG Loading libjblas.so from /lib/static/Linux/i386/libjblas.so.
checking matrix multiplication... ok
checking existence of dsyev...... ok
checking XERBLA... ok
Sanity checks passed.

Each benchmark will take about 5 seconds...

Running benchmark "Java matrix multiplication, double precision".
n = 10   :  424.4 MFLOPS (1061118 iterations in 5.0 seconds)
n = 100  : 1272.6 MFLOPS (3182 iterations in 5.0 seconds)
n = 1000 :  928.5 MFLOPS (3 iterations in 6.5 seconds)

Running benchmark "Java matrix multiplication, single precision".
n = 10   :  445.0 MFLOPS (1112397 iterations in 5.0 seconds)
n = 100  : 1273.0 MFLOPS (3183 iterations in 5.0 seconds)
n = 1000 : 1330.9 MFLOPS (4 iterations in 6.0 seconds)

Running benchmark "ATLAS matrix multiplication, double precision".
n = 10   :  428.2 MFLOPS (1070428 iterations in 5.0 seconds)
n = 100  : 3293.9 MFLOPS (8235 iterations in 5.0 seconds)
n = 1000 : 5383.2 MFLOPS (14 iterations in 5.2 seconds)

Running benchmark "ATLAS matrix multiplication, single precision".
n = 10   :  465.2 MFLOPS (1162905 iterations in 5.0 seconds)
n = 100  : 5997.3 MFLOPS (14994 iterations in 5.0 seconds)
n = 1000 : 9186.6 MFLOPS (23 iterations in 5.0 seconds)
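
To make the numbers above easier to read, here is a small plain-Java sketch of how such MFLOPS figures can be derived. It assumes the usual convention of counting 2 * n^3 floating point operations per n x n matrix product, which matches the output above (for example, 3182 iterations at n = 100 in about 5 seconds gives roughly 1272 MFLOPS). The class MflopsSketch and its methods are hypothetical names for this illustration, not the benchmark code that ships in the jar.

```java
import java.util.Arrays;

/** Sketch of a timed matrix-multiplication benchmark reporting MFLOPS (hypothetical, not the jblas benchmark). */
class MflopsSketch {

    /** Naive product of two n-by-n matrices stored as row-major arrays. */
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    /** MFLOPS for the given number of n-by-n products, counting 2 * n^3 operations each (assumed convention). */
    static double mflops(int n, long iterations, double seconds) {
        double flops = 2.0 * n * n * n * iterations;
        return flops / seconds / 1e6;
    }

    public static void main(String[] args) {
        int n = 100;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);

        long iterations = 0;
        double checksum = 0.0; // consume the results so the JIT cannot drop the work
        long start = System.nanoTime();
        double seconds;
        do {
            checksum += multiply(a, b, n)[0];
            iterations++;
            seconds = (System.nanoTime() - start) / 1e9;
        } while (seconds < 1.0); // one second here; the run above used about five

        System.out.printf("n = %d : %.1f MFLOPS (%d iterations in %.1f seconds), checksum %.1f%n",
                n, mflops(n, iterations, seconds), iterations, seconds, checksum);
    }
}
```

The numbers this prints depend on the machine and the JVM, so they are only comparable to the run above in spirit.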

Resources: