Monday, January 10, 2011

jblas 1.2.0: A look behind the scenes

File under: machine room

I’ve just released jblas 1.2.0. The main additions are generalized eigenvalues and some support for 64-bit Windows in the form of pure FORTRAN (i.e. non-ATLAS) libraries.

To celebrate this event, I thought I’d let you in on some of the internals behind jblas, if only to make sure that you never want to do this yourself ;)

The short version of what jblas does is that it builds a matrix library on top of high-performance BLAS and LAPACK implementations like ATLAS.

As usual, the long version is a bit more involved. Here are a few of the things that need to be done to achieve this:

Automating Stub Generation

Since writing the JNI stubs is highly repetitive, I actually wrote a bit of Ruby which parses the FORTRAN source code for BLAS and LAPACK, extracts the signatures of the FORTRAN functions, and automatically generates the JNI stubs. This is the code you find in the scripts subdirectory.
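The actual generator is Ruby and lives in the scripts directory; purely to illustrate the idea, here is a minimal Java sketch of pulling a routine name and argument list out of a FORTRAN signature line. The regex and the example signature are simplifications for demonstration, not what the real scripts use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy version of the signature extraction: match a FORTRAN SUBROUTINE
// line and split out the routine name and its argument names.
public class FortranSignature {
    static final Pattern SIG =
        Pattern.compile("SUBROUTINE\\s+(\\w+)\\s*\\(([^)]*)\\)");

    // Returns {name, arg1, arg2, ...} or null if the line doesn't match.
    static String[] parse(String line) {
        Matcher m = SIG.matcher(line);
        if (!m.matches()) return null;
        String[] args = m.group(2).split("\\s*,\\s*");
        String[] result = new String[args.length + 1];
        result[0] = m.group(1);
        System.arraycopy(args, 0, result, 1, args.length);
        return result;
    }

    public static void main(String[] args) {
        String[] sig = parse("SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)");
        System.out.println(sig[0] + " takes " + (sig.length - 1) + " arguments");
    }
}
```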

jblas actually does a bit more than just parse out the type signatures. BLAS and LAPACK both use highly standardized comment sections which also identify which of the variables are input and which are output (FORTRAN always passes by reference, so a function can write to all of its arguments). I use this information to be more selective when freeing arrays in the stubs. In JNI, when you release an array, you can indicate whether you want to copy back the changes or not (JNI_ABORT vs. passing 0 as the mode). Since this copying back and forth is an expensive operation, I try to identify when it is not necessary and skip copying the data back in those cases.
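The stubs themselves are C, but the decision rule is simple enough to sketch in Java: an argument documented as output (or input/output) gets release mode 0 (copy back and free), everything else gets JNI_ABORT (free without copying). The mode values 0 and 2 come from jni.h; the intent strings here are just placeholders for what is parsed out of the LAPACK comment blocks.

```java
// Models the copy-back decision made by the generated JNI stubs.
// From jni.h: mode 0 = copy changes back and free the native buffer,
// JNI_ABORT (2) = free without copying back.
public class ReleaseMode {
    static final int COPY_BACK = 0;
    static final int JNI_ABORT = 2;

    // intent is the parameter role parsed from the standardized
    // FORTRAN comment section: "input", "output", or "input/output".
    static int releaseMode(String intent) {
        return intent.contains("output") ? COPY_BACK : JNI_ABORT;
    }

    public static void main(String[] args) {
        System.out.println("input arg       -> mode " + releaseMode("input"));
        System.out.println("input/output arg -> mode " + releaseMode("input/output"));
    }
}
```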

The generated stub code also checks whether an array is used in more than one place (i.e. when you pass the same array to a function in different arguments), in order to further minimize the number of copy operations. For some operations, like copying data within one array, this alias detection is strictly necessary: if you copied the array twice, whether the changes get copied back would depend on the order in which you release the arrays.
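A rough model of that alias handling, again as a Java sketch rather than the actual C stub logic: release each distinct array (by identity) exactly once, and copy back if any of its uses was an output.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of the alias detection described above: one release per
// distinct array object, with copy-back winning if the array was an
// output in any of its argument positions.
public class AliasCheck {
    static final int COPY_BACK = 0;   // jni.h mode 0
    static final int JNI_ABORT = 2;   // jni.h JNI_ABORT

    // arrays[i] / isOutput[i] describe each argument position of a call.
    static Map<double[], Integer> releasePlan(double[][] arrays, boolean[] isOutput) {
        Map<double[], Integer> plan = new IdentityHashMap<>();
        for (int i = 0; i < arrays.length; i++) {
            int mode = isOutput[i] ? COPY_BACK : JNI_ABORT;
            Integer prev = plan.get(arrays[i]);
            // If the array is aliased, COPY_BACK overrides JNI_ABORT.
            if (prev == null || mode == COPY_BACK)
                plan.put(arrays[i], mode);
        }
        return plan;
    }

    public static void main(String[] args) {
        double[] a = new double[4];
        // Same array passed as both input and output, as in a copy
        // within one array: released once, with copy-back.
        Map<double[], Integer> plan =
            releasePlan(new double[][]{a, a}, new boolean[]{false, true});
        System.out.println("releases: " + plan.size() + ", mode: " + plan.get(a));
    }
}
```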

Another issue with LAPACK is the automatic computation of workspace sizes. Many of the routines require additional work space, and LAPACK has a standard convention for querying the amount of space required (usually by calling the routine with a specific flag). Again, this type of code is highly repetitive, so I added code to detect workspace variables (usually ending in WORK) and generate the corresponding query code on the Java side.
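The convention in question is the LWORK = -1 query: called that way, the routine does no work and instead writes the optimal workspace size into WORK(1). Here is that two-call pattern in pure Java, with a stand-in for the native routine so the sketch runs without LAPACK; the stand-in (and its "optimal" size of 64) is made up for the demo.

```java
// The LAPACK workspace-query convention: call once with lwork = -1 to
// learn the optimal workspace size, then allocate and call for real.
public class WorkspaceQuery {
    // Placeholder for a native LAPACK call -- NOT part of jblas.
    // With lwork == -1 it only reports the workspace size it wants.
    static void fakeRoutine(double[] work, int lwork) {
        if (lwork == -1) {
            work[0] = 64.0;  // "optimal" size, invented for the demo
            return;
        }
        // ... the real computation would use work[0..lwork-1] here ...
    }

    static int queryWorkspaceSize() {
        double[] query = new double[1];
        fakeRoutine(query, -1);          // workspace-size query call
        return (int) query[0];
    }

    public static void main(String[] args) {
        int lwork = queryWorkspaceSize();
        double[] work = new double[lwork];
        fakeRoutine(work, lwork);        // the actual call
        System.out.println("allocated workspace of size " + lwork);
    }
}
```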

Finally, depending on whether you use f2c or gfortran, there are different calling conventions for passing back complex numbers.

More Code Generation

Another area where I resorted to code generation was the float versions of all routines. Since Java isn’t generic over primitive types, you would basically have to write a float version of every double version by hand. I’ve automated this again with some Ruby scripts (one which generates, for example, FloatMatrix from DoubleMatrix, and one which duplicates each function with a float version in classes like Eigen).
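The heart of that generation step is mechanical textual substitution. A minimal sketch of the idea (the real Ruby scripts are considerably more careful about literals, Javadoc, and so on; these three rules are a simplification):

```java
// Toy version of the double -> float code generation: derive float
// source from double source by substitution. Order matters: rewrite
// DoubleMatrix before the bare keyword "double".
public class FloatVersionGenerator {
    static String toFloatVersion(String doubleSource) {
        return doubleSource
            .replace("DoubleMatrix", "FloatMatrix")
            .replace("double", "float")
            .replaceAll("(\\d+\\.\\d+)", "$1f");  // 2.0 -> 2.0f
    }

    public static void main(String[] args) {
        String src =
            "public DoubleMatrix scale(double factor) { return mul(factor * 2.0); }";
        System.out.println(toFloatVersion(src));
    }
}
```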

These Ruby scripts are run as part of the build process.

Multi-platform Jars

The jar file contains the shared libraries for each operating system and processor subtype (where applicable). In order to determine the operating system, jblas uses the os.name and os.arch system properties. For distinguishing between SSE2 and SSE3, a bit more magic is necessary. In the class org.jblas.util.ArchFlavor, I again use some native code to invoke the CPUID instruction to determine the processor’s capabilities.
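The property-based part looks roughly like this; note that the directory layout and file names below are hypothetical, chosen just to show the mapping from os.name/os.arch to a resource path (the real paths are whatever is inside the jblas jar).

```java
// Sketch of platform detection via system properties. The path layout
// ("linux/amd64/libjblas.so" etc.) is invented for this example.
public class PlatformDetect {
    static String libraryPath(String osName, String osArch) {
        String os = osName.toLowerCase().startsWith("windows")
                  ? "windows" : osName.toLowerCase();
        String ext = os.equals("windows") ? ".dll"
                   : os.equals("mac os x") ? ".jnilib" : ".so";
        return os + "/" + osArch + "/libjblas" + ext;
    }

    public static void main(String[] args) {
        // In the library this would be driven by:
        //   System.getProperty("os.name"), System.getProperty("os.arch")
        System.out.println(libraryPath("Linux", "amd64"));
    }
}
```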

Once jblas has identified the right shared library, it is extracted from the jar file with getResourceAsStream and copied to a temporary directory, from where the shared library is loaded with System.load().
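That extract-and-load step can be sketched like this. To keep the demo self-contained, a ByteArrayInputStream stands in for the getResourceAsStream result, and the System.load() call is left commented out since it only succeeds with a real shared library.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Sketch: copy a shared library out of the jar into a temp file, then
// hand the temp path to System.load().
public class ExtractAndLoad {
    static File extractToTemp(InputStream in, String name) {
        try {
            File tmp = File.createTempFile(name, ".so");
            tmp.deleteOnExit();
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (Exception e) {
            throw new RuntimeException("could not extract " + name, e);
        }
    }

    public static void main(String[] args) {
        // Real use: InputStream in = SomeClass.class.getResourceAsStream(path);
        InputStream in = new ByteArrayInputStream(new byte[]{1, 2, 3});
        File lib = extractToTemp(in, "libjblas");
        // System.load(lib.getAbsolutePath());  // would load the real library
        System.out.println("extracted " + lib.length() + " bytes");
    }
}
```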

The jblas Build Process

The build process is divided into a native part which generates the JNI stubs, and a Java part which regenerates the float versions and compiles the Java classes. This means that in the ideal case where you are just adding more functionality on the Java side, you don’t have to go through the native process at all, but can just work with all the shared libraries which are contained in src/main/resources.

The configure script is actually something homebrewed in Ruby. At the time it seemed to me that, given the mix of C and Java, and quite specific tasks like finding the right LAPACK library containing all the required functions, I’d be happier writing something myself than trying to make autotools do it. Actually, the configure script is structured like a Rakefile in terms of interdependent configure variables which are then automatically invoked in the right order, but that is another story… .
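The Rake-like idea boils down to dependency-ordered evaluation: each configure variable declares what it depends on, and resolving one variable first resolves its dependencies depth-first. A minimal sketch (the variable names here are invented for the example, not the ones the real script uses):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of dependency-ordered configure variables: resolve() computes
// a variable only after all of its dependencies have been computed.
// (No cycle detection -- this is just an illustration.)
public class ConfigureVars {
    final Map<String, List<String>> deps = new LinkedHashMap<>();
    final Set<String> resolved = new LinkedHashSet<>();  // insertion order

    void define(String name, String... dependsOn) {
        deps.put(name, List.of(dependsOn));
    }

    void resolve(String name) {
        if (resolved.contains(name)) return;
        for (String d : deps.get(name)) resolve(d);  // depth-first
        resolved.add(name);  // "compute" the variable here
    }

    public static void main(String[] args) {
        ConfigureVars c = new ConfigureVars();
        c.define("fortran_compiler");
        c.define("lapack_lib", "fortran_compiler");
        c.define("jni_stubs", "lapack_lib", "fortran_compiler");
        c.resolve("jni_stubs");
        System.out.println(c.resolved);
    }
}
```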

The only time you need to touch the shared libraries is when you add new FORTRAN routines. Unfortunately, this also means you have to regenerate the code for all platforms, which is the reason why such releases take me a few days to finish, as I don’t have all the computers available in one place.

In summary…

In summary, there is a lot going on behind the scenes to give you just that: a jar file which you can simply put on your classpath and which provides you with really high-performance matrix routines.

Posted by Mikio L. Braun at 2011-01-10 16:35:00 +0100
