Monday, January 10, 2011
jblas 1.2.0: A look behind the scenes
File under: machine room
I’ve just released jblas 1.2.0. The main additions are generalized eigenvalues and some support for 64-bit Windows in the form of pure FORTRAN (i.e. non-ATLAS) libraries:
jblas now has routines for generalized eigenvalues for symmetric matrices. The code has been provided by Nicolas Oury. See org.jblas.Eigen.
Finally, jblas comes with prebuilt libraries for 64-bit Windows. The bundled libraries are not ATLAS, which still cannot be compiled using cygwin, but lapack-lite. They aren’t terribly fast (matrix-matrix multiplication is only about 50% faster than what I managed to do in pure Java), but at least you have the full functionality on 64-bit Windows.
To celebrate this event, I thought I’d let you in on some of the internals behind jblas, if only to make sure that you never want to do this yourself ;)
The short version of what jblas does is that it builds a matrix library on high performance BLAS and LAPACK implementations like ATLAS.
As usual, the long version is a bit more involved. Here are a few of the things that need to be done to achieve this:
Compile ATLAS or another implementation of BLAS and LAPACK.
Create JNI stubs for each FORTRAN routine you want to package. Note that JNI is for C, so actually you have to bridge between C and FORTRAN as well in the stubs by translating C to FORTRAN calling conventions.
Create Java classes with lots of “native” methods so that Java knows about your functions.
Finally, write the matrix classes which use the native code.
For ease of use, package the shared libraries into the jar file and have them extracted and loaded automatically for the right operating system, platform, and possibly processor type.
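To give an idea of what the "native" methods in the third step look like, here is a minimal sketch. It is not the actual jblas source: the class name NativeBlasSketch is hypothetical, and the signature is only modeled on the argument list of the BLAS routine DDOT. The main method inspects the declaration via reflection instead of calling it, because no shared library is loaded here.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical class for illustration; the real jblas classes may look different.
final class NativeBlasSketch {
    // Declared "native": the body would live in a JNI stub inside a shared library.
    // Modeled on the BLAS routine DDOT(N, DX, INCX, DY, INCY).
    static native double ddot(int n, double[] dx, int incx, double[] dy, int incy);

    // Reports whether the named method is declared native, without calling it.
    static boolean isNative(String name) {
        for (Method m : NativeBlasSketch.class.getDeclaredMethods()) {
            if (m.getName().equals(name)) {
                return Modifier.isNative(m.getModifiers());
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Calling ddot here would throw UnsatisfiedLinkError: no library is loaded.
        System.out.println("ddot is native: " + isNative("ddot"));
    }
}
```

Until a matching shared library has been loaded, any call to such a method fails at runtime, which is why the packaging and loading step in the list matters.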
Automating Stub Generation
Since writing the JNI stubs means writing highly repetitive code, I actually wrote a bit of Ruby which parses the FORTRAN source code for BLAS and LAPACK, extracts the signatures of the FORTRAN functions and automatically generates the JNI stubs. This is the code you find in the scripts subdirectory.
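The signature extraction can be pictured roughly as follows. Note that the real scripts are written in Ruby; this is a Java sketch of the idea only, and the class name and regular expression are assumptions made for illustration. It ignores FORTRAN continuation lines and type declarations, which a real parser would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of jblas.
final class FortranSignatureSketch {
    // Matches e.g. "SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)" on a single line.
    private static final Pattern SIGNATURE = Pattern.compile(
        "(?:SUBROUTINE|FUNCTION)\\s+(\\w+)\\s*\\(([^)]*)\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns the routine name followed by its argument names,
    // or an empty list if the line holds no signature.
    static List<String> parse(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = SIGNATURE.matcher(line);
        if (!m.find()) {
            return result;
        }
        result.add(m.group(1));
        for (String arg : m.group(2).split(",")) {
            String trimmed = arg.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("      SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)"));
    }
}
```

From a parsed name and argument list like this, a generator can then emit one stub per routine.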
jblas actually does a bit more than just parse out the type signatures. BLAS and LAPACK both use highly standardized comment sections which also identify which of the variables are input and which are output (FORTRAN always passes by reference, so a routine can write to any argument passed to it). I use this information to be more selective when releasing arrays in the stubs. In JNI, when you release an array, you can indicate whether you want to copy back the changes or not (passing a mode of 0 vs. JNI_ABORT). Since this copying back and forth is an expensive operation, I try to identify when it is not necessary and do not copy the data back in those cases.
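As a rough illustration of that decision, the sketch below picks the release mode from one line of a routine's comment section. It assumes the comment style in which each argument is tagged "(input)", "(output)" or "(input/output)". The class and method names are hypothetical, and the real generator is written in Ruby, not Java.

```java
import java.util.Locale;

// Hypothetical helper; shows the decision only, not the jblas generator.
final class ReleaseModeSketch {
    // Returns the mode a generated stub would pass when releasing the array:
    // "0" copies changes back to the Java array, "JNI_ABORT" discards them.
    static String releaseMode(String commentLine) {
        String lower = commentLine.toLowerCase(Locale.ROOT);
        // Anything the routine may write to has to be copied back.
        boolean written = lower.contains("output");
        return written ? "0" : "JNI_ABORT";
    }

    public static void main(String[] args) {
        System.out.println(releaseMode("*  X       (input) DOUBLE PRECISION array"));
        System.out.println(releaseMode("*  Y       (input/output) DOUBLE PRECISION array"));
    }
}
```

For a pure input argument the copy back is skipped entirely, which is where the savings come from.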
The generated stub code also checks whether an array is used in more than one place (when you pass the same array in different arguments of a function), in order to further minimize the number of copy operations. For some operations, like copying data within one array, this alias detection is also strictly necessary: if you copied the array twice, whether the changes end up being copied back would depend on the order in which you release the two copies.
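The idea behind the alias detection can be sketched like this. The actual check happens in the generated native code, so the Java version below only illustrates the logic, and all names in it are made up for this example. For every array argument it finds the first argument that refers to the same array object, so that each distinct array is fetched and released only once.

```java
import java.util.Arrays;

// Hypothetical helper; the real check lives in the generated stubs.
final class AliasSketch {
    // For each argument, returns the index of the first argument that is
    // the very same array object (compared by reference, not by content).
    static int[] firstOccurrence(Object[] arrays) {
        int[] first = new int[arrays.length];
        for (int i = 0; i < arrays.length; i++) {
            first[i] = i;
            for (int j = 0; j < i; j++) {
                if (arrays[j] == arrays[i]) {
                    first[i] = j;
                    break;
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        double[] a = new double[4];
        double[] b = new double[4];
        // The third argument aliases the first one.
        System.out.println(Arrays.toString(firstOccurrence(new Object[] {a, b, a})));
    }
}
```

An argument whose entry differs from its own index is an alias and can reuse the pointer obtained for the earlier argument.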
Another issue with LAPACK is the automatic computation of workspace sizes. Many of the routines require additional workspace, and they have a special way of querying the amount of space required (usually by calling with a specific flag). Again, this type of code is highly repetitive, so I also added code to detect workspace variables (usually ending in WORK) and generate the query code on the Java side.
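The workspace query pattern looks roughly like the sketch below. It assumes the common LAPACK convention that calling a routine with LWORK = -1 performs no computation and returns the optimal workspace size in the first element of WORK. The Routine interface and the fake routine are stand-ins invented for this example, not jblas classes.

```java
// Hypothetical sketch of a two-step workspace query.
final class WorkspaceQuerySketch {
    // Stand-in for a LAPACK-style routine; returns an "info" code (0 = success).
    interface Routine {
        int call(double[] work, int lwork);
    }

    // First asks for the optimal size, then allocates and makes the real call.
    static int callWithWorkspace(Routine routine) {
        double[] query = new double[1];
        int info = routine.call(query, -1); // lwork = -1: size query only
        if (info != 0) {
            return info;
        }
        int lwork = Math.max(1, (int) query[0]);
        double[] work = new double[lwork];
        return routine.call(work, lwork);
    }

    // Fake routine for the demo: pretends to need "needed" doubles of workspace.
    static Routine fake(int needed) {
        return (work, lwork) -> {
            if (lwork == -1) {
                work[0] = needed;
                return 0;
            }
            // A nonzero code signals an error, here a workspace that is too small.
            return (lwork >= needed && work.length >= lwork) ? 0 : -1;
        };
    }

    public static void main(String[] args) {
        System.out.println("info = " + callWithWorkspace(fake(64)));
    }
}
```

Generating this wrapper means callers never have to deal with the workspace arguments themselves.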
Finally, depending on whether you use f2c or gfortran, there are different calling conventions for passing back complex numbers.
More Code Generation
Another area where I resorted to code generation was with float versions of all routines. Since Java isn’t generic in primitive types, you basically have to write a float version of every double version by hand. I’ve automated this process, again with some Ruby scripts (one which generates, for example, FloatMatrix from DoubleMatrix, and one which duplicates each function with a float version in classes like Eigen).
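In its simplest form, such a generator is little more than a set of textual substitutions. The following Java sketch shows the idea. The real scripts are written in Ruby and have to deal with more cases (numeric literals, casts, names of native routines), so treat the class name and the rules below as assumptions for illustration.

```java
// Hypothetical sketch of a double-to-float source translation.
final class FloatVersionSketch {
    // Rewrites a piece of Java source from the double to the float variant.
    static String toFloatVersion(String source) {
        return source
            .replace("DoubleMatrix", "FloatMatrix")
            .replaceAll("\\bdouble\\b", "float")
            .replaceAll("\\bDouble\\b", "Float");
    }

    public static void main(String[] args) {
        System.out.println(toFloatVersion("public DoubleMatrix add(double v)"));
    }
}
```

The word boundaries keep identifiers that merely contain "double" from being rewritten by accident.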
These Ruby scripts are run as part of the build process.
The jar file contains the shared libraries for each operating system and processor subtype (where applicable). In order to determine the operating system and architecture, jblas uses the os.name and os.arch system properties. For distinguishing between SSE2 and SSE3, a bit more magic is necessary. In the class org.jblas.util.ArchFlavor, I again use some native code to invoke the CPUID instruction to determine the processor’s capabilities.
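Mapping the two system properties to a directory inside the jar could look like the sketch below. The directory layout ("lib/<os>/<arch>"), the normalization rules and the class name are assumptions for this example and do not describe the layout jblas actually uses.

```java
import java.util.Locale;

// Hypothetical sketch; the layout inside the real jar may differ.
final class PlatformSketch {
    // Collapses the many possible os.name values into a few families.
    static String osFamily(String osName) {
        String n = osName.toLowerCase(Locale.ROOT);
        if (n.startsWith("windows")) return "Windows";
        if (n.startsWith("mac")) return "Mac OS X";
        if (n.startsWith("linux")) return "Linux";
        return osName;
    }

    // Different JVMs report the same architecture under different names.
    static String normalizeArch(String osArch) {
        String a = osArch.toLowerCase(Locale.ROOT);
        if (a.equals("amd64") || a.equals("x86_64")) return "amd64";
        if (a.equals("x86") || a.matches("i[3-6]86")) return "i386";
        return a;
    }

    static String libraryDir(String osName, String osArch) {
        return "lib/" + osFamily(osName) + "/" + normalizeArch(osArch);
    }

    public static void main(String[] args) {
        System.out.println(libraryDir(
            System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

The processor flavor (SSE2 vs. SSE3) cannot be derived from system properties, which is why the native CPUID check is needed on top of this.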
Once jblas has identified the right shared library, it is extracted from the jar file with getResourceAsStream and copied to a temp directory, from where the shared library is loaded with System.load().
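A minimal sketch of this extract-and-load step is shown below. It is not the jblas implementation: the class name and the temp file prefix are hypothetical, and it uses the java.nio.file API for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of extracting a bundled library and loading it.
final class LibraryExtractorSketch {
    // Copies a stream (e.g. from getResourceAsStream) into a fresh temp file.
    static Path extractToTemp(InputStream in, String suffix) {
        try {
            Path tmp = Files.createTempFile("nativelib", suffix);
            tmp.toFile().deleteOnExit();
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns false if the resource is not on the classpath.
    static boolean loadFromJar(String resourcePath, String suffix) {
        try (InputStream in =
                 LibraryExtractorSketch.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                return false;
            }
            // System.load needs an absolute path, unlike System.loadLibrary.
            System.load(extractToTemp(in, suffix).toAbsolutePath().toString());
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadFromJar("/no/such/library.so", ".so"));
    }
}
```

The detour through a temp file is needed because System.load() only accepts a path on the file system and cannot read a library directly from inside a jar.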
The jblas Build Process
The build process is divided into a native part which generates the JNI stubs, and a Java part which regenerates the float versions and compiles the Java classes. This means that in the ideal case where you are just adding more functionality on the Java side, you don’t have to go through the native process at all, but can just work with all the shared libraries which are contained in src/main/resources.
The configure script is actually something homebrewed in Ruby. At the time it seemed to me that, given the mix of C and Java and quite specific tasks like finding out which LAPACK library contains all the required functions, I’d be happier writing something myself than trying to make autotools do it. The configure script is structured like a Rakefile, in terms of interdependent configure variables which are then automatically evaluated in the right order, but that is another story…
The only time you need to touch the shared libraries is when you add new FORTRAN routines. Unfortunately, this also means you have to regenerate the code for all platforms, which is the reason why such releases take me a few days to finish, as I don’t have all computers available in one place.
In summary, there is a lot going on behind the scenes to give you just that: a jar file which you can just put into your classpath and which provides you with really high-performance matrix routines.
Posted by Mikio L. Braun at 2011-01-10 16:35:00 +0100