Marginally Interesting

Matrices, JNI, DirectBuffers, and Number Crunching in Java

One thing Java lacks is a fast matrix library. There are things like JAMA and COLT (when not using its LAPACK bindings), but compared to state-of-the-art implementations like ATLAS, they are orders of magnitude slower.

Therefore, some time ago I set out to write my own matrix classes based on ATLAS. It turns out that while this is quite straightforward most of the time, there are a few issues which become complicated very quickly, to the point where I wonder whether Java is really the right platform for this kind of application at all.

Basically, there are two areas where the Java platform can give you quite a headache:

  • Passing large amounts of data to native code.
  • Lack of control over how well the JIT compiles your code.

Passing Data to Native Code

For a number of reasons, producing highly tuned linear algebra code still requires a lot of low-level fiddling with machine code. If you want really fast code, you have to rely on libraries like ATLAS. The question is then how you pass the data from Java to ATLAS.

Maybe the easiest choice is to store the data in primitive arrays like double[] and then access them via JNI. However, the documentation is quite explicit that this does not work well for large arrays: you will typically get a copy of the array, and afterwards the result has to be copied back.

The solution proposed by Sun is to use direct buffers, which let you allocate a piece of memory outside of the VM that can then be passed directly to native code.
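As a minimal sketch of that approach, this is roughly how one allocates off-heap memory from Java (the helper name allocateDoubles is my own; a native library would then obtain the buffer's address via JNI's GetDirectBufferAddress):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class DirectAlloc {
    // Allocate a direct buffer backing n doubles, in native byte order so
    // the native side sees the values without any byte swapping.
    static DoubleBuffer allocateDoubles(int n) {
        return ByteBuffer.allocateDirect(n * Double.BYTES)
                         .order(ByteOrder.nativeOrder())
                         .asDoubleBuffer();
    }

    public static void main(String[] args) {
        DoubleBuffer buf = allocateDoubles(4);
        buf.put(new double[] {1.0, 2.0, 3.0, 4.0});
        System.out.println(buf.isDirect());  // true: memory lives outside the Java heap
        System.out.println(buf.get(2));      // 3.0
    }
}
```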

This was also the solution adopted by our jblas library. However, it quickly turned out that the memory management for DirectBuffers is anything but perfect. As has already been pointed out by others, the main problem is that Java does not count the memory held by DirectBuffers toward its memory footprint. What this means is that if you allocate a large number of DirectBuffers, the JVM will happily fill all of your memory, and potentially the swap space, without triggering a single garbage collection.

After this happened a few times, I began wondering whether DirectBuffers are garbage collected at all, but it seems that everything is implemented reasonably well: the memory allocated for direct buffers is tracked via PhantomReferences so that the JVM is notified when the corresponding object is garbage collected, and if you call System.gc() by hand often enough, your DirectBuffers eventually get freed. That is, most of the time, at least.
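The mechanism can be observed from Java code with a small, admittedly hand-wavy experiment: attach your own PhantomReference to a direct buffer, drop the strong reference, and nudge the collector. Nothing forces the JVM to actually reclaim the buffer, which is exactly the robustness problem described above, so the sketch retries a few times rather than promising a result:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.nio.ByteBuffer;

public class PhantomDemo {
    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<ByteBuffer> queue = new ReferenceQueue<>();
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);  // 1 MB off-heap
        PhantomReference<ByteBuffer> ref = new PhantomReference<>(buf, queue);

        buf = null;  // drop the only strong reference
        boolean collected = false;
        for (int i = 0; i < 50 && !collected; i++) {
            System.gc();  // a request, not a command
            collected = queue.remove(100) != null;  // wait up to 100 ms per attempt
        }
        System.out.println(collected
            ? "buffer was collected"
            : "buffer not collected within timeout");
    }
}
```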

The problem is that there is hardly a way around this. Maybe it's my fault, because direct buffers were never intended to be allocated in large numbers. I guess the typical use is to allocate one direct buffer of some size and then use that fixed buffer to transfer data between native code and the JVM.
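That fixed-buffer pattern might look like the following sketch (StagingBuffer and its method names are hypothetical; a real library would pass the buffer's address to native code such as ATLAS between stage and unstage):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.util.Arrays;

public class StagingBuffer {
    // One fixed direct buffer, reused for every native call, so off-heap
    // allocation happens exactly once and the GC problem never arises.
    private final DoubleBuffer staging;

    StagingBuffer(int capacity) {
        staging = ByteBuffer.allocateDirect(capacity * Double.BYTES)
                            .order(ByteOrder.nativeOrder())
                            .asDoubleBuffer();
    }

    // Copy a chunk of a Java array into the staging buffer before a native call.
    void stage(double[] src, int offset, int length) {
        staging.clear();
        staging.put(src, offset, length);
        staging.flip();
    }

    // Copy results back out after the native call returns.
    void unstage(double[] dst, int offset, int length) {
        staging.get(dst, offset, length);
    }

    public static void main(String[] args) {
        StagingBuffer s = new StagingBuffer(1024);
        double[] in = {1, 2, 3}, out = new double[3];
        s.stage(in, 0, 3);
        s.unstage(out, 0, 3);
        System.out.println(Arrays.toString(out));  // [1.0, 2.0, 3.0]
    }
}
```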

The only problem is that this is basically the copying scenario which direct buffers were supposed to help avoid...

I've written a little program which compares the time it takes to copy an array using the GetDoubleArrayElements() function versus the bulk put and get functions of DoubleBuffer. The results look like this (on a 2 GHz Intel Core 2):

What is a bit funny (and brings me directly to the second point) is that GetDoubleArrayElements takes longer than DoubleBuffer.put() and get() beyond a size of about 100 KB. Both routines basically have to perform the same task of copying data from a Java primitive array into a piece of raw memory, but for some reason, the DoubleBuffer variant seems able to sustain its throughput for much larger amounts of memory.
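The Java side of such a measurement can be sketched as follows (the JNI half naturally lives in C and is omitted here; timings will of course vary by machine and JVM):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class CopyBench {
    public static void main(String[] args) {
        int n = 1 << 20;  // 1M doubles = 8 MB
        double[] src = new double[n];
        DoubleBuffer direct = ByteBuffer.allocateDirect(n * Double.BYTES)
                                        .order(ByteOrder.nativeOrder())
                                        .asDoubleBuffer();
        // Warm up so the JIT compiles the copy paths before we measure.
        for (int i = 0; i < 10; i++) { direct.clear(); direct.put(src); }

        long t0 = System.nanoTime();
        direct.clear();
        direct.put(src);           // bulk copy: Java heap -> off-heap memory
        long t1 = System.nanoTime();

        direct.clear();
        double[] back = new double[n];
        direct.get(back);          // bulk copy: off-heap memory -> Java heap
        long t2 = System.nanoTime();

        System.out.printf("put: %.2f ms, get: %.2f ms%n",
                          (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}
```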

Code Optimization

The second point is that it is more difficult than in a traditionally compiled language to control how fast your code runs after it has been JIT-compiled. In principle, there is no reason why this should be true, but I've observed a few funny effects which have caused me to doubt that the HotSpot technology is optimized for number crunching.

For example, as I've discussed elsewhere, copying data between arrays, or accessing DirectBuffers, leads to drastically different performance on the same virtual machine depending on the byte-code compiler. For some reason, the Eclipse compiler seems to produce byte code which leads to much faster machine code than Sun's own compiler does on Sun's HotSpot JVM.

What this means is that it is very hard to optimize code in Java, because you cannot be sure that the JIT compiler of the virtual machine will turn it into efficient machine code. With compiled languages the situation is in principle similar, but at least you can control which compiler you use and optimize against that.

Summary

So I'm not quite sure what all this means. Maybe some JVM guru can enlighten me on what is happening here. I don't even see what is wrong with the way direct buffers are implemented, but the bottom line is that memory management for direct buffers is not very robust, meaning that your program will typically consume much more memory than it should.

Comments (4)

M
Matthew Willson 2013-01-11

Just reading this and https://github.com/mikiobra..., interesting stuff.

Do you know if anyone's using GetPrimitiveArrayCritical (http://docs.oracle.com/java... )  for matrix maths, and if the situation has changed at all with recent JVMs? Seems this could get around the copying overhead on JVMs which support it, although has some scary-sounding caveats attached.

M
mikiobraun 2013-01-15

 Hi Matthew,

no, no idea whether that has changed. But I don't think so, as it is clearly not one of the highest priorities. One thing to keep in mind is that JNI is actually an API specification (meaning that it defines what the different JVM vendors should implement), so there are very few guarantees as to the actual behavior. Internally, a JVM may do whatever it likes if it thinks it's useful, for example splitting arrays into multiple parts or storing them in alternative formats. The JNI specification is intentionally vague to leave some room for this.

I also don't think interaction with native code is of the highest priority.

I've searched around a bit to find some information on how exactly the JVM stores primitive arrays, but didn't really find anything.

At least the situation is better than on .NET. I recently reviewed a master's thesis by a student who wrote a matrix library for C#. Under .NET, you have two kinds of heaps, and the one for large objects does no compacting garbage collection (for performance reasons, as they say). What happens is that your memory quickly becomes so fragmented that you run out of space. The student devised, in essence, his own memory management with a pool to reuse matrices. At least that's something you don't have to deal with on the JVM ;)
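The pooling idea mentioned there can be sketched in a few lines (this is my own hypothetical illustration, not the student's actual code): recycle backing arrays by size instead of letting the allocator churn through fresh ones.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

public class MatrixPool {
    // Free lists of recyclable backing arrays, keyed by array length.
    private final Map<Integer, ArrayDeque<double[]>> free = new HashMap<>();

    // Hand out a recycled array of the right size if one exists,
    // otherwise allocate a fresh one.
    public double[] acquire(int size) {
        ArrayDeque<double[]> q = free.get(size);
        return (q == null || q.isEmpty()) ? new double[size] : q.pop();
    }

    // Return an array to the pool so a later acquire can reuse it.
    public void release(double[] a) {
        free.computeIfAbsent(a.length, k -> new ArrayDeque<>()).push(a);
    }

    public static void main(String[] args) {
        MatrixPool pool = new MatrixPool();
        double[] a = pool.acquire(16);
        pool.release(a);
        System.out.println(a == pool.acquire(16));  // true: same array reused
    }
}
```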

-M

S
Sean O'Connor 2013-11-22

I didn't have any problems copying arrays using JNI. If I remember correctly, I just got a pointer to the array data from the JNI C method and used an assembly-language copy routine to move the data to a 16-byte-aligned array for SSE processing. You should be aware that method invocation in Java is very slow, while the code inside a method is executed very quickly. This is a problem with JNI because you cannot avoid method invocation by inlining the code. Using native methods makes better sense when you pass large arrays.

Also, a strong awareness of the size of the L1 and L2 caches on the processor is needed, whether in Java, C, or assembly language. Ideally you should process as much of the data as possible in L1-sized chunks. Cache awareness will lead to dramatic improvements in the speed of execution of your code.

I looked up my code and I see I used:
(*env)->GetFloatArrayRegion(env,inVec,0,inSize,vec);
to get the data and
(*env)->SetFloatArrayRegion(env,outVec,0,inSize,vec);
to set the data.

See also: http://www.agner.org/optimize/

S
Srinadh Tanniru 2016-06-30

Nice article. But, it is better to post with examples. In that we can understand in clear way. I trained in Java on online. In the practice they gave a example of matrix with area of rectangle in Java program. But, they didn't give much information on matrix concepts. Thanks for posting. Keep posting like this.

Back to all posts