Finally I find some time to recompile the ATLAS libraries for jblas. As it has turned out, the libraries which shipped with jblas 1.2.0 were 30% to 50% slower than what shipped in earlier versions. This still made jblas faster than pure Java implementations, but there are still a lot of FLOPS missing.
To solve this problem once and for all, I did some exhaustive benchmarkings of compiling and running ATLAS on different platforms. I started with 64 bit Linux and the SSE3 settings. I used the following two processors:
Above are the results for compiling on the i7 or on the Opteron and running on the i7 or the Opteron processor. The numbers shown are relative MFLOPS for large matrix-matrix multiplications (100 would mean 1 floating point operation per clock cycle). The best setting is shown in green, those above 90% of that in orange, and those below in red. Numbers which are missing didn’t compile or run.
As you can see, it’s really not that easy to pick the best configuration. It’s not even the case that the architecture presets for the given platform produce the best results.
If you build on the Core i7, the presets Core2Solo, Core2, and UNKNOWNx86 lead to good performances both on the Core i7 and the Opteron. If you build on the Opteron, there is no preset which gives good results on both processors.
So I think I’ll settle on the Core2Solo which will give the following results:
To really get the best possible performance, you’d need to use the best possible implementation, but that would mean to inflate the jar even more.
I’m note sure which road to take here. GotoBlas2 has recently been released under a BSD licence and also shows very promising results, which would further increase the number of shared libraries to put in the jar file.
Another option would be to have a central repository with all kinds of libraries, and then you would need to download and install the right one the first time you run jblas. Would that be too bad?
Posted by Mikio Braun at 2011-04-28 12:01:00 +0000
blog comments powered by Disqus