Fast Matrix-matrix mulitplication: Kazushige Goto

If you ever have dealt with anything related to linear algebra, you most probably know that the matrix-matrix multiplication is the most basic operation of it all. Matrix-vector multiplication is just a special case, and almost anything else, from solving linear equations to computing eigenvalues and eigenvectors depends on matrix-matrix multiplication as their basic building blocks.

Well fortunately, matrix-matrix multiplication is quite easy to implement. In C its really just three for-loops

mmmult(double **a, double **b, double **c,
       int n, int m, int l) {
  for(int i = 0; i < n; i++) 
     for(int j = 0; j < m; j++)
        for(int k = 0; k < l; k++)
           a[i][j] += b[i][k] * c[k][j]
}

(If we omit the zeroing of the initial matrix)

Of course, the above code sucks, I even haven't used any pointers, nor declared variables as register. I mean, we are talking about C here, not one of those shiny new programming languages which want to get between us and the machine!!

Well, the truth is that all of your die hard C optimizing won't help in any way, and compared to implementations such as those of ATLAS, your code is slower by a factor of 10 or more.

The reason is that the above code is very suboptimal (to put it nicely) in terms of locality. If the matrix is too large for the L1 cache, or even the L2 cache, you will constantly have to fetch your data from the (still) slow main memory.

So what any decent matrix-matrix multiplication algorithm does is that it subdivides the matrix into patches which fit into the caches and then handles these sub-matrices one after the other. The good news is that this is absolutely no problem for matrix-matrix multiplication. If you look at the formula, you just have to cumulate certain products to compute one entry of the result matrix, but the order of how you enumerate these products is completely up to you.

This is what ATLAS and every other linear algebra library do, and the results are quite impressive. They manage to keep the pipeline of the processor so full that they come close to the theoretical limit, which amounts to 1.5~2 double floating point operations per clock cycle (!). The "AT" in ATLAS stands for "Automatically Tuned", which means that ATLAS experimentally estimates the size of the caches, and also the optimal order floating point operations.

But caring for the L1 and L2 caches is not enough. Kazushige Goto of University of Texas in Austin has written the currently fastest implementation of the basic linear algebra routines, and he has optimized them by hand.

He says that optimizing for the caches in nice, but you really have to take care of the Translation Lookaside Buffer. If your course on processor technology has been some time ago, the TLB is the cache for the page descriptors necessary to resolve virtual memory lookups.

Now that's what I call commitment ;)

React to this post