Thursday, April 30, 2009

Machine Learning Twibe

Twibes is some new twitter-related website which manages topic-related groups of people and collects tweets based on up to three tags.

Since apparently nobody has done so yet, I set up a twibe on machine learning. Follow the link and click on "Join" on the right-hand side to join.

Currently, the group is picking up tweets with either "machine learning", "#machlearn", or "#machine-learning" in them. Anyone got an idea how to improve the tags?

Monday, April 06, 2009

Some Benchmark Numbers for jblas

Over the last few days I've been putting together a benchmarking tool for jblas, which I hope to release to the public as soon as I've figured out how best to package that beast.

It compares my own matrix library, jblas, against Matrix Toolkits for Java (MTJ), Colt, as well as simple implementations in C and Java, and, of course, ATLAS.

All experiments were run on my laptop, which sports an Intel Core 2 CPU (T7200) running at 2 GHz. Maybe the most remarkable feature of this CPU (and many other Intel CPUs as well) is the rather large second-level cache of 4 MB, which will come in handy as we'll see below.

Some more information on the plots: C+ATLAS amounts to taking C for vector addition and ATLAS for matrix-vector and matrix-matrix multiplication. Shown is the number of theoretical floating point operations. For example, adding two vectors with n elements requires n additions, and multiplying a matrix with n squared elements by a vector requires 2n^2 additions and multiplications. Now, I know that there exist matrix multiplication algorithms which require fewer than a cubic number of operations, but I've just stuck to the "naive" number of operations, as that is what ATLAS implements, as far as I know.
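To make the operation counts concrete, here is a minimal sketch in Java. The class and method names are made up for illustration and are not part of the benchmarking tool. The 2n^3 count for the matrix-matrix case is an assumption that follows the same "naive" convention as the 2n^2 count above, and the MFLOPS helper assumes that throughput is obtained by dividing the theoretical count by the measured wall time.

```java
// Hypothetical helper, not part of jblas or its benchmark tool.
final class FlopCount {
    private FlopCount() {}

    /** Adding two vectors with n elements: n additions. */
    static long vectorAdd(long n) {
        return n;
    }

    /** n-by-n matrix times vector: about n^2 multiplications plus n^2 additions. */
    static long matrixVector(long n) {
        return 2 * n * n;
    }

    /** Naive n-by-n matrix product (assumed convention): n^3 multiplications plus n^3 additions. */
    static long matrixMatrix(long n) {
        return 2 * n * n * n;
    }

    /** Turns a theoretical operation count and a measured time into MFLOPS. */
    static double mflops(long ops, double seconds) {
        return ops / seconds / 1e6;
    }

    public static void main(String[] args) {
        long n = 1000;
        System.out.println("vector add:    " + vectorAdd(n));
        System.out.println("matrix-vector: " + matrixVector(n));
        System.out.println("matrix-matrix: " + matrixMatrix(n));
    }
}
```

Since the counts are purely theoretical, an implementation that uses a smarter algorithm would simply show up as having a higher apparent throughput.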

Vector Addition



In vector addition the task is simply to add all elements of one vector to all elements of the other vector (in-place). What is interesting about vector addition is that there is really little you can do to optimize the flow of information: You just have to go through the whole vector once. Basically, all methods are on par, with the exception of Colt and ATLAS (!!). No idea what they did wrong, though.

You can also very nicely see the L1 and L2 cache edges resulting in the two steps. On CPUs with smaller L2 cache (like, for example, most AMD CPUs), the shoulder is much less pronounced.
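For reference, the in-place addition described above boils down to a single pass over both arrays. The following is a minimal sketch of such a naive Java version, not the actual benchmark code; the class and method names are invented for illustration.

```java
// Hypothetical sketch of a naive in-place vector addition.
final class VectorAdd {
    private VectorAdd() {}

    /** Adds y to x element by element, overwriting x. */
    static void addInPlace(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        // One pass over the data: one load from each vector, one add, one store.
        for (int i = 0; i < x.length; i++) {
            x[i] += y[i];
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {10.0, 20.0, 30.0};
        addInPlace(x, y);
        System.out.println(java.util.Arrays.toString(x));
    }
}
```

Each element is touched exactly once, so the loop is limited by how fast data arrives from cache or main memory rather than by the arithmetic.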

Matrix-Vector Multiplication



Matrix-vector multiplication can be thought of as adding up the columns of a matrix given the weights of the vector. Similar to vector addition, you basically have to go through the whole data once, and there is little you can do about it. However, if you are clever, you can at least make sure that the input and output vectors stay in cache, which leads to better throughput.

Here, ATLAS is faster than the naive implementations by roughly 50%. jblas also uses a naive implementation. The reason is that Java needs to copy an array when you pass it to native code, and since you basically have to go through the matrix once, you lose more time copying the matrix than simply doing the computation in Java itself.
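The "adding up weighted columns" view can be sketched in plain Java as follows. This is an illustration only, not the code jblas uses; it assumes the matrix is stored column by column in a flat array, and the class and method names are invented.

```java
// Hypothetical sketch: y = A * x as a weighted sum of the columns of A.
final class MatVec {
    private MatVec() {}

    /**
     * @param a    matrix entries in column-major order (assumed layout)
     * @param rows number of rows of A
     * @param cols number of columns of A
     * @param x    vector with cols entries
     * @return     vector with rows entries
     */
    static double[] multiply(double[] a, int rows, int cols, double[] x) {
        if (a.length != rows * cols || x.length != cols) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] y = new double[rows];
        for (int j = 0; j < cols; j++) {
            double weight = x[j];
            int offset = j * rows;
            // Add column j, scaled by x[j], to the result.
            for (int i = 0; i < rows; i++) {
                y[i] += weight * a[offset + i];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // A = [[1, 2], [3, 4]] stored column by column.
        double[] a = {1.0, 3.0, 2.0, 4.0};
        double[] x = {1.0, 1.0};
        System.out.println(java.util.Arrays.toString(multiply(a, 2, 2, x)));
    }
}
```

The matrix is read once from start to finish with stride one, while the much smaller result vector is revisited for every column and therefore has a good chance of staying in cache.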

Matrix-Matrix Multiplication



In matrix-matrix multiplication the ratio of memory movements to operations is so good that you can practically reach the theoretical maximum throughput on modern memory architectures - if you arrange your operations accordingly. ATLAS really outshines the other implementations here, getting to almost two (double precision!) floating point operations per clock cycle.
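For comparison, a naive Java matrix-matrix multiplication might look like the sketch below. It is only an illustration of the straightforward 2n^3 approach, with invented names and an assumed column-major layout; it does not attempt the blocking and reordering that libraries like ATLAS use to keep data in cache.

```java
// Hypothetical sketch of a naive n-by-n matrix product C = A * B.
final class MatMul {
    private MatMul() {}

    /** All matrices are n-by-n and stored column by column in flat arrays (assumed layout). */
    static double[] multiply(double[] a, double[] b, int n) {
        if (a.length != n * n || b.length != n * n) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] c = new double[n * n];
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                double bkj = b[j * n + k];
                // Column j of C accumulates column k of A scaled by B(k, j).
                for (int i = 0; i < n; i++) {
                    c[j * n + i] += a[k * n + i] * bkj;
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 3.0, 2.0, 4.0}; // [[1, 2], [3, 4]]
        double[] b = {5.0, 7.0, 6.0, 8.0}; // [[5, 6], [7, 8]]
        System.out.println(java.util.Arrays.toString(multiply(a, b, 2)));
    }
}
```

The loop order keeps the innermost loop running with stride one, but for large n each column of A is still reloaded from memory n times, which is exactly the kind of traffic a tuned implementation avoids.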

Luckily, copying the n squared many floats can be asymptotically neglected compared to the n cubed many operations, meaning that it pays off to use ATLAS from Java. The performance of jblas is very close to ATLAS, and becomes even closer for large matrices.
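A quick back-of-the-envelope sketch shows why the copy matters for matrix-vector but not for matrix-matrix multiplication. It assumes that every operand and the result are copied exactly once and uses the theoretical operation counts from above; the names are invented and the numbers say nothing about actual copy speed.

```java
// Hypothetical sketch: floating point operations gained per array element copied.
final class CopyOverhead {
    private CopyOverhead() {}

    /** Matrix-vector: 2n^2 operations for n^2 + 2n copied elements (matrix, input, output). */
    static double matVecOpsPerCopiedElement(long n) {
        return (2.0 * n * n) / ((double) n * n + 2.0 * n);
    }

    /** Matrix-matrix: 2n^3 operations for 3n^2 copied elements (A, B and the result). */
    static double matMulOpsPerCopiedElement(long n) {
        return (2.0 * n * n * n) / (3.0 * n * n);
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.println(n + ": " + matVecOpsPerCopiedElement(n)
                    + " vs " + matMulOpsPerCopiedElement(n));
        }
    }
}
```

Under these assumptions the matrix-vector ratio never exceeds two operations per copied element, whereas the matrix-matrix ratio grows linearly with n, so the copy becomes negligible for large matrices.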

Summary

The bottom line is that I'm glad jblas is pretty fast, certainly faster than Colt and on par with MTJ except for matrix-matrix multiplication. I should add that I used the Java-only versions of Colt and MTJ. It should be possible to optionally link against the ATLAS versions, although Colt does not integrate these methods as well as jblas, and I'm not sure whether MTJ does that.

Also, in my opinion jblas is easier to use because it only uses one matrix type, not distinct types for vectors and matrices. I've tried that at first, too, but then often ended up in situations where a computation results in a row or column vector which I then had to artificially cast to a vector so that I could go on.

Of course, MTJ and Colt are currently more feature-rich, supporting, for example, sparse matrices or banded storage schemes, but if you want to have something simple which performs well right now, why not take jblas ;)

Friday, January 16, 2009

Science and the Market Metaphor

John Langford had an interesting post on what he calls the "adversarial viewpoint" on academia. Basically, the argument is that under this viewpoint, you assume that scientists in academia compete over a fixed set of resources (research money, positions, students) and that therefore all the other scientists are your adversaries. He suspects that this might be one of the reasons behind the decline in reviewing quality he has observed in recent years at conferences such as NIPS.

John argues that the adversarial viewpoint might make sense, but that it is actually bad for science, because scientists are more focused on rejecting other papers, projects or ideas than on being open to new developments.

I'm not sure whether the variable quality of NIPS reviews is really due to the enormous popularity of the NIPS conference and the load this puts on the program committee and the reviewers, or whether it is because people are actively destructive about their peers' work.

But I think this leads to an interesting question about the environment in which academia exists, why it is the way it is, and how it could be changed to lead to different viewpoints. Because if it really is a zero-sum game, then it is not surprising that those who want to play the game successfully adopt an adversarial viewpoint.

A Simplified History of Public Funding for Scientific Research

I'm really not an expert in the history of science, but I guess it's not completely wrong to say that the way science is embedded in and supported by society changed dramatically in the 20th century. Beginning with industrialization, and in particular during World War II, it became apparent that having a productive scientific community is absolutely vital, both for the overall growth of your economy and for national security. For example, the National Science Foundation was created specifically after World War II with that goal in mind.

Naturally, if it was that important, the government had to take control and set up ways to maximize scientific productivity (because it had to make sure that the taxpayers' money is used well). While in historic times scientists were selected by personal preference and paid by some ruler to work at his court, managing science and setting up the rules of the game more and more became a responsibility of politicians.

Applying the Market Metaphor to Science

The problem was of course that science had never been organized on this level, in particular not by non-scientists. The question basically was: How can we maximize the scientific output from a fixed amount of resources? I think this is a very important question, and I'm certain the optimal answer hasn't been found yet. Looking at how science is organized today, it seems that policymakers resorted to transplanting a well-known metaphor, that of a free market, where scientific ideas and research plans compete over grant money, slots in publications, and positions.

You can find this idea in many different aspects. For example, a grant call is like a customer expressing a certain need, and then companies (that is, scientists) can compete for that money and the one who offers the best research for that money will get it. I'm not saying that this is not a good way to select who to fund, but grant calls are used as a device to control the direction in which science progresses, and the question is whether this ensures the overall progress of science.

Another example is the way in which scientific output is measured by citation counts. A scientist (=company) produces a scientific publication (=goods). Such publications are then put on show in journals (=stores) where other scientists can cite them (=buy them). The productivity of a scientist is then measured by the amount of goods sold, where the quality of the store factors into the price paid by other scientists.

Science is not Economy

I'm not saying that this system does not work at all, but science and scientific research in particular have properties which conflict with the economic setup.

For one, as in art, there is a notion of quality for scientific work which is somewhat independent of whether it competes well in the market. For example, something might be a brilliant piece of work but have only very little intersection with what other people are working on right now. Or it is not what the funding agencies are focusing on right now. I think every scientist has at least once experienced the conflict between what he considers good scientific work and what he has to do to get grant money. Or, put differently, if everyone just played the game (publishing papers, getting grant money, basically securing their position in the field), would that alone ensure scientific progress?

Moreover, what gets published in journals is mainly managed through the peer review process. Translated to the economy, this means that your competitors have a lot of say in whether your products will actually see the market. Imagine that before Apple sells its new laptops, the store first asks Microsoft, Dell, and HP what they think of the laptops. It is clear that it's hard to do this differently in science, because you need a lot of expertise to judge whether a paper is worth publishing, which cannot be done by the store owner alone, but still, this setup introduces a lot of interaction not present in a truly free market.

In science, significant progress often comes from sidetracks. While most people are working on extending and applying a certain scheme, new approaches are often found elsewhere and take some time before they can enter the mainstream. However, a mass market (and given the number of scientists today, it certainly is a mass market) tends to produce products for the masses, and it is unclear whether a remote idea could really get enough support to work.

Science as a whole is progressing, I guess, but I believe this is partly because people manage to play the game and do the research which matters to them at the same time.

A Way Out?

I have to disappoint you right away, because I do not know the solution. But I think actually seeing the difference between a free market and science is important, and I hope it will make you think.

Others have been braver in this respect. For example, people have thought about how to allocate money in a way which prevents us from just feeding the "mass market" and also allows small independent research projects. Lee Smolin suggests an alternative way to distribute grant money in his book "The Trouble with Physics". Siegfried Bär in his book Forschen auf Deutsch (Research in German) also suggests how to improve the way research money is distributed in Germany. I won't go into detail here, but both researchers think that the whole proposal writing business just takes up too much time, and that the process should become much more flexible so that more researchers have time to actually do research, and on the topics they are interested in. Part of the money should even be spent on ideas which really don't seem that relevant (but only go to scientists who have otherwise proven not to be crackpots).

If you agree in principle that citation count is a good measure of scientific progress, and you believe in the market, then the problem remains that the scientific publication culture is different from a real market, because your competitors can veto whether customers see your product at all. The question boils down to how to improve the reviewing process. Marcus Hutter has archived an email discussion from 2001 on his homepage about what alternatives there are to the existing review process. John Langford suggests also using preprint servers like the arXiv to get a time-stamp for your work, since you cannot be sure when you will manage to get it published.

I think people have naturally been thinking about improving the review process because in the age of the Internet, this is actually something we as scientists can actively control (as opposed to controlling funding policies). The whole system already depends on unpaid volunteers, so we should have enough manpower to run any other system as well if it gets enough support.

I'd like to repeat the idea of Geoffrey Hinton from the above email discussion. He proposed a system where people put endorsements for papers on their homepages, together with a trust network you define for yourself. You register other scientists whose opinions you trust when it comes to which papers are worth reading. In 2001, the setup was personal websites and a tool, but nowadays, you would certainly turn this into some Web 2.0 application. CiteULike seems to go in that direction, although the focus is currently more on organizing what papers you have read.

In essence, the goal is to make the path from company to customer much shorter, and in particular, to lessen the impact of your competition on whether your customer can buy your products, that is, cite your papers, or not.

Conclusion

So in summary, I think the framework within which academia lives is not altogether bad, but there is always room for improvement. Currently, the market metaphor is often applied blindly without taking into account the peculiarities of scientific research, or of the scientific community as a whole. The perception that academia is basically a zero-sum game, as voiced by John Langford, is directly based on the idea that science is a competition over fixed resources. As I have pointed out, the main difference is that science is also a bit like art, to the extent that it has its own internal notion of quality and soundness which cannot be easily grasped or measured in terms of economic concepts. If we could manage to integrate these different aspects of science, we might eventually be able to find better ways to run academia.

Update (Jan 30, 2009): I found an interesting blog post by Michael Nielsen. His basic argument is that we not only need new ways of exchanging, archiving, and searching existing knowledge, but also a radical change of culture, potentially backed by new online tools. For example, he argues that it would be highly advantageous if scientists could easily post problems they are stuck on and quickly find other scientists who are experts on those problems. However, people might only be willing to do this if such contributions were tracked the same way peer-reviewed publications are.

Interestingly, I see some parallels between these ideas and the way we have been setting up mloss.org and the Open Source Software Track at JMLR. We have provided both the tool, and a means to make publishing open source software accountable under the old metrics - peer-reviewed publications.