MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Quo vadis?

Actually, I’ve been thinking a bit about what to do with this blog. So far, I’ve kept it sort of semi-private by not linking to it from my work homepage, maybe because I didn’t want to worry about whether it is official enough, or whether it reflects my scientific interests well enough. Consequently (and maybe also because I don’t have anything interesting to say), almost nobody is reading my blog, and even friends are sometimes surprised to hear that I have one.

I’ve thought about what I could do to make my blog more relevant and also more interesting, you know, concentrate on a few topics to give people a better idea of what to expect, and maybe also to give them a reason to come back and actually follow the blog.

Maybe the most interesting revelation was that the topics I cover in my blog are quite different from my scientific work, which is often quite heavy on the theory, and less concerned with the things I apparently like to blog about: programming languages, nice tools, and the occasional insight into some technology twist.

So I guess I’ll try to accept these different aspects of my interests, and refrain from attempts to streamline my web presence.

This may be somewhat unrelated, but I also chose to rename the blog from “Trunk of Heroes” (whatever that was supposed to mean) to “Marginally Interesting”, which is a nice long phrase that conveys just as little information. At least now I can say funny things like “My blog is Marginally Interesting” :)

Anyhow, the semester is over, which means that I’ll have more time to do some research and - of course - to learn some exciting new piece of technology.

JRuby 1.1.3 and Jython 2.5

Just for the record, JRuby 1.1.3 has been released. Startup time is again down a bit, but not by much. All in all, I think they are doing a terrific job. On a related note, the Jython project has also released an alpha version which is going to be compatible with Python 2.5. The last bigger Jython release is already a bit old and was compatible only with Python 2.2.

On the other hand, I find it harder and harder to choose between the two languages. Somehow they seem to fill almost the same spot, and it is mostly a question of community whether you’re more into Ruby or Python. In any case, if you’re looking for nice integration with Java, you’re soon going to have both alternatives, which is a good thing, I guess.

Data Mangling Basics

When you’re new to machine learning, there usually comes a point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn’t come in matrices, real data has missing values, and so on. Real data is usually just not fit to be directly digested by your favorite machine learning method (and if it is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, Matlab, but the results you get just aren’t that good. If you’re lucky, there is some senior guy in your lab you can ask for help, and he will usually preprocess the data in ways you’ve never heard of in class.

Actually, I think this kind of preprocessing should be taught in class, even when there is no systematic methodology behind it and no fancy theorems to prove. So here is my (by no means exhaustive) list of useful preprocessing steps.

  • Take subsets. You might find this pretty obvious, but I’ve seen students debugging their methods directly on the full 10,000-instance data set often enough to doubt that. So when you have a new data set and you want to quickly try out several methods, take a random subset of the data until your method handles it in seconds, not minutes or hours (there is a small sketch of this, together with the next point, after the list). The complexity of most data sets is also such that you usually get reasonably close to the achievable accuracy with a few hundred examples.

  • Plot histograms, look at scatter plots. Even when you have high-dimensional data, it can be very informative to look at histograms of individual coordinates, or at scatter plots of pairs of coordinates. This already tells you a lot about the data: What is its range? Is it discrete or continuous? Which directions are highly correlated, and which are not? And so on. Again, this might seem pretty obvious, but often students just run the prescribed method without looking at the data first.

  • Center and normalize. While most method papers make you think that the method “just works”, in reality you often have to do some preprocessing to make the method work well. One such step is to center your data and normalize it to unit variance in each direction (see the sketch after the list). Don’t forget to save the offsets and normalization factors: you will need them to process the features correctly at prediction time!

  • Take Logarithms. Sometimes you have to take the logarithm of your data, in particular when the range of the data is extremely large and the density of the data points decreases as the values become larger (sketched below). Interestingly, many of our own senses work on a logarithmic scale, for example hearing and vision: loudness is measured in decibels, which is a logarithmic scale.

  • Remove Irrelevant Features. Some kernels are particularly sensitive to irrelevant features. If you take a Gaussian kernel, for example, each feature that is irrelevant for prediction increases the number of data points you need to predict well, because for practically every realization of the irrelevant feature you need additional data points in order to learn well.

  • Plot Principal Values. Often, many of your features are highly correlated. For example, the height and weight of a person are usually quite correlated. A large number of correlated features means that the effective dimensionality of your data set is much smaller than the number of features. If you plot the principal values (the eigenvalues of the sample covariance matrix $C^T C$, where $C$ is the centered data matrix), you will usually notice that there are only a few large values, meaning that a low-dimensional subspace already contains most of the variance of your data (see the sketch after this list). Also note that the projection onto those directions is the best low-dimensional approximation with respect to the squared error, so using this information you can transform your problem into one with fewer features.

  • Plot Scalar Products of the output variable with Eigenvectors of the Kernel Matrix. While principal values only tell you something about the variance of the input features, in a supervised learning setting you can plot the scalar products between the eigenvectors of the kernel matrix and the output variable to see how many (kernel) principal components you need to capture the relevant information about the learning problem (a small sketch of this follows after the list). The general shape of these coefficients can also tell you whether your kernel makes sense or not. Watch out for an upcoming JMLR paper for more details.

  • Compare with k-nearest neighbors. Finally, before you apply your favorite method to the data set, try something really simple like k-nearest neighbors (sketched below). This gives you a good idea of what kind of accuracy you can expect.
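
To make the first two points concrete, here is a minimal Python sketch. The data, the correlation between two features, and the choice of coordinates to plot are all just placeholders; you would load your own data instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for real data: 10000 examples, 5 features (replace with your own loading code).
rng = np.random.RandomState(0)
X = rng.randn(10000, 5)
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]   # make two coordinates correlated

# Take a random subset so quick experiments run in seconds, not minutes.
subset = rng.permutation(X.shape[0])[:500]
Xs = X[subset]

# Histogram of one coordinate and a scatter plot of a pair of coordinates.
plt.figure(); plt.hist(Xs[:, 0], bins=30); plt.title("histogram of feature 0")
plt.figure(); plt.scatter(Xs[:, 0], Xs[:, 1], s=5); plt.title("feature 0 vs. feature 1")
plt.show()
```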
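
Centering and normalizing, with the offsets and factors kept around for prediction time, might look like this. It is a rough sketch, not tied to any particular toolbox; the toy training and test data are only there to make it runnable.

```python
import numpy as np

def center_and_normalize(X, eps=1e-12):
    """Standardize X and return the offsets and scales needed at prediction time."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std < eps, 1.0, std)   # leave constant features alone
    return (X - mean) / std, mean, std

# Training time: compute and store mean and std.
Xtrain = np.random.RandomState(0).randn(100, 3) * [1.0, 10.0, 100.0]
Xtrain_std, mean, std = center_and_normalize(Xtrain)

# Prediction time: reuse the *same* mean and std on new data.
Xtest = np.random.RandomState(1).randn(10, 3) * [1.0, 10.0, 100.0]
Xtest_std = (Xtest - mean) / std
```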
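
For the logarithm trick, something like log1p works even when the data contains zeros; the heavy-tailed synthetic data here is just for illustration.

```python
import numpy as np

# Synthetic heavy-tailed, non-negative data (think counts, durations, incomes).
rng = np.random.RandomState(0)
X = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 3))

# log1p = log(1 + x): compresses the huge values, is safe at zero,
# and keeps small values almost unchanged.
X_log = np.log1p(X)
print("before:", X.max(), "after:", X_log.max())
```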
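
Plotting the principal values is just an eigenvalue computation on the centered data matrix. In this sketch the toy data only has two underlying degrees of freedom, so you should see two large eigenvalues and the rest close to zero.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data: 10 observed features, but only 2 underlying directions plus noise.
rng = np.random.RandomState(0)
X = rng.randn(500, 2).dot(rng.randn(2, 10)) + 0.1 * rng.randn(500, 10)

C = X - X.mean(axis=0)                            # centered data matrix
eigvals = np.linalg.eigvalsh(C.T.dot(C) / len(C)) # ascending order
eigvals = eigvals[::-1]                           # largest first

plt.plot(eigvals, "o-")
plt.title("principal values")
plt.show()
```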
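
And here is one way to look at the scalar products between the output variable and the eigenvectors of the kernel matrix. The Gaussian kernel, the kernel width, and the toy regression problem are all assumptions made for the sake of the example.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(X, width=1.0):
    # Squared distances between all pairs of points, then the Gaussian kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.dot(X.T)
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy regression problem: a noisy sine.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

K = gaussian_kernel(X, width=1.0)
_, eigvecs = np.linalg.eigh(K)      # eigenvectors in columns, ascending eigenvalues
eigvecs = eigvecs[:, ::-1]          # largest eigenvalue first

coeffs = eigvecs.T.dot(y)           # scalar products <u_i, y>
plt.plot(np.abs(coeffs), "o-")
plt.title("|<eigenvector of K, y>| per kernel principal component")
plt.show()
```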
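
Finally, a plain k-nearest-neighbor baseline is only a few lines with NumPy. This sketch does classification with Euclidean distances; the value of k and the toy data are arbitrary.

```python
import numpy as np

def knn_predict(Xtrain, ytrain, Xtest, k=5):
    """Plain k-nearest-neighbor classification with Euclidean distances."""
    sq_train = np.sum(Xtrain ** 2, axis=1)
    sq_test = np.sum(Xtest ** 2, axis=1)
    d2 = sq_test[:, None] + sq_train[None, :] - 2.0 * Xtest.dot(Xtrain.T)
    neighbors = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest points
    votes = ytrain[neighbors]                       # labels of the k neighbors
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy example: two Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 3.0])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]
pred = knn_predict(X[train], y[train], X[test], k=5)
print("baseline accuracy:", np.mean(pred == y[test]))
```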

So in summary, there are a lot of things you can do with your data before you plug it into your learning method. If you do it correctly, you will learn quite a lot about the nature of the data, and you will have transformed it so that you can learn more robustly.

I’d be interested to know what initial procedures you apply to your data, so feel free to add some comments.