I’m very happy to announce that my latest paper just came out in the Journal of Machine Learning Research. It is actually the second half of the work which developed out of my Ph.D. thesis. The other paper came out almost two years ago, which tells you a bit about how long reviews can take if you’re unlucky.
Anyway, from a theoretical point of view, the first paper studies eigenvalues of the kernel matrix while the second one studies eigenvectors, and both derive approximation bounds between these finite-sample quantities and the ones you’d get as the number of training points tends to infinity. These bounds have the important property that they scale with the eigenvalue under consideration. You don’t have one bound for all eigenvalues, but instead a bound which becomes smaller as the eigenvalue becomes smaller. This means that you won’t have the same bound on eigenvalues of the order of $10^3$ and $10^{-6}$; instead, the error will be much smaller on the smaller eigenvalue. If you run numerical simulations, you will immediately see this, but actually proving it was a bit harder.
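If you want to see this effect yourself, here is a minimal numpy sketch (my own toy setup, not code from either paper): it draws standard Gaussian data, uses a Gaussian kernel of width 1, and compares the eigenvalues of the normalized kernel matrix $K/n$ for a small and a large sample, with the large sample standing in for the asymptotic limit. You will see that the absolute deviations shrink together with the eigenvalues themselves.

```python
# Toy illustration (my own, not code from the papers): the deviation between
# the eigenvalues of K/n at two sample sizes shrinks roughly with the size of
# the eigenvalue itself -- the scaling behaviour the bounds describe.
import numpy as np

def rbf_kernel_matrix(X, width=1.0):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)                    # guard against tiny negatives
    return np.exp(-d2 / (2.0 * width**2))

def normalized_eigenvalues(n, rng, dim=2):
    X = rng.standard_normal((n, dim))           # toy data: standard Gaussian
    K = rbf_kernel_matrix(X)
    lam = np.linalg.eigvalsh(K / n)             # eigenvalues of K/n
    return np.sort(lam)[::-1]                   # descending order

rng = np.random.default_rng(0)
lam_small = normalized_eigenvalues(200, rng)
lam_large = normalized_eigenvalues(2000, rng)   # proxy for the asymptotic limit

for i in range(10):
    diff = abs(lam_small[i] - lam_large[i])
    print(f"eig {i:2d}: {lam_large[i]:.2e}   |difference| = {diff:.2e}")
```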
What this practically means is a different story, and I personally think it is actually quite interesting (of course :)): If you have some supervised learning problem, and you take a kernel and start to train your support vector machine, then these results tell you that even if the kernel feature space might be very high-dimensional, the important information is contained in a low-dimensional subspace, which also happens to be spanned by the leading kernel PCA components.
In other words, when you choose the right kernel, you can do a PCA in feature space and just consider the subspace spanned by the directions of largest variance. In essence, if you choose the correct kernel, you are dealing with a learning problem in a finite-dimensional space, which also explains why these methods work so well.
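To give a concrete feel for this, here is a small scikit-learn sketch (again my own toy example; the dataset, the kernel width `gamma=2.0` and the choice of 10 components are arbitrary illustrative values, not anything from the paper): a plain linear classifier trained on the leading kernel PCA components ends up in the same ballpark as a full kernel SVM, because the relevant information already lives in that low-dimensional subspace.

```python
# Toy sketch: a linear model on a few leading kernel PCA components vs. a
# full kernel SVM. All parameter values below are ad-hoc choices for the demo.
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Full kernel SVM in the (high-dimensional) RBF feature space.
svm = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr)

# Kernel PCA onto a handful of leading components, then a plain linear model.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=2.0).fit(X_tr)
linear = LogisticRegression().fit(kpca.transform(X_tr), y_tr)

print("kernel SVM accuracy:        ", svm.score(X_te, y_te))
print("10 kPCA components + linear:", linear.score(kpca.transform(X_te), y_te))
```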
The “old” story was that as long as you use hypothesis classes with finite VC dimension, everything’s fine. This is still true, of course, but these new results also show why you can expect low-complexity hypothesis classes to work at all: because a good kernel transforms the data such that the important information is contained in a low-complexity subspace of the feature space.
So I hope I have managed to make you a bit interested in the paper. I’m trying to put together a bit more of an overview on my home page, so make sure to check that out as well a week from now.
Posted by at 2008-09-04 20:16:00 +0000