Fast Cross Validation
Fast Cross Validation Via Sequential Analysis - Talk
These are the slides to our talk (joint work with Tammo Krüger and Danny Panknin) at the BigLearning workshop at NIPS 2011. You can also have a look at the paper and the appendix.
So what is it about? In a nutshell, we try to speed up cross-validation by starting with subsamples of the data and quickly identifying parameter configurations that are clearly suboptimal. Learning on subsets is of course much faster, so ideally you save a lot of time because only a handful of parameter candidates are left to evaluate on the full data set.
The method is based on the sequential analysis framework, which deals with statistical hypothesis testing when the sample size is not fixed in advance.
The main problem one faces is that the performance of parameter configurations changes significantly as the sample size increases. For a fixed parameter configuration (say, a kernel width and a regularization parameter for an SVM), it is clear that the error converges, and usually becomes smaller, as the number of samples increases. However, if one compares two configurations, one can often observe that one configuration is better for small sample sizes, while the other becomes better later on. This phenomenon is linked to the complexity of the model associated with a parameter choice. Generally speaking, more complex models require more data to fit correctly and will overfit on too few data points.
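To illustrate the effect, here is a toy sketch (not from the paper; the data set and the two parameter settings are made up for illustration) comparing the cross-validated error of a simple and a complex SVM configuration on growing subsamples:

```python
# Toy illustration (not from the paper): cross-validated error of a simple
# and a complex SVM configuration on growing subsamples. The data set and
# the parameter values are made up for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

configs = {
    "simple (wide kernel, strong regularization)": SVC(C=0.1, gamma=0.01),
    "complex (narrow kernel, weak regularization)": SVC(C=100.0, gamma=1.0),
}

for n in (100, 250, 500, 1000, 2000):
    errors = {name: 1.0 - cross_val_score(clf, X[:n], y[:n], cv=5).mean()
              for name, clf in configs.items()}
    print(n, errors)
# Typically the simpler configuration wins on small subsamples, while the
# more complex one catches up (or overtakes it) as n grows.
```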
Our method accounts for this effect by adjusting the statistical tests to maximize the number of times a configuration may fail before it is removed from the set of active configurations. Even with this conservative setup, fast cross-validation is faster by a factor of 50-100 on our benchmark data sets.
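To give a rough idea of the overall procedure, here is a minimal sketch; the margin-based elimination rule and the names `fast_cv_sketch` and `margin` are my own placeholders, standing in for the adjusted sequential tests described in the paper:

```python
# Minimal sketch of the overall idea, NOT the exact procedure from the paper:
# evaluate all candidates on growing subsamples and drop those that are
# clearly worse than the current best. The simple margin-based drop rule is
# a placeholder for the adjusted sequential tests described in the paper.
from sklearn.model_selection import cross_val_score

def fast_cv_sketch(configs, X, y, subsample_sizes, margin=0.05, cv=5):
    active = dict(configs)  # candidate configurations still in the race
    for n in subsample_sizes:
        Xn, yn = X[:n], y[:n]
        errors = {name: 1.0 - cross_val_score(clf, Xn, yn, cv=cv).mean()
                  for name, clf in active.items()}
        best = min(errors.values())
        # Keep only candidates close to the current best; the real method
        # lets a configuration fail several times before it is eliminated.
        active = {name: clf for name, clf in active.items()
                  if errors[name] <= best + margin}
        if len(active) == 1:
            break
    return active  # cross-validate the survivors on the full data set as usual
```

The few surviving candidates are then cross-validated on the full data set in the usual way.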
Update June 13, 2012: We've just submitted a journal version to JMLR. You can get the preprint from arXiv.
Comments (6)
Interesting! You say "speed up cross-validation", I hear "try more candidates for cross-validation in the same time" ;-) The cost of including a broader set of candidates should be lower than a denser set of candidates, right? Assuming the extra candidates can be excluded quickly with higher likelihood in a broader set. I suppose in practice this isn't something you'd need often, unless you're *really* flying blind with respect to optimal parameter ranges.
Hi Isaac,
how are you?
Yes, you are right, it can go both ways, either in terms of less time spent or more (broader/denser, whatever) sets of candidates.
Whether you need this or not in practice depends a lot on your application, of course. If the data doesn't change that much, you probably already have quite a good guess what the right parameter range is. However, getting that kind of insight takes quite some time in itself, and you probably still want to fine-tune over this range with cross-validation, so our method becomes interesting again.
Then there are also applications where you have to retrain some predictor on data which varies a lot. For example, in one project we're using fitted Q-iteration to learn in a reinforcement learning setting. There, you have to iteratively learn a regression problem which depends on what you've learned before, so you really don't have much insight into how your problem is going to change. Being able to start with a pretty broad range of parameters is crucial there.
Also, for high-dimensional data it might be quite difficult to easily guess what the right parameter range is.
Anyway, we're preparing an R package based on the method, and later on probably also code for Matlab, etc., so that people can use the method out of the box.
-M
The use case I had in mind was a more "just works out-of-the-box" approach - if the machine learning researcher and the person who wants to do the learning are not the same person, being able to throw a very wide net could be useful.
I suppose if one wanted to - and there was a reasonable assumption that the generalization error over the parameter space wasn't *too* discontinuous - one could replace candidates that are rejected with candidates chosen from the space around the better candidates. An evolutionary approach, where parameters are combined with probability proportional to how much better they are than the other options, might be interesting. The same stopping criterion, or stopping once the candidate set is "stable", should even work there. (Or one could do some sort of simulated annealing on the "new" parameters. Or only replace a rejected candidate with a new one with probability 1/i. Or or or...)
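A rough sketch of that replacement step (purely hypothetical, not part of the paper; it assumes the parameters live in some continuous space such as log C and log gamma, and `respawn` is just an illustrative name):

```python
# Purely hypothetical sketch of the replacement step suggested above:
# sample parents among the surviving candidates with probability proportional
# to how much better they are than the worst survivor, then perturb them.
import numpy as np

def respawn(survivors, errors, n_new, scale=0.5, seed=0):
    # survivors: (k, d) array of parameter vectors (e.g. log C, log gamma)
    # errors:    (k,) array of their estimated errors
    rng = np.random.default_rng(seed)
    weights = errors.max() - errors + 1e-12   # better candidates get more weight
    parents = survivors[rng.choice(len(survivors), size=n_new,
                                   p=weights / weights.sum())]
    return parents + rng.normal(scale=scale, size=parents.shape)
```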
Looks very cool & useful at first sight. Will have a closer look at the paper.
Hello. Sorry, this is unrelated.
Any chance I could find the source code for your Stability-based validation 2004 paper?
I'm trying to test it on some accelerometer data clusterings.
Thanks for your time. Best regards
Vp
Hi Vp,
sorry, no, I don't think I still have that code. It probably also wouldn't be of much use, as you have to plug in some supervised learner and that depends on your setup. The hardest part is implementing the Hungarian method, but you can find it, for example, in the book "Combinatorial Optimization" by Papadimitriou.
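For what it's worth, a quick sketch of the label-matching step (not the original code; it uses SciPy's linear_sum_assignment for the Hungarian method, and the function name is just for illustration):

```python
# Not the original code from the 2004 paper, just a sketch of the
# label-matching step: the Hungarian method is available today as
# scipy.optimize.linear_sum_assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(labels_a, labels_b, k):
    # cost[i, j] = disagreements when cluster i of `a` is mapped to cluster j of `b`
    cost = np.array([[np.sum((labels_a == i) & (labels_b != j))
                      for j in range(k)] for i in range(k)])
    row, col = linear_sum_assignment(cost)  # optimal one-to-one matching
    return dict(zip(row, col))              # map: cluster in `a` -> cluster in `b`
```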