So what is it about? In a nutshell, we try to speed up cross-validation by starting with subsamples of the data and identifying quickly parameter configurations which are clearly suboptimal. Learning on subsets is of course much faster so ideally you’ll save a lot of time because you will only have a handfull of parameter candidates left on the full data set.
The method is based on the sequential analysis framework which deals with the problem of statistical hypothesis testing when the sample size isn’t fixed.
The main problem one faces is that the performance of parameter configurations changes significantly as the sample size increases. For a fixed parameter configuration (say a kernel width and a regularization parameter for an SVM), it is clear that the error converges, and usually becomes smaller as the number of samples increases. However, if one compares two configurations, one can often observe that one configuration is better for small sample sizes, while the other becomes better later on. This phenomenon is linked to the complexity of the model associated with a parameter choice. General speaking, more complex models require more data to fit correctly and will overfit on too few data points.
Our method accounts for this effect by adjusting the statistical tests to maximize the number of failures before a configuration is removed from the set of active configurations. Nevertheless, fast cross-validation is faster by a factor of 50-100 on our benchmark data sets.
Update June 13, 2012: We’ve just submitted a journal version to JMLR. You can get the preprint from arxiv
Posted by Mikio L. Braun at 2011-12-20 14:58:00 +0100blog comments powered by Disqus