Fast Cross Validation Via Sequential Analysis - Talk
These are the slides to our talk (joint work with Tammo Krüger and Danny Panknin) at the BigLearning workshop at NIPS 2011. You can also have a look at the paper and the appendix.
So what is it about? In a nutshell, we try to speed up cross-validation by starting with subsamples of the data and identifying quickly parameter configurations which are clearly suboptimal. Learning on subsets is of course much faster so ideally you’ll save a lot of time because you will only have a handfull of parameter candidates left on the full data set.
The method is based on the sequential analysis framework which deals with the problem of statistical hypothesis testing when the sample size isn’t fixed.
The main problem one faces is that the performance of parameter configurations changes significantly as the sample size increases. For a fixed parameter configuration (say a kernel width and a regularization parameter for an SVM), it is clear that the error converges, and usually becomes smaller as the number of samples increases. However, if one compares two configurations, one can often observe that one configuration is better for small sample sizes, while the other becomes better later on. This phenomenon is linked to the complexity of the model associated with a parameter choice. General speaking, more complex models require more data to fit correctly and will overfit on too few data points.
Our method accounts for this effect by adjusting the statistical tests to maximize the number of failures before a configuration is removed from the set of active configurations. Nevertheless, fast cross-validation is faster by a factor of 50-100 on our benchmark data sets.
Update June 13, 2012: We’ve just submitted a journal version to JMLR. You can get the preprint from arxiv
Apparently, the discussions about “Scala being too complex” are heating up, mostly due to a leaked email from one of Yammer’s programmers to the Scala people where he discusses some of his experiences he’s had with using Scala in a production environment, and the other being a post on HN comparing Scala to Perl in the sense that both languages have too much flexibility in solving a specific task leading to a mix of different programming paradigms and styles which will make you code harder to read and maintain.
Now we’ve been using Scala as our main programming language for the last two and half years for TWIMPACT, so I know what people are talking about. And the truth is, it is all true, sadly. On the one hand, Scala is a pretty awesome programming language which is very nicely designed. I’ve said this before, but normally you will eventually come across some feature of a programming language which is not designed well and you have to code your way around it, but I’ve yet to come about something like it in Scala.
On the other hand, it is also true that some of the libraries are not as fast as they should be. Although I like the idea of immutable collections a lot, every time I need performance, I’d rather put in a Java collection. Also, it’s true that the collection library is pretty complex. It all kind of makes sense to get a clean design of the classes, but it’s pretty complicated with all those classes like Seq, SeqLike, Traversable, TraversableOnce, etc. However, you’ll probably only need to know all the details if you want to write your own collections which integrate seamlessly with the existing collection classes.
It’s also true that upgrading to a new version is hard. For some reason, many libraries seem to be quite deeply interlocked with the Scala version. While our own code never had to be changed if Scala went to a new version, this wasn’t true for most libraries, unfortunately, meaning that you have to wait till all the libraries have been upgraded to the new version before you can do the update yourself. And frankly, I don’t see why this is necessary.
We’ve never bothered with sbt, but directly went for maven due to it’s better integration in most IDE’s. We’re using IntelliJ IDEA whose Scala plugin has come a long way and gives pretty good support. There is also a lot to be improved in the basic tools like the compiler or the shell in terms of startup time. Scala seems to preload several megabytes of jar files on startup, probably in an attempt at optimization, but in the end, it only means that starting Scala takes anywhere between 5 - 10 seconds which is really a lot if you’re working on the shell (and every other language starts up almost immediately) The guys behind JRuby have invested a lot of time to cut down on the startup time, and that was time well spent.
People are also often attacking Scala for it’s complexity. While it’s certainly true that it’s easier to hire some Java expert than someone who knows Scala, IMHO Scala is a big improvement in many ways over Java, which feels overly verbose once you’ve learned Scala. As with every language, there are more basic concepts and more advanced concepts and usually, you don’t have to master them all from the start. Also, people often argue as if the complexity about learning a programming language is all in the programming language, but you also have to consider the standard libraries and tools. For example, while the Java programming language is relatively simple in terms of concepts, the standard tools and frameworks are pretty intimidating to learn (all that XML, Maven, Spring, etc.)
Then people are also complaining about the community, which is supposedly not helpful enough, or too fragmented, or only consists of crazy people who are just thinking about how to implement everything in terms of category theory. I don’t think that is true. Scala is still young, and the community can still grow. We’ve uncovered a number of bugs (mostly Leo who has a knack for finding bugs in libraries) and people were mostly as responsive as you’d expect them to be. One of the strengths of Scala is also that it is quite painless to reuse existing Java projects (as any other programming language for the JVM). I never found it that repulsive as some seem to use a Java library from Scala. The integration is quite painless, and if you really have to, you can add a bit of syntactic sugar on your side for the stuff you need most.
Finally, I really don’t get the argument of people who are saying “Scala is too complex, I switched to Python (or some other scripting language)”. To me, these are completely different sets of programming languages. While it’s true that there are some applications like writing medium sized web sites which you can nowadays do in either a scripting language or a compiled language, there are many applications where Python (or any other scripting language) just can’t compete. In scripting languages, it’s hard to add primitive data types which are really fast unless someone else already took care of implementing the most computing-intensive routines in C.
So in summary, Scala is both awesome and awful, just like almost every piece of sufficiently advanced technology. You can work with Scala, and it’s a lot of fun, or you can reject it for a number of reasons, just acknowledge the complexity and don’t give in to hypes and marketing.
Updates: Coda Hale’s comment on the leaked email, Yammer’s official statement on Scala, Typesafe’s post on their committment to the industry.
Analyzing social media has become quite popular. People have been predicting box office openings based on Twitter chatter, studied information diffusion patterns, information flows between classes of users, how real-world events like earthquakes are reflected in Twitter.
This is all pretty exciting and interesting, but there are also a few things where there is still room for improvement.
There is very little stuff on real-time analysis. Many papers boast with the hundreds of millions of tweets (and the access to Twitter’s firehose necessary to get that amount of data) which have formed the basis for the paper. However, many papers later introduce some more or less arbitrary ways of truncating the data, for example by taking a number of “most active users”. This is both true for Jure Leskovec’s paper as well as the Yahoo research’s paper.
However, I think that getting to real-time is extremely important, because you cannot just wait for days or longer to get your analysis. By that time, more data will have been streaming in, and when are you going to analyze that data?
Another problem with many of the analyses is that they focus on the positive cases only. Meaning that they develop some method to detect bursts or trends and then use some famous real-world example (like Japan winning the women’s soccer championship) to show that the method is triggered by the data. However, few publications go so far as to validate their method on negative examples as well, showing that the method not only detect trends well, but also does so robustly with few false positives.
A classical example is the highly cited 2003 paper by Jon Kleinberg “Bursty and Hierarchical Structure from Streams” which explains how to detect areas of higher than usual activity, for example, from email streams. But then, the paper shows how the detected structure coincides with real deadlines for two examples without discussing negative examples in depth.
Many methods also seem to believe that an analysis which is based on hundreds of millions of data points is automatically true in general. While this is certainly true for simple statistics which you can estimate well, there are other methods which can overfit. And for those, as many other disciplines like bioinformatics have had to learn the hard way, as you get more data, the probability that you find some evidence for your hypothesis increases drastically.
To get reliable results, you need to follow the same rules as when validating the performance of a machine learning algorithm: Test on data which is disjoint from training data. If your method detects trends, check it on data which you believe has no structure. If you aggregate topics, check it on days when nothing special was happening. If you analyze the structure of the data, check on an independent sample (ideally from a period of time which is a bit removed from the original sample).
That way you might have less data available, but your results will improve a lot in terms of reliability.