Machine Learning and Data Sets

I've been busy taking care of my 11-month-old daughter lately, which leaves almost no time for something as remotely useful as posting on my blog - not that I posted any more often when I was still working full time. At the same time, you get a lot of ideas and potentially interesting insights now that your brain has time to idle every now and then, for example while picking up toys that have been thrown to the ground again and again.

Anyway, I have lately come to think that machine learners as a whole should devote much more time to working with actual data, in particular those who think of themselves as "method guys" (yes, this also includes me). It usually works like this: You have some technique you really like, and you use it to extend an existing method until you come up with something you think is really neat. It may have some interesting properties other algorithms don't have, and you would really like to write a paper on it.

But then the problem starts, because in order to show that your extension is actually useful, you have to show that it makes a difference in practice. So you go around your group asking colleagues whether they have or know of some interesting data set. We call this the "have method, need data" phenomenon.

Of course, if you had started with a concrete application in mind, you would never have to ask yourself "oh, this is great, but what is it good for?"

Also, in machine learning, the formally defined problems we have are very abstract (like minimizing the expected risk given i.i.d. data points), and many of the actual challenges are really only "defined" by specific data sets.

Anyway, Data Wrangling has recently posted a huge list of links to data sets on the web, which is certainly an interesting starting point.

And yes, if you already have your method, you might also find some interesting "real world application" there.
