MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Machine Learning and Data Sets

I’ve been busy taking care of my 11-month-old daughter lately, which leaves almost no time for anything as remotely useful as posting on my blog - not that I posted any more often when I was still working full time. At the same time, you get a lot of ideas and potentially interesting insights now that your brain has time to idle now and then, for example while picking up toys that have been thrown to the ground again and again.

Anyway, I lately came to think that machine learners as a whole should devote much more time to working with actual data, in particular those who think of themselves as “method guys” (yes, this also includes me). It usually works like this: You have some technique you really like a lot and you use it to extend an already existing method until you come up with something you think is really neat. It may have some interesting properties other algorithms don’t have, and you really would like to write a paper on it.

But then the problem starts, because in order to show that your extension is actually useful, you have to show that it makes a practical difference. So you go around your group asking colleagues whether they have or know of some interesting data set. We call this the “have method, need data” phenomenon.

Of course, if you had started with a concrete application in mind, you would never have to ask yourself “oh, this is great, but what is it good for?”

Also, in machine learning, the formally defined problems we have are very abstract (like minimizing the expected risk from i.i.d. drawn data points), and many of the actual challenges are only “defined” by specific data sets.

Anyway, Data Wrangling has recently posted a huge list of links to data sets on the web, which is certainly an interesting starting point.

And yes, if you already have your method, you might also find some interesting “real world application” there.

Steve Yegge

I recently stumbled upon Steve Yegge’s blog (in the actual meaning of the phrase, not using the website). Actually it was via some page on emacs lisp which I’m unable to find again now. In any case, his Tour de Babel, a tour of programming languages, is quite funny, as are most of his other posts. If you have some time to kill, I can heartily recommend his blog. Also, don’t miss his talk at OSCON 2007.

Why Ruby > Python

Lately, I’ve been hacking around a bit with ruby and I must confess, I’ve started to like it quite a lot. In particular, there are some things which I like much better than in python. And no, this won’t be all about enforcing nice syntax.

So here are the things I consider a big win:

  • Extendable standard classes: Is there some nice feature for strings which the ruby developers missed? You can just add it to the String class. No need to derive your own special MyString class or anything.

  • Blocks: Okay, it took a while to get used to this, but once you have understood how they work, blocks allow you to extend the ruby language with new syntactic constructions (well, mostly loops, but you can also use them for resource management: just pass out a handle to an object and take care of the proper cleanup afterwards; ruby’s open can be used like this). Also, I find blocks much cleaner than iterators, since the whole loop logic is encoded in a single place (instead of being split over two or three functions).

  • Function calls without parentheses: Again, you can write new functions and really extend the ruby syntax. Paired with introspection, you can write quite powerful class modifiers and call them in a clean, simple syntax. This is heavily used by rails, for example.
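
To make the three points above a bit more concrete, here is a minimal sketch of each. All names in it (shout, Resource, with_resource, Validates, required_fields, Person) are made up for illustration and are not part of ruby or rails; the block helper only imitates the style of ruby’s open, and the class modifier only imitates the style of the rails macros.

```ruby
# 1. Extendable standard classes: reopen String and add a method.
#    `shout` is a hypothetical helper, not part of the standard library.
class String
  def shout
    upcase + "!"
  end
end

# 2. Blocks for resource management, in the style of File.open.
#    Resource is a stand-in for a file handle or connection.
class Resource
  attr_reader :closed, :log

  def initialize
    @closed = false
    @log = []
  end

  def write(line)
    @log << line
  end

  def close
    @closed = true
  end
end

# Hands the resource to the block and cleans up afterwards,
# even if the block raises an exception.
def with_resource
  resource = Resource.new
  begin
    yield resource
  ensure
    resource.close
  end
  resource
end

# 3. Calls without parentheses plus introspection: a class modifier
#    that generates one "<name>?" method per listed field.
module Validates
  def required_fields(*names)
    names.each do |name|
      # define_method adds an instance method to the extending class.
      define_method("#{name}?") { !send(name).nil? }
    end
  end
end

class Person
  extend Validates
  attr_accessor :name, :email
  required_fields :name, :email   # reads like built-in syntax
end

# Usage
puts "hello".shout
res = with_resource { |r| r.write("some data") }
puts res.closed
person = Person.new
person.name = "Ada"
puts person.name?
puts person.email?
```

Note that the whole open-use-close logic of the second example lives in one method, which is the point made about blocks above.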

Okay, for me the biggest problem with ruby is the lack of a large numeric and matrix library. Python with its scipy is clearly ahead in this respect. There exist ruby bindings for the GNU Scientific Library, but it lacks the lapack functions, which means that, for example, the eigenvalue functions are not really fast.

That, and that ruby is reported to be somewhat slower than python. But maybe that changes with the next version, which will include a virtual machine…