MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Machine Learning and Composability

One thing I find quite peculiar about machine learning research is that so much of it is about constantly finding new algorithms for a small set of well-known problems.

For example, if you consider supervised learning, there are the old statistically inspired linear methods, artificial neural networks, decision trees, support vector machines, and Gaussian processes. The list goes on and on.

Of course, each of the above approaches has its own set of advantages and disadvantages, such that there is no clear winner. And each of those algorithms also stands for a certain school of thought, a certain meta-framework for constructing learning algorithms. John Langford has a nice list of such approaches on his blog.

The reason why this strikes me as peculiar is that computer science in general places a much larger emphasis on finding new kinds of abstractions that lead to reusable components on which others can build. Powerful new abstractions allow you to do more complex things with less code, and might ultimately even completely change the way you approach a certain problem.

Everything from numerical libraries, GUI frameworks, web frameworks, and collection libraries to file systems and concurrency libraries provides a certain abstraction for some functionality, relieving the programmer of the burden of reinventing the wheel.

For some reason, this is not happening in machine learning, at least not to the extent it happens in general computer science. As a consequence, our overall power to deal with complex problems has not increased significantly, and you often find that you have to design a learning system more or less from scratch for a new problem. So instead of composing a new learner from several components, you might have to start by modifying an existing cost function and implementing a specialized optimizer, or by picking a certain graphical model and coming up with appropriate approximations to make learning tractable.

However, there might be a reason why machine learning research simply isn’t as composable as the rest of computer science:

  • ML is actually a well-defined subset of computer science, such that there really are only a small number of problems to solve within the domain of ML. The same is probably true of other fields like optimization or solving differential equations.

  • It might be hard to design learning algorithms in terms of well-defined, loosely coupled components because they deal with inherently noisy data. Controlling the propagation of error might be difficult, such that it is hard to find good abstractions which are widely applicable independently of their context.

  • ML is also a lot about inventing new approaches to designing learning algorithms. Inspirations come from a number of places, like statistics, physics, or biology. These are also abstractions, but not at the level of a piece of code, rather at a meta-level.

  • ML is quite complex and the right abstractions haven’t been found yet.

There also exist some examples of machine learning methods which build on other algorithms: boosting, for instance, turns any weak learner into a stronger one, and kernel methods can be plugged into any algorithm that can be expressed in terms of inner products.

In summary, I think that finding new abstractions is important because it gives you more power to build complex things while keeping the mental complexity at the same level. And to me it seems that this potential has not yet been fully explored in machine learning.

What I have learned from Twimpact

One thing I’ve learned from working on Twimpact is that in connected, distributed code, you need to take error recovery seriously and make sure that your program does not just crash.

Until we embarked on our little Twitter experiment called twimpact, most of my research was on core machine learning. I rarely left the confines of Matlab; most of my data was either generated on the spot or resided in text files somewhere on my computer. The worst that could happen was that I ran out of memory or out of time. In either case, there was little I could do about it.

As a consequence, error recovery wasn’t very far up on my list of things to take care of. It probably wasn’t even on there. As soon as the program crashed I had to go in there anyway.

However, once we started to crawl Twitter for retweets, I learned very quickly that you cannot deal with errors in a crawler like this. The crawler ran 24 hours a day, and most errors happened when I wasn’t in the office. Also, Twitter’s search infrastructure, which we used for our crawling in the beginning, was extremely brittle. Depending on which server served your request, the request might just time out, or return totally outdated information. Basically, every possible way in which an HTTP request could fail eventually turned up.

Twitter has gotten a lot more stable since then and the streaming API we’ve recently switched to is also very robust, but still, every once in a while, something happens.

So here is my piece of advice, which is probably common knowledge for everyone working with networking code: when your code talks to other servers, you have to guard yourself against random failure of practically any component. Note that this is not restricted to hardware failures. More often, servers are taken down for maintenance in order to deploy a new version, leading to connection resets and a short but noticeable downtime.

To make that a bit more concrete, a few tips:

  1. Catch the relevant exceptions and look out for error conditions. This may sound pretty trivial, but sometimes it is a bit hard to find a good spot from which to restart the computation. If you’re brave, you can catch everything and put an infinite loop around your main code, but you would probably want to deal with internal errors differently (or not, depending on the application).

  2. Have a meaningful log. You need to watch out for errors even if you have recovered successfully. Be careful, as log files can become rather large.

  3. If the remote server is failing, a common strategy is just to wait for a short period of time and retry. Twitter recommends exponential backoff: you start with a short sleep interval and double it until you reach a certain threshold. This way, you make sure that the server won’t come under severe stress once it is back online.
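Putting these tips together, a retry loop with exponential backoff might look like the following in Ruby. This is only a sketch: with_backoff and MAX_SLEEP are made-up names, and real code would probably want to treat internal bugs differently from network failures instead of rescuing everything.

```ruby
require 'logger'

# MAX_SLEEP is a made-up constant: never sleep longer than four minutes
MAX_SLEEP = 240

# Runs the given block, retrying on failure with exponentially growing
# sleep intervals (tip 3), and logging every failure (tip 2).
def with_backoff(log, initial_sleep = 1, max_sleep = MAX_SLEEP)
  sleep_time = initial_sleep
  begin
    yield
  rescue StandardError => e
    # log even recovered errors so you can spot trouble later
    log.warn("request failed (#{e.class}: #{e.message}), " \
             "retrying in #{sleep_time}s")
    sleep(sleep_time)
    # double the interval, but cap it at the threshold
    sleep_time = [sleep_time * 2, max_sleep].min
    retry
  end
end
```

You would then wrap your network call, for example with_backoff(Logger.new($stderr)) { fetch_retweets }, where fetch_retweets stands in for whatever request your crawler makes.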

JRuby, JDBC, and DBI

When you connect to a database from JRuby, you usually go through something like ActiveRecord, part of the Ruby on Rails framework. However, if you just want to issue a single SQL command, this might be much too heavyweight, in particular if you take the start-up time into account.

It turns out that you can very conveniently “hit the database” with dbi, Ruby’s Direct Database Interface. However, documentation on how to do this exactly is pretty scarce.

The following code connects to the ‘foobar’ PostgreSQL database on localhost using the user ‘foobar’ and password ‘quux’:

# gems you need
#
# dbi
# dbd-jdbc
# jdbc-postgres
 
require 'java'
 
require 'rubygems'
require 'dbi'
require 'dbd/Jdbc'
require 'jdbc/postgres'
 
# connect to database 'foobar'
# with user 'foobar' and password 'quux'
DBI.connect('DBI:Jdbc:postgresql://localhost/foobar',
            'foobar', 'quux',
            'driver' => 'org.postgresql.Driver') do |dbh|
  puts "Connected"
  
end

(I also put this code down in a gist)

The actual complexity comes from the fact that you need three different gems: dbi for the database access, dbd-jdbc as a JDBC driver for dbi, and jdbc-postgres, the actual JDBC driver for PostgreSQL packaged as a Ruby gem. Ah, and then you also need to know the name of the driver class.

In order to install these, you type

jruby -S gem install dbi dbd-jdbc jdbc-postgres

Once you have connected to the database, you can issue normal SQL commands; see the excellent DBI tutorial. For example, you could count all entries of the table “phone_numbers” with

dbh.select_one "SELECT count(*) FROM phone_numbers"
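If you issue such queries in several places, it can be handy to wrap them in a small helper. A sketch (count_rows is a made-up name, and dbh is assumed to be the handle yielded by the DBI.connect block above):

```ruby
# Hypothetical helper: counts the rows of a table via an open DBI handle.
def count_rows(dbh, table)
  # select_one returns the first result row as an array;
  # the count is its first (and only) column
  row = dbh.select_one("SELECT count(*) FROM #{table}")
  row[0].to_i
end
```

Inside the connect block you would then call count_rows(dbh, 'phone_numbers'). Note that table names cannot be passed as placeholders, which is why the name is interpolated directly here; only use this with table names you control.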

Finally, here is how to connect to a MySQL database:

DBI.connect("DBI:Jdbc:mysql://localhost/foobar",
            'foobar', 'quux',
            'driver' => 'com.mysql.jdbc.Driver')

You also need to install the JDBC driver by installing the jdbc-mysql gem. The jruby-extras project contains more drivers for other databases as well.