Friday, March 12, 2010

What I have learned from Twimpact

One thing I’ve learned from working on Twimpact is that in connected, distributed code, you need to take error recovery seriously and make sure that your program does not just crash.

Until we embarked on our little Twitter experiment called Twimpact, most of my research was on core machine learning. I rarely left the confines of MATLAB, and most of my data was either generated on the spot or resided in text files somewhere on my computer. The worst that could happen was that I ran out of memory or out of time. In both cases, there was little I could do about it.

As a consequence, error recovery wasn’t very far up on my list of things to take care of. It probably wasn’t even on there. As soon as the program crashed I had to go in there anyway.

However, once we started to crawl Twitter for retweets, I learned very quickly that you couldn’t deal with errors in the crawler like this. The crawler ran 24 hours a day, and most errors happened when I wasn’t in the office. Also, Twitter’s search infrastructure, which we used for our crawling in the beginning, was extremely brittle. Depending on which server served your request, the request might just time out or return totally outdated information. Basically, every possible way in which an HTTP request could fail eventually turned up.

Twitter has gotten a lot more stable since then and the streaming API we’ve recently switched to is also very robust, but still, every once in a while, something happens.

So here is my piece of advice, which is probably common knowledge for everyone working with networking code: When your code talks to other servers, you have to guard yourself against random failure of practically any component. Note that this is not restricted to hardware failures. More often, servers are taken down for maintenance in order to deploy a new version, leading to connection resets and a short but noticeable downtime.

To make that a bit more concrete, a few tips:

  1. Catch the relevant exceptions and look out for error conditions. This may sound pretty trivial, but sometimes it is a bit hard to find a good spot from which you can restart the computation. If you’re brave, you can catch everything and put an infinite loop around your main code, but you would probably want to deal with internal errors differently (or not, depending on the application).

  2. Have a meaningful log. You need to watch out for errors even if you have recovered successfully. Be careful, as log files can become rather large.

  3. If the remote server is failing, a common strategy is just to wait for a short period of time and retry. Twitter recommends a geometric backing off. You start with a short sleep interval and double it until you reach a certain threshold. This way, you make sure that the server won’t be under severe stress once it is back online.
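
The three tips above can be combined in a few lines of Ruby. This is only a minimal sketch, not the code Twimpact actually runs: the helper name `with_backoff`, its parameters, and the particular list of exceptions treated as "the remote side failed" are assumptions made for illustration. It retries the block on network-style errors, logs every failure, and doubles the sleep interval up to a cap. Other exceptions (internal errors) are deliberately not caught.

```ruby
require 'logger'
require 'timeout'

# Hypothetical helper: run the block and retry it on network-style errors.
# The sleep interval starts at `initial` seconds and doubles after every
# failure until it reaches `max`. `sleeper` is injectable so the waiting
# can be replaced in tests.
def with_backoff(initial: 1.0, max: 240.0,
                 logger: Logger.new($stderr),
                 sleeper: Kernel.method(:sleep))
  delay = initial
  begin
    yield
  rescue IOError, SystemCallError, Timeout::Error => e
    # Tip 2: log the failure even though we are going to recover from it.
    logger.warn("#{e.class}: #{e.message}; retrying in #{delay}s")
    # Tip 3: wait, then double the interval up to the threshold.
    sleeper.call(delay)
    delay = [delay * 2, max].min
    retry
  end
end
```

You would wrap the part of the crawler that talks to the remote server, for example `with_backoff { fetch_next_batch }`, where `fetch_next_batch` stands for whatever method issues the request.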

Posted by Mikio L. Braun at March 12, 2010, 16:30.

Wednesday, March 10, 2010

JRuby, JDBC, and DBI

When you connect to a database with JRuby, you usually go through something like ActiveRecord, part of the Ruby on Rails framework. However, if you just want to issue a single SQL command, this might be much too complex, in particular if you take the start-up time into account.

It turns out that you can very conveniently “hit the database” with dbi, Ruby’s Direct Database Interface. However, documentation on how exactly to do this is pretty scarce.

The following code connects to the ‘foobar’ PostgreSQL database on localhost using the user ‘foobar’ and password ‘quux’:

# gems you need
#
# dbi
# dbd-jdbc
# jdbc-postgres
 
require 'java'
 
require 'rubygems'
require 'dbi'
require 'dbd/Jdbc'
require 'jdbc/postgres'
 
# connect to database 'foobar'
# with user 'foobar' and password 'quux'
DBI.connect('DBI:Jdbc:postgresql://localhost/foobar',
            'foobar', 'quux',
            'driver' => 'org.postgresql.Driver') do |dbh|
  puts "Connected"
  
end

(I also put this code down in a gist)

The complexity comes from the fact that you need three different gems: dbi for the database access, dbd-jdbc as a JDBC driver for dbi, and jdbc-postgres, the actual JDBC driver for PostgreSQL packaged as a Ruby gem. Ah, and then you also need to know the name of the driver class.

In order to install these, you type

jruby -S gem install dbi dbd-jdbc jdbc-postgres

Once you have connected to the database, you can issue normal SQL commands (see the excellent DBI tutorial). For example, you could count all entries of the table “phone_numbers” with

dbh.select_one "SELECT count(*) FROM phone_numbers"
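
If you count rows in more than one table, the call can be wrapped in a small method. The following is a sketch under assumptions: `count_rows` is a made-up helper, not part of DBI, and it assumes that `select_one` returns a row whose first column holds the count. Since table names cannot be passed as bind parameters, the helper only accepts plain identifiers before putting them into the SQL string.

```ruby
# Hypothetical helper, not part of DBI: count the rows of one table.
# `dbh` is the database handle yielded by DBI.connect.
def count_rows(dbh, table)
  # Reject anything that is not a plain identifier, because the name
  # is interpolated directly into the SQL statement.
  unless table =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/
    raise ArgumentError, "invalid table name: #{table.inspect}"
  end
  row = dbh.select_one("SELECT count(*) FROM #{table}")
  row[0].to_i
end
```

Inside the `DBI.connect` block you would then write `puts count_rows(dbh, 'phone_numbers')`.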

Finally, here is how to connect to a MySQL database:

DBI.connect("DBI:Jdbc:mysql://localhost/foobar",
                'foobar', 'quux',
                'driver' => 'com.mysql.jdbc.Driver')

You also need the MySQL JDBC driver, which you get by installing the jdbc-mysql gem. The jruby-extras project contains drivers for other databases as well.
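
Since the PostgreSQL and MySQL cases differ only in the URL scheme and the driver class, you could keep both in a small lookup table. This is just a sketch: `JDBC_DRIVERS` and `dbi_connect_args` are names invented here, and the table contains only the two driver classes mentioned in this post.

```ruby
# Hypothetical lookup table: URL scheme => JDBC driver class,
# restricted to the two databases used in this post.
JDBC_DRIVERS = {
  'postgresql' => 'org.postgresql.Driver',
  'mysql'      => 'com.mysql.jdbc.Driver'
}.freeze

# Hypothetical helper: build the argument list for DBI.connect.
def dbi_connect_args(kind, database, user, password, host = 'localhost')
  driver = JDBC_DRIVERS.fetch(kind) do
    raise ArgumentError, "no JDBC driver known for #{kind.inspect}"
  end
  ["DBI:Jdbc:#{kind}://#{host}/#{database}", user, password,
   { 'driver' => driver }]
end
```

The result can be passed on with a splat, as in `DBI.connect(*dbi_connect_args('mysql', 'foobar', 'foobar', 'quux'))`. You still have to require the matching jdbc gem yourself.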

Posted by Mikio L. Braun at March 10, 2010, 17:05.

Tuesday, March 09, 2010

All Shiny and New

I’ve finally found the time and level of frustration with Blogger to move my blog to a new platform. Actually, it’s not so much a platform as a little script called Jekyll, which generates my blog as a set of static pages.

Blogger was nice to begin with, but the edit window was always too small, and having to write your posts in HTML felt so 1990s. Jekyll, on the other hand, lets you use one of a number of different wiki-style markups, which just feels so much better.

As a little extra, I installed jsMath, which allows me to typeset real LaTeX like this $f(x) = \sum_{n=1}^\infty x^n$.

I just couldn’t migrate all the comments, but given that there was only a little discussion anyway, that’s probably not much of a problem. The old blog can still be found over at blogspot, just in case.

Posted by Mikio L. Braun at March 9, 2010, 12:24.
