What I have learned from Twimpact

Friday, March 12, 2010

One thing I’ve learned from working on Twimpact is that in connected, distributed code, you need to take error recovery seriously and make sure that your program does not just crash.

Until we embarked on our little Twitter experiment called Twimpact, most of my research was on core machine learning. I rarely left the confines of MATLAB, and most of my data was either generated on the spot or resided in text files somewhere on my computer. The worst that could happen was that I ran out of memory or out of time. In both cases, there was little I could do about it.

As a consequence, error recovery wasn’t very far up on my list of things to take care of. It probably wasn’t even on there. Whenever the program crashed, I had to go in and sort things out by hand anyway.

However, once we started to crawl Twitter for retweets, I learned very quickly that you couldn’t deal with errors in the crawler like this. The crawler ran 24 hours a day, and most errors happened when I wasn’t in the office. Also, Twitter’s search infrastructure, which we used for our crawling in the beginning, was extremely brittle. Depending on which server handled your request, the request might just time out or return totally outdated information. Basically, every possible way in which an HTTP request could fail eventually turned up.

Twitter has gotten a lot more stable since then and the streaming API we’ve recently switched to is also very robust, but still, every once in a while, something happens.

So here is my piece of advice, which is probably common knowledge for everyone working with networking code: When your code talks to other servers, you have to guard yourself against random failure of practically any component. Note that this is not restricted to hardware failures. More often, servers are taken down for maintenance in order to deploy a new version, leading to connection resets and a short but noticeable downtime.

To make that a bit more concrete, a few tips:

  1. Catch the relevant exceptions and look out for error conditions. This may sound pretty trivial, but sometimes it is a bit hard to find a good spot from which you can restart the computation. If you’re brave, you can catch everything and put an infinite loop around your main code, but you would probably want to deal with internal errors differently (or not, depending on the application).

  2. Have a meaningful log. You need to watch out for errors even if you have recovered successfully. Be careful, as log files can become rather large.

  3. If the remote server is failing, a common strategy is just to wait for a short period of time and retry. Twitter recommends a geometric backoff: you start with a short sleep interval and double it after each failed attempt until you reach a certain threshold. This way, you make sure that the server won’t be under severe stress once it is back online.
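
The three tips can be combined into one small retry loop. The following Python sketch is only an illustration and not the actual Twimpact crawler: the names `crawl`, `fetch`, `handle`, `backoff_delays`, and `configure_logging`, the log file name, and the concrete numbers (1 second initial delay, 240 second cap, 10 MB log files) are all assumptions made for the example. It treats `OSError` (which covers connection resets and timeouts in Python 3) as a recoverable network problem and lets every other exception propagate as an internal error.

```python
import logging
import time
from logging.handlers import RotatingFileHandler

log = logging.getLogger("crawler")  # hypothetical logger name


def configure_logging(path="crawler.log"):
    """Tip 2: log to a size-capped, rotating file so the log cannot grow forever.

    The path and the size limits are assumptions for this sketch.
    """
    handler = RotatingFileHandler(path, maxBytes=10_000_000, backupCount=5)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)


def backoff_delays(initial=1.0, factor=2.0, maximum=240.0):
    """Tip 3: yield sleep intervals that grow geometrically up to a cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def crawl(fetch, handle, sleep=time.sleep):
    """Tip 1: loop forever, recovering from network errors only.

    fetch() performs one request and handle() processes its result; both are
    placeholders supplied by the caller. Anything that is not an OSError is
    treated as an internal error and is allowed to stop the loop.
    """
    delays = backoff_delays()
    failures = 0
    while True:
        try:
            handle(fetch())
        except OSError as exc:  # connection reset, timeout, DNS failure, ...
            failures += 1
            delay = next(delays)
            log.warning("request failed (%s); retrying in %.0f s", exc, delay)
            sleep(delay)
        else:
            if failures:
                # Log the recovery too, so past errors remain visible.
                log.info("recovered after %d failed attempts", failures)
            # A success resets the backoff to the short initial interval.
            delays, failures = backoff_delays(), 0
```

In a real crawler you would call `configure_logging()` once at startup and then `crawl(fetch, handle)` with your own request and processing functions. Passing `sleep` as a parameter is just a convenience that makes the loop testable without actually waiting.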

Posted by at 2010-03-12 16:30:00 +0100
