Machine learning can exist more or less in an abstract space, far away from any applications. I know what I’m talking about because I’ve spent a few years in that space: data is always vectorial, examples come in benchmark data sets you know by heart, and learning and prediction are always performed in batches and offline (and usually confined to some nested cross-validation loops).
On the other hand, once you start working with real applications you enter a totally different space. Data has to be acquired, stored, and potentially processed in real time. And you probably cannot do it in Matlab.
Twimpact is a very good example of the kinds of technologies you have to get used to in order to analyze some real data. Twimpact analyzes retweets on Twitter to do trending and impact analysis of users. It comes with a site where you can see live trends and browse around (have a look at the Japanese site to see a running example). You might have noticed that we’ve recently shut down the live trending at twimpact.com. The main reason was that the site became unbearably slow. After a year of constantly monitoring retweets from Twitter, our initial setup was not feasible anymore.
Initially, we started with a PostgreSQL database and a little backend written in JRuby based on Ruby on Rails. Later on, we rewrote part of the front end in Groovy using the Grails framework. The whole thing was hosted on a dedicated server with a fairly large four-disk RAID 5. In initial tests, the RAID produced some very impressive read and write rates of about 200 MB/s.
Now, a year later, the database and the RAID have become the biggest performance bottleneck. To understand why, you have to know that we’ve already analyzed several hundred million tweets, which puts a lot of pressure on the database, and we need to match each new tweet against the whole history to find the matching retweet.
Currently, we’re getting a few million more retweets per day, and the time necessary to match a new tweet to its retweet, update the statistics, and recompute the impact factors has become too large to keep up with the current rate. Part of the problem is also that we’re letting the database recompute the trend statistics using a single SQL query which already has to go through roughly a hundred thousand retweets for the hourly trends.
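To give an idea of what this looks like (just a sketch with made-up table and column names, not our actual schema), the hourly trend computation boils down to a single aggregate query over everything that happened in the last hour, run over JDBC:

    import java.sql.DriverManager

    // Illustrative sketch only: an hourly-trend query of the kind described
    // above, run over JDBC against a hypothetical table
    // retweets(retweet_id, created_at).
    object HourlyTrends {
      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost/twimpact", "twimpact", "secret")
        try {
          val stmt = conn.createStatement()
          // The database has to scan and aggregate every retweet of the last
          // hour in one go -- fine with a few thousand rows, painful with
          // hundreds of thousands.
          val rs = stmt.executeQuery(
            """SELECT retweet_id, COUNT(*) AS hits
              |FROM retweets
              |WHERE created_at > now() - interval '1 hour'
              |GROUP BY retweet_id
              |ORDER BY hits DESC
              |LIMIT 50""".stripMargin)
          while (rs.next())
            println(rs.getLong("retweet_id") + ": " + rs.getInt("hits"))
        } finally conn.close()
      }
    }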
What we end up with is several long-running SQL queries on a database which sees 20-30 new tweets (each of which generates about a dozen queries). I’ve never seen such system loads before.
At some point, it became pretty apparent to us that we needed an alternative, which led us to look at other solutions. We eventually settled on Cassandra, because it seems to have the most momentum right now. In case you don’t know, Cassandra is one of those newfangled NoSQL stores which loosen the requirements on consistency and transactions to gain better scalability and speed. These NoSQL stores come in all kinds of flavors, the main differences being whether they keep data in memory (like memcached or redis) or persist it to disk, and whether they store simple objects (basically byte arrays) or provide more structure (like MongoDB, CouchDB, or Cassandra).
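To make the difference concrete, here is a rough sketch of how the “match a new tweet against the whole history” step turns from a SQL scan into a simple key lookup in a key-value store. The plain in-memory Map stands in for the actual Cassandra client, and the key layout is made up for illustration, not our real schema:

    import scala.collection.mutable

    // Illustrative sketch only: matching a new tweet becomes a single key
    // lookup instead of a query over hundreds of millions of rows. The Map
    // is a stand-in for the actual Cassandra client; the normalization and
    // key layout are assumptions.
    object RetweetIndexSketch {
      // row key: a normalized form of the tweet text; value: the retweet id
      private val retweetIndex = mutable.Map[String, Long]()
      private var nextId = 0L

      def normalize(text: String): String =
        text.toLowerCase.replaceAll("^rt @\\w+:?\\s*", "").trim

      def matchOrCreate(tweetText: String): Long =
        retweetIndex.getOrElseUpdate(normalize(tweetText), { nextId += 1; nextId })
    }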
In addition, we also wanted to get away from the Java/JRuby/Groovy language mix, and settled on Scala, which seems pretty promising in terms of expressiveness and ease of integration with Java.
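For the curious, here is a tiny (and entirely made-up) example of what that integration looks like in practice: plain Java classes can be used from Scala without any wrapping, which makes it easier to migrate a mixed codebase piece by piece.

    import java.text.SimpleDateFormat
    import java.util.Date

    // Small illustration of Scala's Java interoperability: Java classes are
    // instantiated and called directly, while Scala's collections and
    // higher-order functions are available on top.
    object InteropExample {
      def main(args: Array[String]): Unit = {
        val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") // plain Java class
        val timestamps = List(new Date(0L), new Date())       // Scala list of Java objects
        timestamps.map(d => fmt.format(d)).foreach(println)
      }
    }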
Finally, as a last step, we started to look into some messaging middleware. The advantage of using messaging is that the different parts of the system become more independent and modular, so that you can restart individual components or add analysis modules on the fly.
In a certain way, our system already consisted of several independent processes which communicated quite implicitly over the PostgreSQL database, and anything which puts less work on the database seems fine to us. Also, at some point we might want to distribute Twimpact on a cluster, which is a lot easier when you already use some messaging infrastructure.
Currently, we’re looking into ActiveMQ as the main message broker, possibly together with Camel, a library which allows you to do more high-level routing, and Akka, a library for event-driven programming in Scala.
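To give a rough idea of what such a setup could look like (broker URL and queue name are made up, and this is just a sketch, not our actual code), one process would publish incoming retweets to an ActiveMQ queue over plain JMS, and any number of analysis modules could consume them independently:

    import javax.jms.{Session, TextMessage}
    import org.apache.activemq.ActiveMQConnectionFactory

    // Sketch of a producer: publishes incoming retweets to an ActiveMQ queue.
    // Broker URL and queue name are made up for illustration.
    object RetweetPublisher {
      def main(args: Array[String]): Unit = {
        val factory = new ActiveMQConnectionFactory("tcp://localhost:61616")
        val connection = factory.createConnection()
        connection.start()
        try {
          val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
          val queue    = session.createQueue("twimpact.retweets")
          val producer = session.createProducer(queue)

          // In the real system this would be the stream coming from Twitter.
          val msg: TextMessage = session.createTextMessage("RT @someone: hello world")
          producer.send(msg)
        } finally connection.close()
      }
    }

A consumer module would simply open a MessageConsumer on the same queue and process messages as they arrive, without knowing anything about the producer.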
In short, once you leave the confines of “pure” machine learning and want to build robust and scalable systems, there is a lot of exciting new technology to pick up. Our colleagues have started to drop into our offices, look at some printout, and ask “What is Scala?” or “What is Cassandra?”. I’m pretty sure they think we’re making all those projects up as we go along ;)
Posted by Mikio L. Braun at 2010-07-15 12:00:00 +0000