MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

What's happening over at Twimpact

Machine learning can take place in an abstract space, more or less far away from any applications. I know what I’m talking about because I’ve spent a few years in that space: data is always vectorial, examples come in benchmark data sets you know by heart, and learning and prediction are always performed in batches and offline (and usually confined to some nested cross-validation loops).

On the other hand, once you start working on real applications you enter a totally different space. Data has to be acquired, stored, and potentially processed in real time. And you probably cannot do it in Matlab.

Twimpact is a very good example of the kinds of technologies you have to get used to in order to analyze real data. Twimpact analyzes retweets on Twitter to do trending and impact analysis of users. It comes with a site where you can see live trends and browse around (have a look at the Japanese site to see a running example). You might have noticed that we’ve recently shut down the live trending at twimpact.com. The main reason was that the site had become unbearably slow: after a year of constantly monitoring retweets from Twitter, our initial setup was not feasible anymore.

The Initial Setup

Initially, we started with a PostgreSQL database and a little backend written in JRuby based on Ruby on Rails. Later on, we rewrote part of the front end in Groovy using the Grails framework. The whole thing was hosted on a dedicated server with a quite large four-disk RAID 5. In initial tests, the RAID produced some very impressive read and write rates of about 200MB/s.

Now, a year later, the database and the RAID have become the biggest performance bottleneck. To understand why, you have to know that we’ve already analyzed several hundred million tweets, which puts a lot of pressure on the database, and that we need to match each new tweet against the whole history to find the matching retweet.

Currently, we’re getting a few million more retweets per day, and the time necessary to match a new tweet to its retweet, update the statistics, and recompute the impact factors has become too large to keep up with the current rate. Part of the problem is also that we’re letting the database recompute the trend statistics using a single SQL query which, for the hourly trends alone, already has to go through roughly a hundred thousand retweets.
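Just to give you an idea of the shape of such a query, here is a rough sketch in Scala using plain JDBC. The table and column names are made up for illustration; the actual twimpact schema and query are more involved:

import java.sql.DriverManager

object HourlyTrends {
  def main(args: Array[String]) {
    // hypothetical connection settings
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost/twimpact", "user", "password")

    // one big aggregate over the last hour of retweets; with a few
    // million new retweets per day, this scan is what hurts
    val rs = conn.createStatement().executeQuery("""
      SELECT retweet_id, COUNT(*) AS hits
        FROM retweets
       WHERE created_at > now() - interval '1 hour'
       GROUP BY retweet_id
       ORDER BY hits DESC
       LIMIT 50""")

    while (rs.next())
      println(rs.getLong("retweet_id") + ": " + rs.getInt("hits"))

    conn.close()
  }
}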

What we end up with is several long-running SQL queries on a database which sees 20-30 new tweets per second (each of which generates about a dozen queries). I’ve never seen such system loads before.

Buzzword Bingo

At some point, it became pretty apparent to us that we needed an alternative, which led us to look for other solutions. We eventually settled on Cassandra, because it seems to have the most momentum right now. In case you don’t know, Cassandra is one of those newfangled NoSQL stores which loosen the requirements on consistency and transactions to gain better scalability and speed. These NoSQL stores come in all kinds of flavors, the main differences being whether the data is held in memory (like memcached or redis) or persisted to disk, and whether the store holds simple objects (basically byte arrays) or provides more structure (like MongoDB, CouchDB, or Cassandra).

In addition, we also wanted to get away from the Java/JRuby/Groovy language mix, and settled on Scala, which seems pretty promising in terms of expressiveness and ease of integration with Java.

Finally, as a last step, we started to look into some messaging middleware. The advantage of using messaging is that the different parts of the system become more independent and modular, such that you can independently restart parts of the system, or add analysis modules on the fly.

In a certain way, our system already consisted of several independent processes which communicated quite implicitly through the PostgreSQL database, and anything which puts less load on the database seems fine to us. Also, at some point we might want to distribute twimpact over a cluster, which is a lot easier when you already use some messaging infrastructure.

Currently, we’re looking into ActiveMQ as the main message broker, possibly together with Camel, a library which allows you to do more high-level routing, and Akka, a library for event-driven programming in Scala.
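To give a flavor of how little code is involved, here is a minimal sketch of publishing a tweet to an ActiveMQ topic via the standard JMS API. The broker URL, topic name, and message payload are made up for the example:

import javax.jms.Session
import org.apache.activemq.ActiveMQConnectionFactory

object TweetPublisher {
  def main(args: Array[String]) {
    // connect to a (hypothetical) local broker
    val connection =
      new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection()
    connection.start()

    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)

    // publish to a topic; any number of analysis modules can
    // subscribe to it independently and be restarted on the fly
    val producer = session.createProducer(session.createTopic("tweets.incoming"))
    producer.send(session.createTextMessage("""{"retweet_of": 12345}"""))

    connection.close()
  }
}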

In short, once you leave the confines of “pure” machine learning and want to build robust and scalable systems, there is a lot of exciting new technology to pick up. Our colleagues have started to drop in at our offices, look at some printout, and ask “What is Scala?” or “What is Cassandra?”. I’m pretty sure they think we’re making all these projects up as we go along ;)

MLOSS workshop at ICML 2010

Last Friday we held our 3rd machine learning open source software (MLOSS) workshop. While the previous meetings had been at NIPS, we opted for ICML this time. The workshop was held in a smallish meeting room on the 20th floor of the Hotel Dan Panorama in Haifa, Israel.

Several constraints led me to travel to Israel only very briefly: I arrived on Thursday evening just in time to have dinner with our invited speakers, and left by taxi for the airport in Tel Aviv at 9pm in the evening after the workshop. Unfortunately, we had a stopover in Brussels on our way back. We arrived at five in the morning, all the shops still closed, but with an amazing influx of passengers nevertheless. This was also the first time I took a nap on some benches at an airport.

All in all, it was a nice trip with a very nice workshop. Head over to the blog at mloss.org where I summarized the workshop.

Eventually, the videos will be put online at videolectures.net. Till then, have a look at some pictures of our MLOSS workshop.

Companion Objects as Classes in Scala

Here is a little pattern I came across. It basically shows how companion objects can work like class objects in Ruby, and how type inference makes working with such types quite painless. I’m not sure if this is already widely known, but a quick Google search doesn’t reveal anything similar.

Types are not really first-class objects in Scala (nor in Java). In generics, type parameters are removed at compile time through type erasure, and you cannot simply pass a class to a method by saying method(ClassName).
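To make this concrete, here is a small sketch (not from the twimpact code) of what is and isn’t possible:

// passing an explicit Class token works...
def create[T](clazz: Class[T]): T = clazz.newInstance

val d = create(classOf[java.util.Date])  // ok: Date has a no-arg constructor

// ...but inside a generic method the type parameter is erased,
// so the following does not compile:
// def create[T](): T = new T   // error: class type required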

Most of the time, this is not an issue, but sometimes it would be very handy to pass a class to a method, for example when you need to create new objects of a given type. One example I came across was when working on the new Cassandra-based backend for twimpact. By default, Cassandra only supports storing byte arrays, and we needed some infrastructure to serialize objects into byte arrays and back (without using standard Java serialization). Now serializing an object is simple enough: you just write a trait which provides a function for serializing. However, deserializing is a bit harder because you don’t have an object available. So the question is, how does the program know how to deserialize?

In Ruby, you would probably just pass the class object and work with the class methods like this:

class StoredNumber
  # reassemble the integer from four bytes (big-endian, matching
  # to_byte_array below)
  def self.from_byte_array(bytes)
    i = bytes[0] << 24 | bytes[1] << 16 | bytes[2] << 8 | bytes[3]
    StoredNumber.new(i)
  end

  def initialize(i)
    @value = i
  end

  def to_byte_array
    [ (@value >> 24) & 0xff,
      (@value >> 16) & 0xff,
      (@value >> 8) & 0xff,
      @value & 0xff ]
  end
end

# and then, you can store a hypothetical element to a store and
# convert it back by passing in the class:

Store.put(StoredNumber.new(3))

x = Store.get(StoredNumber)

The storing part could be done in the same way in Scala (of course explicitly introducing a trait for the conversion part):

trait ConvertsToBytes {
  def toByteArray(): Array[Byte]
}

class StoredNumber(value: Int) extends ConvertsToBytes {
  def toByteArray(): Array[Byte] =
    // same conversion as in the Ruby version above
    Array(((value >> 24) & 0xff).toByte,
          ((value >> 16) & 0xff).toByte,
          ((value >> 8) & 0xff).toByte,
          (value & 0xff).toByte)
}

// Store it like this
Store.put(new StoredNumber(42))

To retrieve an object, you could pass a converter function like this:

def numberFromBytes(bytes: Array[Byte]): StoredNumber =
  // convert and extract from bytes (the inverse of toByteArray)
  new StoredNumber(((bytes(0) & 0xff) << 24) |
                   ((bytes(1) & 0xff) << 16) |
                   ((bytes(2) & 0xff) << 8) |
                   (bytes(3) & 0xff))

// this is how Store would have to be defined (as an object,
// so that we can call Store.get directly)
object Store {
   // ...
   def get[T](convert: (Array[Byte]) => T): T = //...
}

// Get a number (without type inference)
Store.get[StoredNumber](numberFromBytes)

You can actually drop the type parameter on the call to get:

Store.get(numberFromBytes) // Type StoredNumber is inferred by Scala.

Still, this is not as elegant as the Ruby version because the class and the converter are separate entities.

The solution is to use another trait for the conversion back and let the companion object implement that trait:

// Note that we have to put in the result as a type parameter.
trait ConvertsFromBytes[T] {
  def fromBytes(bytes: Array[Byte]): T
}

// Note that we need to explicitly name StoredNumber when
// extending ConvertsFromBytes
object StoredNumber extends ConvertsFromBytes[StoredNumber] {
  def fromBytes(bytes: Array[Byte]): StoredNumber =
    // ... convert back from array
}

// This is how Store would have to be implemented now.
// (Note that "type" is a reserved word in Scala, so the
// parameter is called "converter" here.)
object Store {
   def get[T](converter: ConvertsFromBytes[T]): T =
     // ... convert back using converter.fromBytes()
}

// Without using type inference, we would have to say
Store.get[StoredNumber](StoredNumber)

// but with type inference, we can simply say the following
// which is just as compact as passing the class object.
Store.get(StoredNumber)

On closer inspection, this construction is very similar to class objects in Ruby. The only difference is that you have to explicitly define a trait for the methods you expect, which isn’t so surprising after all. The rest is taken care of by type inference.
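To round things off, here is a minimal end-to-end sketch of the pattern with a toy store. It just keeps a single value in memory, which is obviously not what the Cassandra backend does, but it shows the round trip:

// toy in-memory store holding a single serialized value
object Store {
  private var data: Array[Byte] = Array()

  def put(value: ConvertsToBytes) { data = value.toByteArray() }

  def get[T](converter: ConvertsFromBytes[T]): T =
    converter.fromBytes(data)
}

Store.put(new StoredNumber(42))
val n = Store.get(StoredNumber)  // n is a StoredNumber, inferred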