Monday, December 10, 2012
File under: machine room
The past few weeks we’ve been busy extracting the real-time engine behind our Twitter analysis stuff, resulting in streamdrill. At heart it is a stream mining algorithm behind a simple REST interface to quickly solve the “top-K problem” of finding the most active items over different timescales in real-time. As usual in stream processing algorithms, events are processed as data comes in such that queries are instantaneous. No more waiting minutes for that map-reduce job to finish, the answer is just there.
In addition, streamdrill employs the kind of automatic resource management I’ve often talked about (for example, here). In this scheme, you specify how many elements you want to keep in memory and least active entries are replaced to make room for new ones. If you are concerned about the approximative nature of this kind of analysis, be sure to read this blog post where I explain why I believe exactness is not always essential.
So what does it look like? Streamdrill aggregates activities from an event stream. You pipe in your events and get the trends for the different timescales. Events consists of a number of fields we call entities.
In addition, streamdrill defines indices for the different entities of an event such that you can quickly drill down on your trends.
Just to give you an example, it’s really simple to build a basic Twitter retweet analysis using streamdrill.
We get tweets from the public Twitter stream API and extract the id of the retweeted tweet and the user from the data. This only works for API Retweets, but is good enough for this demo. The resulting trend looks like this (go to demo.streamdrill.com for a live demo):
If we filter down on retweets from OneDirection (where is Justin Bieber if you need him?), we get the following list:
A little feature we’ve built in is the link template where you can display a link in the streamdrill dashboard constructed from the event data (little arrow on the left). In our case, we link back to the original tweet:
(If you wonder about the discrepancy between the counters, we had just restarted the analysis from midnight today, and you only get a subsample of the full feed without paying for it)
So you get a full featured Twitter retweet analysis with a few lines of code, where the hardest part is figuring out Twitter authentication.
Posted by Mikio L. Braun at 2012-12-10 17:04:00 +0100