Wednesday, November 30, 2011

Scala discussion heating up?

File under: Machine Room

Apparently, the discussions about “Scala being too complex” are heating up, mostly due to a leaked email from one of Yammer’s programmers to the Scala people where he discusses some of his experiences he’s had with using Scala in a production environment, and the other being a post on HN comparing Scala to Perl in the sense that both languages have too much flexibility in solving a specific task leading to a mix of different programming paradigms and styles which will make you code harder to read and maintain.

Now we’ve been using Scala as our main programming language for the last two and half years for TWIMPACT, so I know what people are talking about. And the truth is, it is all true, sadly. On the one hand, Scala is a pretty awesome programming language which is very nicely designed. I’ve said this before, but normally you will eventually come across some feature of a programming language which is not designed well and you have to code your way around it, but I’ve yet to come about something like it in Scala.

On the other hand, it is also true that some of the libraries are not as fast as they should be. Although I like the idea of immutable collections a lot, every time I need performance, I’d rather put in a Java collection. Also, it’s true that the collection library is pretty complex. It all kind of makes sense to get a clean design of the classes, but it’s pretty complicated with all those classes like Seq, SeqLike, Traversable, TraversableOnce, etc. However, you’ll probably only need to know all the details if you want to write your own collections which integrate seamlessly with the existing collection classes.

It’s also true that upgrading to a new version is hard. For some reason, many libraries seem to be quite deeply interlocked with the Scala version. While our own code never had to be changed if Scala went to a new version, this wasn’t true for most libraries, unfortunately, meaning that you have to wait till all the libraries have been upgraded to the new version before you can do the update yourself. And frankly, I don’t see why this is necessary.

We’ve never bothered with sbt, but directly went for maven due to it’s better integration in most IDE’s. We’re using IntelliJ IDEA whose Scala plugin has come a long way and gives pretty good support. There is also a lot to be improved in the basic tools like the compiler or the shell in terms of startup time. Scala seems to preload several megabytes of jar files on startup, probably in an attempt at optimization, but in the end, it only means that starting Scala takes anywhere between 5 - 10 seconds which is really a lot if you’re working on the shell (and every other language starts up almost immediately) The guys behind JRuby have invested a lot of time to cut down on the startup time, and that was time well spent.

People are also often attacking Scala for it’s complexity. While it’s certainly true that it’s easier to hire some Java expert than someone who knows Scala, IMHO Scala is a big improvement in many ways over Java, which feels overly verbose once you’ve learned Scala. As with every language, there are more basic concepts and more advanced concepts and usually, you don’t have to master them all from the start. Also, people often argue as if the complexity about learning a programming language is all in the programming language, but you also have to consider the standard libraries and tools. For example, while the Java programming language is relatively simple in terms of concepts, the standard tools and frameworks are pretty intimidating to learn (all that XML, Maven, Spring, etc.)

Then people are also complaining about the community, which is supposedly not helpful enough, or too fragmented, or only consists of crazy people who are just thinking about how to implement everything in terms of category theory. I don’t think that is true. Scala is still young, and the community can still grow. We’ve uncovered a number of bugs (mostly Leo who has a knack for finding bugs in libraries) and people were mostly as responsive as you’d expect them to be. One of the strengths of Scala is also that it is quite painless to reuse existing Java projects (as any other programming language for the JVM). I never found it that repulsive as some seem to use a Java library from Scala. The integration is quite painless, and if you really have to, you can add a bit of syntactic sugar on your side for the stuff you need most.

Finally, I really don’t get the argument of people who are saying “Scala is too complex, I switched to Python (or some other scripting language)”. To me, these are completely different sets of programming languages. While it’s true that there are some applications like writing medium sized web sites which you can nowadays do in either a scripting language or a compiled language, there are many applications where Python (or any other scripting language) just can’t compete. In scripting languages, it’s hard to add primitive data types which are really fast unless someone else already took care of implementing the most computing-intensive routines in C.

So in summary, Scala is both awesome and awful, just like almost every piece of sufficiently advanced technology. You can work with Scala, and it’s a lot of fun, or you can reject it for a number of reasons, just acknowledge the complexity and don’t give in to hypes and marketing.

Updates: Coda Hale’s comment on the leaked email, Yammer’s official statement on Scala, Typesafe’s post on their committment to the industry.

Posted by Mikio L. Braun at Wed Nov 30 11:18:00 +0100 2011.

Tuesday, November 01, 2011

Analyzing Social Media Data

File under: Seminar Room

Analyzing social media has become quite popular. People have been predicting box office openings based on Twitter chatter, studied information diffusion patterns, information flows between classes of users, how real-world events like earthquakes are reflected in Twitter.

This is all pretty exciting and interesting, but there are also a few things where there is still room for improvement.

There is very little stuff on real-time analysis. Many papers boast with the hundreds of millions of tweets (and the access to Twitter’s firehose necessary to get that amount of data) which have formed the basis for the paper. However, many papers later introduce some more or less arbitrary ways of truncating the data, for example by taking a number of “most active users”. This is both true for Jure Leskovec’s paper as well as the Yahoo research’s paper.

However, I think that getting to real-time is extremely important, because you cannot just wait for days or longer to get your analysis. By that time, more data will have been streaming in, and when are you going to analyze that data?

Another problem with many of the analyses is that they focus on the positive cases only. Meaning that they develop some method to detect bursts or trends and then use some famous real-world example (like Japan winning the women’s soccer championship) to show that the method is triggered by the data. However, few publications go so far as to validate their method on negative examples as well, showing that the method not only detect trends well, but also does so robustly with few false positives.

A classical example is the highly cited 2003 paper by Jon Kleinberg ”Bursty and Hierarchical Structure from Streams” which explains how to detect areas of higher than usual activity, for example, from email streams. But then, the paper shows how the detected structure coincides with real deadlines for two examples without discussing negative examples in depth.

Many methods also seem to believe that an analysis which is based on hundreds of millions of data points is automatically true in general. While this is certainly true for simple statistics which you can estimate well, there are other methods which can overfit. And for those, as many other disciplines like bioinformatics have had to learn the hard way, as you get more data, the probability that you find some evidence for your hypothesis increases drastically.

To get reliable results, you need to follow the same rules as when validating the performance of a machine learning algorithm: Test on data which is disjoint from training data. If your method detects trends, check it on data which you believe has no structure. If you aggregate topics, check it on days when nothing special was happening. If you analyze the structure of the data, check on an independent sample (ideally from a period of time which is a bit removed from the original sample).

That way you might have less data available, but your results will improve a lot in terms of reliability.

Posted by Mikio L. Braun at Tue Nov 01 22:20:00 +0100 2011.

Monday, October 10, 2011

One does not simply scale into real-time

Real-time seems to be the next big thing in big data. Map-Reduced has shown how to perform big analyses on huge data sets in parallel, and the next challenge seems to be to find a similar kind of approach to real-time.

When you look around the web, there are two major approaches out there which try to building something which can scale to deal with Twitter-firehose-scale amounts of data. One is starting with a MapReduce framework like Hadoop and somehow finagle real-time or at least streaming capabilities on it. The other approach starts with some event-driven “streaming” computing architecture and makes it scale on cluster.

These are interesting and very cool projects, however from our own experience with retweet analysis at TWIMPACT, I get the feeling that both approaches fall short of providing a definitive answer.

In short: One does not simply scale into real-time.

Real-Time Stream Analysis

So what is real-time stream analysis? (Apart from the fact that I seem to be unable to decide whether to write it with a hyphen or as one word like Donaudampfschifffahrtskapitänsmütze).

Basically, the idea is that you have some sort of event stream like Twitter messages, or Apache log data, or URL expansion requests at bit.ly, which comes in at a high volume of several hundreds or even thousands of events per second. Let’s just focus on Twitter stream analysis for now.

Typical applications are to compute some statistics from the stream which summarizes it in a nice fashion. For example

  • what is the most frequently retweeted tweet?
  • what is the most frequently mentioned URL?
  • what are the most influental users (in terms of mentions/retweets)

These are pretty basic counting tasks. Very closely linked are questions like what the score of an arbitrary tweet or URL is, or what the first few hundred most influental users are, and so on. You can also compute more complex scores based on more of these numbers. For example, our own TWIMPACT score is based on all the counts of a user’s retweets, and something like the Klout score is also based on a number of statistics (I’m only assuming here, of course).

Since the data arrives in real-time, you typically also want to get the results in real-time (instead of a daily report which is already hours old when you get it). Ideally, you get updated scores with each event, also I’d say everything is ok if a query always takes less than a second.

Finally, you’d probably also want to be able to look at historical data to see how a user or retweet has performed in the past and to see whether its activity is going up or going down.

Just to give you an idea of the amount of data involved: Each tweet corresponds to about 1k of data (including all metadata). If we assume that we have about 1000 tweets per second (actually it’s probably more), then we get about 86.4 million tweets per day, or about 82.4GB of new data per day (about 30TB per year).

Now let’s discuss how you would approach this problem in a database centric fashion and using a stream processing framework.

Databases Approach

With “database approach” I try to cover a whole range including traditional relational databases, NoSQL databases and MapReduce frameworks. The common denominator is that you basically pipe your data into the database and use the built-in queries of the database to compute your statistics. Depending on the type of NoSQL database you are using you’ll probably have to do the analysis online to precompute all your statistics because you the database doesn’t have sufficient query capabilities.

As I see it, there are two main problems with this approach: First of all, the size of your database grows at a non-trivial rate. Unless you’re Google and you’ve planned for exponential growths of your data centers anyway, you will eventually run out of space and performance. Even if you assume that the reponse time will increase only logarithmically in your data size, your cluster will eventually become quite slow to deal with real-time data.

This will directly affect the time it will take for your queries to complete. However, as queries also put quite some load on your disks, this problem will only get worse over time.

At the same time, most of the data will likely be irrelevant for your current analysis. Tweets stop being retweeted, URLs fall out of fashion. In other word, there is a huge amount of historical baggage clogging up your servers. It is true that you still need this data in order to compute historical statistics, but note that the data doesn’t change anymore, so that it would make much more sense to analyse the data once and only put the results in some read-only storage.

You could of course try to keep the database size constant by periodically discarding old data. I think this is the right approach and I’ll discuss it in more detail below. Note, however that many databases cannot really deal well with huge amount of deletions, as they need some form of more or less disruptive garbage collection (vacuum, compaction, etc.) to actually free up the space.

Finally, one should also not forget that MapReduce doesn’t work with all kinds of problems, but only with problems which are already inherently easy to parallelize. Actually, if you look into research in parallel algorithms, you will see that almost no problems scale linearly in the number of available processors (and you will also see that many of the efficient algorithms work with shared mutable state! Ugh!)

Stream Processing

So if storing all your data on slow disks to process it later is the wrong approach, you should probably try to process the data as it comes in through some form of pipeline. This is more or less the idea behind stream processing. Two example frameworks are Storm, originally developed by BackType (recently acquired by Twitter), and S4 developed by Yahoo.

If you look closer, these are basically quite sophisticated frameworks to scale an actor based approach to concurrency. If you’re not familiar with it, it’s the idea to structure some computation in terms of independent small pieces of code which do not share state and communicate with one another through messages.

Frameworks like the ones above let you define a computation in terms of a number of processing nodes which may also run on different servers, with the ability to add more parallel workers of a certain kind on the fly to scale up processing resources where necessary.

This approach is essentially orthogonal to the database approach. In fact, stream processing frameworks usually don’t even deal with persistence, they only focus on the computation. Therefore, there is also nothing specific to real-time stream processing in these frameworks. In essence, they deal with the question of how to split up a computation into small independent parts and how to scale such a network on a cluster.

To me, the basic problem with this approach (and with actor based concurrency) is that it doesn’t deal well with peak volumes which surpass the computation bandwith. In fact, it’s even not so simple to say given such a network what the maxium throughput is. Conceptually, it is the throughput of the slowest component, but you also have to take the message routing topology into account.

Now, once more messages need to be processed than possible, somewhere in the system messages queues are starting to fill up. This is a general problem with such systems, not specific to actor based concurrency. The book Release It! contains some very entertaining and also frightening real-world war stories of what can go wrong with systems where you plug together components with quite different capacities.

Another problem is actor based concurrency tends to break a simple function which may fit on a single screen into dozens of classes, but that is another problem.

In any case, the question is how you can guarantee that your system will run stably even for high peak volumes, apart from just adding more nodes? The easiest thing would be to randomly drop events (resulting in a more or less consistent subsample which at least has the same distribution as the original data stream), but is there something different you could do?

Stream Distillation

So to summarize: Putting all your data into a database is problematic because the data steadily grows and computing statistics based on the data is too slow. You also don’t really need to keep all your data at hand to have an analysis of the current state of the stream.

Stream processing, on the other hand, is a nice tool to scale your computations, but it doesn’t deal well with peak volumes, and depending on how you persist your data, you run into the same scaling issues as the database centric approach.

In other words, what we need is a way to separate the data we currently need for our analysis from the historic data and analyses, and we also need a way to limit the bandwidth of events while having as little distortion as possible on the statistics we’re interested in.

As it turns out, these kinds of problems are closely related to “frequent itemset” problems in data mining. The idea is to identify the top most frequent items in a data stream with limited resources, both in terms of memory and computation. Such methods are discussed, for example, in Chapter 4 of Rajaraman and Ullman’s upcoming book Mining of Massive Datasets.

Most of these methods work by keeping a fixed amount of counts and then replacing the least frequent ones when a new type of event occurs. They come with some sort of theoretical guarantee to identify the correct set of items, or at least some bound on the differences in the count. However, when analysing retweets you get quite good results because only a fraction of the retweets is retweeted more than once, so that most of the slots are occupied by retweets which do not reoccur and discarding them doesn’t hurt.

Summary

As you might have expected, this is exactly the approach we took with TWIMPACT. All the demos you can see at our site are powered by our new trending backend (codename “Trevor”), which is basically an in-memory trend database built using ideas from stream mining, together with read-only snapshots on disk for later historical analyses.

These ideas can of course also be combined with databases and stream processing, but already without a huge amount of parallelism, we’re able to process a few thousand tweets per second.

In summary: it’s not enough to just scale your database needs and your computational algorithms, to make your analysis framework stable against peak volumes and sustainable in terms of data growth, you also need to separate your analysis database from the historic data and add some ideas from stream mining to extract a statistically approximate subsample with capped bandwidth from your data.

Posted by Mikio L. Braun at Mon Oct 10 22:15:00 +0200 2011.

older posts