MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

One does not simply scale into real-time

Real-time seems to be the next big thing in big data. Map-Reduced has shown how to perform big analyses on huge data sets in parallel, and the next challenge seems to be to find a similar kind of approach to real-time.

When you look around the web, there are two major approaches out there which try to building something which can scale to deal with Twitter-firehose-scale amounts of data. One is starting with a MapReduce framework like Hadoop and somehow finagle real-time or at least streaming capabilities on it. The other approach starts with some event-driven “streaming” computing architecture and makes it scale on cluster.

These are interesting and very cool projects, however from our own experience with retweet analysis at TWIMPACT, I get the feeling that both approaches fall short of providing a definitive answer.

In short: One does not simply scale into real-time.

Real-Time Stream Analysis

So what is real-time stream analysis? (Apart from the fact that I seem to be unable to decide whether to write it with a hyphen or as one word like Donaudampfschifffahrtskapitänsmütze).

Basically, the idea is that you have some sort of event stream like Twitter messages, or Apache log data, or URL expansion requests at bit.ly, which comes in at a high volume of several hundreds or even thousands of events per second. Let’s just focus on Twitter stream analysis for now.

Typical applications are to compute some statistics from the stream which summarizes it in a nice fashion. For example

  • what is the most frequently retweeted tweet?
  • what is the most frequently mentioned URL?
  • what are the most influental users (in terms of mentions/retweets)

These are pretty basic counting tasks. Very closely linked are questions like what the score of an arbitrary tweet or URL is, or what the first few hundred most influental users are, and so on. You can also compute more complex scores based on more of these numbers. For example, our own TWIMPACT score is based on all the counts of a user’s retweets, and something like the Klout score is also based on a number of statistics (I’m only assuming here, of course).

Since the data arrives in real-time, you typically also want to get the results in real-time (instead of a daily report which is already hours old when you get it). Ideally, you get updated scores with each event, also I’d say everything is ok if a query always takes less than a second.

Finally, you’d probably also want to be able to look at historical data to see how a user or retweet has performed in the past and to see whether its activity is going up or going down.

Just to give you an idea of the amount of data involved: Each tweet corresponds to about 1k of data (including all metadata). If we assume that we have about 1000 tweets per second (actually it’s probably more), then we get about 86.4 million tweets per day, or about 82.4GB of new data per day (about 30TB per year).

Now let’s discuss how you would approach this problem in a database centric fashion and using a stream processing framework.

Databases Approach

With “database approach” I try to cover a whole range including traditional relational databases, NoSQL databases and MapReduce frameworks. The common denominator is that you basically pipe your data into the database and use the built-in queries of the database to compute your statistics. Depending on the type of NoSQL database you are using you’ll probably have to do the analysis online to precompute all your statistics because you the database doesn’t have sufficient query capabilities.

As I see it, there are two main problems with this approach: First of all, the size of your database grows at a non-trivial rate. Unless you’re Google and you’ve planned for exponential growths of your data centers anyway, you will eventually run out of space and performance. Even if you assume that the reponse time will increase only logarithmically in your data size, your cluster will eventually become quite slow to deal with real-time data.

This will directly affect the time it will take for your queries to complete. However, as queries also put quite some load on your disks, this problem will only get worse over time.

At the same time, most of the data will likely be irrelevant for your current analysis. Tweets stop being retweeted, URLs fall out of fashion. In other word, there is a huge amount of historical baggage clogging up your servers. It is true that you still need this data in order to compute historical statistics, but note that the data doesn’t change anymore, so that it would make much more sense to analyse the data once and only put the results in some read-only storage.

You could of course try to keep the database size constant by periodically discarding old data. I think this is the right approach and I’ll discuss it in more detail below. Note, however that many databases cannot really deal well with huge amount of deletions, as they need some form of more or less disruptive garbage collection (vacuum, compaction, etc.) to actually free up the space.

Finally, one should also not forget that MapReduce doesn’t work with all kinds of problems, but only with problems which are already inherently easy to parallelize. Actually, if you look into research in parallel algorithms, you will see that almost no problems scale linearly in the number of available processors (and you will also see that many of the efficient algorithms work with shared mutable state! Ugh!)

Stream Processing

So if storing all your data on slow disks to process it later is the wrong approach, you should probably try to process the data as it comes in through some form of pipeline. This is more or less the idea behind stream processing. Two example frameworks are Storm, originally developed by BackType (recently acquired by Twitter), and S4 developed by Yahoo.

If you look closer, these are basically quite sophisticated frameworks to scale an actor based approach to concurrency. If you’re not familiar with it, it’s the idea to structure some computation in terms of independent small pieces of code which do not share state and communicate with one another through messages.

Frameworks like the ones above let you define a computation in terms of a number of processing nodes which may also run on different servers, with the ability to add more parallel workers of a certain kind on the fly to scale up processing resources where necessary.

This approach is essentially orthogonal to the database approach. In fact, stream processing frameworks usually don’t even deal with persistence, they only focus on the computation. Therefore, there is also nothing specific to real-time stream processing in these frameworks. In essence, they deal with the question of how to split up a computation into small independent parts and how to scale such a network on a cluster.

To me, the basic problem with this approach (and with actor based concurrency) is that it doesn’t deal well with peak volumes which surpass the computation bandwith. In fact, it’s even not so simple to say given such a network what the maxium throughput is. Conceptually, it is the throughput of the slowest component, but you also have to take the message routing topology into account.

Now, once more messages need to be processed than possible, somewhere in the system messages queues are starting to fill up. This is a general problem with such systems, not specific to actor based concurrency. The book Release It! contains some very entertaining and also frightening real-world war stories of what can go wrong with systems where you plug together components with quite different capacities.

Another problem is actor based concurrency tends to break a simple function which may fit on a single screen into dozens of classes, but that is another problem.

In any case, the question is how you can guarantee that your system will run stably even for high peak volumes, apart from just adding more nodes? The easiest thing would be to randomly drop events (resulting in a more or less consistent subsample which at least has the same distribution as the original data stream), but is there something different you could do?

Stream Distillation

So to summarize: Putting all your data into a database is problematic because the data steadily grows and computing statistics based on the data is too slow. You also don’t really need to keep all your data at hand to have an analysis of the current state of the stream.

Stream processing, on the other hand, is a nice tool to scale your computations, but it doesn’t deal well with peak volumes, and depending on how you persist your data, you run into the same scaling issues as the database centric approach.

In other words, what we need is a way to separate the data we currently need for our analysis from the historic data and analyses, and we also need a way to limit the bandwidth of events while having as little distortion as possible on the statistics we’re interested in.

As it turns out, these kinds of problems are closely related to “frequent itemset” problems in data mining. The idea is to identify the top most frequent items in a data stream with limited resources, both in terms of memory and computation. Such methods are discussed, for example, in Chapter 4 of Rajaraman and Ullman’s upcoming book Mining of Massive Datasets.

Most of these methods work by keeping a fixed amount of counts and then replacing the least frequent ones when a new type of event occurs. They come with some sort of theoretical guarantee to identify the correct set of items, or at least some bound on the differences in the count. However, when analysing retweets you get quite good results because only a fraction of the retweets is retweeted more than once, so that most of the slots are occupied by retweets which do not reoccur and discarding them doesn’t hurt.

Summary

As you might have expected, this is exactly the approach we took with TWIMPACT. All the demos you can see at our site are powered by our new trending backend (codename “Trevor”), which is basically an in-memory trend database built using ideas from stream mining, together with read-only snapshots on disk for later historical analyses.

These ideas can of course also be combined with databases and stream processing, but already without a huge amount of parallelism, we’re able to process a few thousand tweets per second.

In summary: it’s not enough to just scale your database needs and your computational algorithms, to make your analysis framework stable against peak volumes and sustainable in terms of data growth, you also need to separate your analysis database from the historic data and add some ideas from stream mining to extract a statistically approximate subsample with capped bandwidth from your data.

Peer Review and NoSQL

Disclaimer: This post definitely falls into the tl;dr categories of posts. I’ve been collecting these ideas for quite some time now, and somehow this post got longer and longer. Anyway, it is a complex topic.

Ever since I started to work in an academic environment back in ‘96 (working as a student in Rolf Eckmiller’s neuroinformatics group), the peer review process has always been a big topic. There were always some people complaining, discussing possible ways to improve it, or dismissing the whole idea of peer review at all.

The interesting thing is that very little has changed since then. If we look not only at peer review but the whole scientific publication landscape, you can see a few significant changes. For example, open access is much more real than it used to be, and there are more scientific journals which let authors keep their full copyright and the right to republish the papers on their webpages, etc.

These changes are all important, but what I find curious is that the general peer review process hasn’t changed at all. The process to get published or accepted at a conference is still the same: You submit your paper to some board where it gets handed to two or more reviewers whose identity is not revealed to you. Based on the verdict of the reviewers the action editor/program chair decides what to do with your paper.

I won’t repeat all the things which people perceive as being broken with this system. Let’s just say that the process has a high error rate, both false positives and false negatives (a.k.a. the “bad review problem”), it can take very long for a paper to get published, and the workload on the reviewers is pretty high.

Still, nothing has changed. Is this only because we sort of bring this problem upon ourselves (as opposed to the system being forced upon us by some external agency)? Or is there a deeper reason?

A Closer Look at Peer Review

I think the main reason why peer review is so resilient is that it is already a quite elegant partial solution to a number of interwoven requirements. In other words, just like democracy, it is an imperfect system but the best we have discovered so far.

In this whole scientific publication business, there are a number of stakeholders involved:

  • Science as a whole wants to progress to solve the smaller and bigger mysteries of life, the universe and all the rest. Science needs the publication system to be efficient, fair, and open, such that information can be distributed quickly and without bias.

  • Researchers want to have fun researching, but also need to build up a reputation to keep doing so. For this, they need the publication system to build up a track record of their work.

  • Researchers also need to have access to the works of others, to know what has already been done, which problems have been solved, and so on. The publication system is basically like an enormous, ever expanding library of knowledge.

  • Universities and funding agencies need the publication system to asses the scientific output of researchers for hiring decisions and to explain to tax payers how and why the money has been spent. Peer review is a very handy way of assessing scientific output. You can use it to basically just say “I don’t know exactly what they have been doing (and it’s probably not even practically relevant for another decade or so), but at least these other researchers said that it’s good.”

  • Publishers mainly want/need to make money (and probably also have a name in the whole scientific endeavor).

What’s important to understand, is however how peer review addresses all of these problems at least partially:

  • Exchange of information is more or less efficient, fair, and open. It could be more efficient, but the publication lag is still on the same order of magnitude as the actual work. It’s not like science is already five years ahead of a huge backlog of publications (at least I hope that’s not the case…) It is fair, because a good reviewer is bound by a scientific code of ethic to be a fair and unbiased, and it is open because everyone can submit something (as opposed to a closed club where you first have to become a member to get a chance to publish.)

  • Researchers get an excellent standardized measure of scientific output. A published journal paper is something nobody can take away from you. Not only is a published paper an important step towards tenure, it is also something everyone agrees on. On the other hand, peer review gives you a level of filtering such that the amount of new results is just large enough to be manageable.

  • Universities and funding agencies are also happy, because they have a solid, generally accepted measure of scientific productivity, which even laymen can understand.

  • Publishers can get their share by building a strong brand, becoming a journal with a high impact (while having researchers doing the actual peer review work for free, but that is another problem).

In summary, peer review is an okay solution to a complex problem, and whatever solution you propose to replace it has to cover all of these aspects as well.

You can’t ignore the complexity of the problem

You’re probably wonder when I’ll come to the NoSQL bit of this post, but before we get there, let’s briefly discuss how common alternatives fail because they do not address all of the above aspects.

For example, a common approach is to say that we should replace this whole process with a social media site around publications. Let’s just call it SciNet for now (and I know that already exists, but it’s really hard to find something in that namespace which isn’t already taken). “Likes” or “Recommendations” would work as filtering, connections between users give people structure to navigate, or to form “Web of Trusts”, and so on.

This idea has some appeal, but it neglects the aspect of building a track record and giving an objective measure of scientific output, because you’ll have a hard time explaining to your funding agencies that that non-peer reviewed paper of yours is a solid piece of work because it got 1.5M “likes” on SciNet. I’m not saying that this probably cannot be solved, but you can’t just copy existing concepts, and you would also need to invest quite an amount of lobbying to convince the universities which hire professors and the funding agencies which pay for your research to accept these measures.

Other approaches focus mainly turn-around times and open access, proposing some central server which is a mixture of a preprint server and a perpetual archive. Such systems don’t really address the filtering aspect, and also don’t deal with the main problem of how to improve peer review.

Finally getting to the NoSQL part

So from a higher-level, we have a situation which is pretty common in engineering: We have a well-tested and established piece of “technology” for a complex problem. It’s been around for quite some time now, and it shows. Somehow, it hasn’t kept up with the acceleration of communication which the Internet brought about. We’ve seen how fast information can be exchanged, and we’d like to have that kind of quality for our professional scientific exchange as well.

Of course, there is still room for improvement. People could just work harder to write better reviews on time, action editors could press reviewers harder to give good reviews. Already communities have found ways around the long turn-around times by moving to conferences (like computer science) or preprint servers (like physics). Conferences are actually an interesting example, because they play a quite different role in computer science and mathematics. In CS, conferences have become as important as journals, which is problematic because the review process is quite different (as there is really no way for a revision). In mathematics, conferences are much more informal. Often you can apply with just an abstract. That way, conferences function mostly as a platform for exchange, and less as an outlet for publications.

But in order to change the problem, you either have to find a solution which is uniformly better than the current system on all the aspects I’ve talked about earlier, or you have to put an equal amount of work into marketing to convince people that some of the aspects are not important anymore.

All of this reminds me of the NoSQL movement. Classical relational database systems were the standard till a few years ago. Like peer review they address a very sophisticated set of requirements, and have been around for quite some time. However, it also became more and more apparent that they aren’t good for certain applications.

The main contribution of the NoSQL movement was to understand that some of the requirements could be weakened because they really weren’t that important for certain kinds of applications, and to see how that changed set of requirements could be used to produce systems which scale more easily.

What does this mean for the scientific publication system? I think to find an alternative process, we need to be fully aware of all the requirements the current system addresses, but we also need to question these requirements and be ready to fight hard to make people do the same. Because otherwise we’re stuck with finding a system which is better in all the aspects than the current system.

Note that this approach is also different from just focusing on one aspect and ignoring the rest as some of the approaches I’ve discussed above. It’s really something different to say “we’ve considered these requirements, but we think they aren’t important anymore” than “we just considered half of the problem for now.”

Rethinking why we have peer review

So the question is which parts of the problem can go? I think generally there is the tendency to believe that we don’t really need the publishers anymore. The Internet has made it very easy to publish something even in a permanent fashion, and most of the actual work has already been done by us anyway.

There is really no way around an efficient exchange of information and being able to find the information you look for. These are probably the core requirements.

Track records and objective measures of scientific output are of course important, but I think we might be able to find something new here eventually (and the current system also doesn’t really work well anyway. Daniel Lemire has a number of posts how papers as units of scientific work don’t make sense).

I think peer review is still very valuable, but its role probably needs to change. If we find more effective ways of filtering and measuring the impact, we no longer need peer review to be the first threshold to publication, and we no longer suffer from its errors or long turn-around times.

What can we do for now

So what can we do for now. Actually, I think you can do a lot. Don’t forget that we’re running this system ourselves. So whenever you are a reviewer, work hard to be an unbiased and fair reviewer. Never recommend to reject a paper just because you somehow missed the point and didn’t like the overall approach. NEVER reject a paper simply because it hasn’t compared itself against method X (there are thousands of methods out there), unless there is a very good reason to do so. NEVER reject a paper because you believe it is similar to method Y, unless you are very certain that they are very similar. In all the cases I got reviews like this, it never was true.

If you are an action editor or area chair, don’t accept bad reviews. If you organize a workshop, think about alternative ways to accept and review papers. Turn your blog into an informal journal, invite people to submit their work if they want to get the word out.

If you are in a position to discuss with decision makers in funding agencies, talk to them about alternative ways to measure scientific output. If you are in a committee to hire new faculty members, don’t just rely on impact factors to assess the scientific output of a member, but encourage the others to also look at the contributions of a candidate to the community besides peer reviewed publications.

And if you want to develop something new, always be aware of the full complexity of the problem, and be ready to explain why you neglect some of its aspects.

For further reading, Marcio von Muhlen has an interesting post called “We Need a Github of Science” which covers a lot of ground, and also tries to take into account the whole problem.

A last piece of advice: First get tenure or some other kind of permanent position, then work on improving the system. Always remember that others are publishing papers in the old system while you fantasize about a better world.

Short Review: Visualize This by Nathan Yau

I can’t really remember how I came across this book. I think it was recommended by Amazon. The price was ok (about 31€), at least for a Wiley book, so I just went ahead and bought it. You probably know Nathan Yau from his blog FlowingData where he frequently posts visualizations and interesting infographics.

Overall, the book is quite nice. It starts with some basic discussion about on visualizations in general, stressing the fact that visualizations are an excellent tool to tell the stories behind statistical data. It then goes through some tools, starting with Excel, and then covering tools like R, as well as JavaScript plotting libraries like protovis (now abandoned in favor of D3.js), and several other more specialized libraries, for example, for maps.

The remainder of the book goes through different kinds of visualizations in detail, from timeseries data, scatter plots, maps, etc. Each of the chapters focuses on one tool and show how to get the final plot in great detail.

What I found particularly interesting is that his workflow almost always includes refining the plot in Illustrator to make the graphic more appealing and to add explanations and further labels. This might be nice if you create static visualizations, but if you want to generate dynamic visualizations automatically from data, you’ll have to keep tweaking the original plot until it looks nice enough.

The book is probably too entry level if you already have some experience with data analysis or programming. It tries to require no prior knowledge in programming, although I wonder whether you can really learn how to use R from the examples alone. On the other hand, if you want to do some visualization with maps, for example, it’s nice to have almost complete examples in there.

I also particularly liked the introduction and the final chapter which give a lot of interesting insight on the business of creating visualizations.