Analyzing social media has become quite popular. Researchers have predicted box office openings from Twitter chatter, studied information diffusion patterns and information flows between classes of users, and examined how real-world events like earthquakes are reflected on Twitter.
This is all pretty exciting and interesting, but there are also a few areas where there is still room for improvement.
There is very little work on real-time analysis. Many papers boast about the hundreds of millions of tweets (and the access to Twitter’s firehose necessary to get that amount of data) which formed the basis for the paper. However, many of them later introduce some more or less arbitrary way of truncating the data, for example by keeping only a number of “most active users”. This is true both for Jure Leskovec’s paper and for the Yahoo! Research paper.
However, I think that getting to real-time is extremely important, because you cannot just wait days or longer for your analysis. By that time, even more data will have streamed in, and when are you going to analyze that?
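To illustrate what real-time processing can look like, here is a toy one-pass sketch (not anyone’s published method; the class name, the smoothing factor, and the hourly bucketing are all invented for illustration). The point is that each event updates a running summary and is then discarded, so the analysis never falls behind the stream:

```python
class StreamingRate:
    """Exponentially weighted per-hour event rate, updated one event at a
    time. A toy sketch of one-pass (real-time) processing; alpha and the
    hourly bucketing are arbitrary choices."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.rate = 0.0
        self.current_hour = None
        self.count = 0

    def observe(self, hour):
        if self.current_hour is None:
            self.current_hour = hour
        while hour > self.current_hour:  # close out each finished hour
            self.rate = (1 - self.alpha) * self.rate + self.alpha * self.count
            self.count = 0
            self.current_hour += 1
        self.count += 1

# One pass over a simulated event stream: no need to store or revisit
# old data, so the summary is always up to date.
s = StreamingRate()
for hour in [0] * 10 + [1] * 12 + [2] * 11 + [3] * 50:
    s.observe(hour)
print(round(s.rate, 2))  # 2.99
```

The contrast with batch analysis is that nothing here requires a second pass: by the time the stream ends, the summary already exists.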
Another problem with many of these analyses is that they focus only on the positive cases: they develop some method to detect bursts or trends and then use a famous real-world example (like Japan winning the women’s soccer championship) to show that the method is triggered by the data. However, few publications go so far as to validate their method on negative examples as well, showing that it not only detects trends, but also does so robustly, with few false positives.
A classic example is the highly cited 2003 paper by Jon Kleinberg, “Bursty and Hierarchical Structure in Streams”, which explains how to detect areas of higher-than-usual activity, for example in email streams. But then, the paper shows how the detected structure coincides with real deadlines for two examples, without discussing negative examples in depth.
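To make the false-positive concern concrete, here is a toy sketch (not Kleinberg’s state-machine model; the threshold rule, window size, and simulated counts are all my own assumptions). The same detector is run both on a stream with an injected spike and, as a negative control, on featureless background traffic:

```python
import random

random.seed(0)

def detect_bursts(counts, window=24, z=3.0):
    """Flag positions whose count exceeds mean + z * std of a trailing
    window. A toy threshold detector, not Kleinberg's algorithm."""
    bursts = []
    for i in range(window, len(counts)):
        hist = counts[i - window:i]
        mean = sum(hist) / window
        var = sum((c - mean) ** 2 for c in hist) / window
        std = var ** 0.5
        if std > 0 and counts[i] > mean + z * std:
            bursts.append(i)
    return bursts

# Negative control: steady background traffic with no event at all.
quiet = [random.randint(8, 12) for _ in range(200)]
# Positive example: the same traffic with one injected spike.
spiked = quiet[:100] + [60] + quiet[101:]

print(detect_bursts(spiked))        # should contain index 100
print(len(detect_bursts(quiet)))    # detections here are false positives
```

Reporting only the first number is the pattern criticized above; the second number is what tells you whether the method is actually trustworthy.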
Many papers also seem to assume that an analysis based on hundreds of millions of data points is automatically true in general. While this holds for simple statistics which you can estimate well, other methods can overfit. And for those, as disciplines like bioinformatics have had to learn the hard way, the more data you have, the more likely you are to find some apparent evidence for your hypothesis by chance.
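A small simulation makes this effect visible (the sample and feature counts are arbitrary choices of mine): screen enough candidate features against a target, and some will correlate with it by pure chance, even when every single feature is noise.

```python
import random

random.seed(1)

n_samples, n_features = 100, 1000

# A target with no real relationship to anything.
target = [random.gauss(0, 1) for _ in range(n_samples)]

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Screen many random "features" against the target and keep the best.
best = 0.0
for _ in range(n_features):
    feature = [random.gauss(0, 1) for _ in range(n_samples)]
    best = max(best, abs(corr(feature, target)))

print(best)  # spuriously large: looks like evidence, but is pure noise
```

With 100 samples, the correlation of two independent streams has a standard deviation of roughly 0.1, so the best of 1000 tries lands far out in the tail. More features (or more hypotheses) only make the spurious winner look stronger.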
To get reliable results, you need to follow the same rules as when validating the performance of a machine learning algorithm: test on data which is disjoint from the training data. If your method detects trends, check it on data which you believe has no structure. If you aggregate topics, check it on days when nothing special was happening. If you analyze the structure of the data, check it on an independent sample (ideally from a period of time somewhat removed from the original sample).
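The disjoint-sample rule can be sketched as a simple time-based holdout (the record format, the one-week gap, and all names here are my own assumptions, not a standard API):

```python
from datetime import datetime, timedelta

def time_holdout(records, cutoff, gap_days=7):
    """Split timestamped records into a sample before `cutoff` and a
    disjoint check sample starting `gap_days` after it, so the two are
    separated in time. Hypothetical record format: (timestamp, payload)."""
    check_start = cutoff + timedelta(days=gap_days)
    fit_sample = [r for r in records if r[0] < cutoff]
    check_sample = [r for r in records if r[0] >= check_start]
    return fit_sample, check_sample

# Hypothetical stream: one record per day in October 2011.
records = [(datetime(2011, 10, d), "tweet %d" % d) for d in range(1, 31)]
fit, check = time_holdout(records, datetime(2011, 10, 15))
print(len(fit), len(check))  # 14 9: a one-week gap separates the samples
```

The gap matters because adjacent tweets are correlated; a check sample taken immediately after the fitting period is not really independent.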
You might end up with less data that way, but your results will be far more reliable.
Posted by Mikio L. Braun at 2011-11-01 22:20:00 +0100