My thoughts on the NY Times article: Troves of Personal Data, Forbidden to Researchers

Cross-post from my tumblr

The NY Times has an article that basically complains that the big social networking sites aren’t releasing their data and that this is hurting research.

Actually, I can understand the companies here. Releasing such data is a big privacy issue because it’s very hard to make sure the data is truly anonymized. Does anyone still remember why there wasn’t a second Netflix competition? Netflix got sued after the first run and decided to cancel the follow-up because they couldn’t guarantee their users’ privacy.

For many of those companies, that big pile of data is basically all they have, so they won’t just give it away for free, whether for research purposes or not.

Also, data has always been pretty scarce in social network research. If you look at review articles on social networks like this one, you see that most of the research has focused on a small number of data sets, for example the karate club data set, the dolphin data set, or the monastery data set, all of which were assembled by hand by researchers. Ironically, the largest available data set so far is the Enron data set, which was released in the course of the legal proceedings that followed Enron’s collapse.

So I think it’s wrong to expect companies like Twitter to happily release a substantial portion of their data for research purposes. On the other hand, I also think there is a very real problem with poorly validated research in that area. For example, Daniel Gayo-Avello has a very interesting review article on arXiv in which he argues that many papers on predicting elections are seriously flawed. Another example is the paper “Twitter mood predicts the stock market” by Johan Bollen et al., which also has serious methodological flaws.

Again, I think it is wrong to blame the lack of available data here. Of course it’s easier to validate research if you have the data to rerun the experiments and analyses, but I think (as I’ve said before) that we also need to resist the urge to jump on the current big data and data science bandwagon and get back to doing properly validated research in the first place.
