MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

My thoughts on the NY Times article: Troves of Personal Data, Forbidden to Researchers

Cross-post from my tumblr

The NY Times has an article basically complaining that the big social network sites aren’t releasing their data and that this is hurting research.

Actually, I can understand the companies here. Releasing such data is a big privacy issue because it’s very hard to make sure the data is properly anonymized. Does anyone still remember why there wasn’t a second Netflix competition? Netflix was sued after the first run and decided to cancel the sequel because they couldn’t guarantee the privacy of their users.

For many of those companies, that big pile of data is basically all they have, so they won’t just give it away for free, be it for research purposes or not.

Also, data has always been pretty scarce in social network research. If you look at review articles on social networks like this one, you see that most of the research has focused on a small number of data sets, for example, the karate club data set, the dolphin data set, or the monastery data set, all of which were assembled by hand by individual researchers. Ironically, the largest data set available so far is the Enron email data set, which was made public as part of the legal proceedings around Enron’s bankruptcy.
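To give a sense of how small these classic data sets are, here is a minimal sketch (assuming the networkx Python package is available) that loads Zachary’s karate club graph and prints its size:

```python
# A minimal sketch: load Zachary's karate club graph, one of the classic
# hand-assembled social network data sets, and look at its size.
# Assumes the networkx package is installed (pip install networkx).
import networkx as nx

G = nx.karate_club_graph()
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
# Prints: 34 nodes, 78 edges -- tiny compared to what the big social
# network sites are sitting on.
```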

So I think it’s wrong to expect companies like Twitter to happily release a substantial portion of their data for research purposes. On the other hand, I also think there is a very real problem of poorly validated research in that area. For example, Daniel Gayo-Avello has a very interesting review article on arXiv where he argues that many papers on predicting elections from Twitter data are seriously flawed. Another example is the paper “Twitter mood predicts the stock market” by Johan Bollen et al., which also has serious methodological flaws.

Again, I think it is wrong to blame the lack of available data here. Of course it’s easier to validate research if you have the data to rerun the experiments and analyses, but I think (as I’ve said before) that we also need to resist the urge to jump on the current big data and data science wave and get back to doing properly validated research in the first place.
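Just to make the point about validation concrete, here is a minimal sketch (purely synthetic data, not taken from any of the papers mentioned) of the kind of check that is often missing: evaluating a predictive claim on held-out data and comparing it against a trivial baseline.

```python
# A minimal sketch of out-of-sample validation against a naive baseline.
# The data is purely synthetic; only the procedure matters here.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # e.g. daily "mood" features
y = rng.normal(size=200)        # e.g. next-day returns (here: pure noise)

# Split chronologically: fit on the past, evaluate only on the future.
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)
mse_model = np.mean((model.predict(X_test) - y_test) ** 2)
mse_baseline = np.mean((y_train.mean() - y_test) ** 2)  # always predict the mean

print(f"model MSE: {mse_model:.3f}, baseline MSE: {mse_baseline:.3f}")
# If the model doesn't clearly beat the trivial baseline on data it has
# never seen, the predictive claim doesn't hold up.
```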

Video for talk: 'TWIMPACT: Real-Time Twitter Analysis'

The video for my talk on real-time Twitter analysis is online.

The talk was given at the Apache Hadoop Get Together on April 18, 2012 in Berlin, Germany.

Introducing Data Science Seminars

Update June 13, 2012: Unfortunately, we didn’t get enough registrations for the June date, so we postponed the seminar. You can still register to indicate that you’re interested. Once we have enough registrations, we’ll look for a new date, and you can then choose whether to confirm your registration or to unregister.

I’m very excited to announce our first Data Science Seminar. It will be a one-day seminar taking place on June 8th, 2012, in Berlin. The seminar is aimed at professionals who deal with data in their jobs and want to learn how to extract meaningful information from it for whatever business they are working in.

The approach we take is inspired by the way we have been teaching a practical course in machine learning at TU Berlin for the last few years. It is a hugely successful (well, at least based on the feedback we’re getting, sometimes even years later) one-semester course where we discuss basic algorithms for all kinds of data analysis questions with a strong focus on practical experience.

If you go to a typical data analysis lecture or read a book like “The Elements of Statistical Learning”, “Pattern Recognition and Machine Learning”, or “Pattern Classification”, the approach is usually quite heavy on the mathematical foundations, often at the expense of intuition. The concepts and ideas behind the formulas are something you have to uncover for yourself, which requires quite some experience with that kind of thinking.

For the practical course, we took a different approach which emphasized concepts and practical experience with the algorithm at hand. This was done as follows: We first explained the problem the algorithm tries to solve and described how the algorithm works in general terms. Then we went through the algorithm in pseudo-code. Next, the students were asked to implement the algorithm, apply it to a number of given data sets, and play with the data and the parameters of the algorithm to get some intuition for how it works. Implementing the algorithms and hunting down the bugs in their own implementations proved to be very instructive as well. At the end of the course, each student had implemented half a dozen algorithms and had developed a good feeling for the strengths and weaknesses of the algorithms and for how they work.
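To give a flavor of the kind of exercise I mean (an illustrative sketch, not actual course material), here is a nearest-centroid classifier implemented from scratch with numpy and applied to a small synthetic data set:

```python
# An illustrative sketch of a course-style exercise: implement a simple
# algorithm (nearest-centroid classification) from scratch with numpy
# and try it on a small synthetic data set. Not actual course material.
import numpy as np

def fit_centroids(X, y):
    """Compute one mean vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    """Assign each point to the class with the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Two Gaussian blobs as toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

classes, centroids = fit_centroids(X, y)
accuracy = np.mean(predict(X, classes, centroids) == y)
print(f"training accuracy: {accuracy:.2f}")
# Playing with the blob separation or the noise level quickly shows
# where such a simple classifier breaks down.
```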

The data science seminar will follow in this spirit and focus on concepts and ideas more than on mathematical formulas. We will also touch upon topics like how to represent your data, or how to validate that the results produced by your analysis are actually reliable, which are often underrepresented in textbooks. We will also share our experience in working with real-world data sets and tell you about the dos and don’ts of data analysis.
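As a small taste of the validation topic (again a sketch on synthetic data using scikit-learn, not seminar material), here is how the accuracy measured on the training data can differ from a cross-validated estimate:

```python
# A small sketch of the validation theme: training accuracy versus
# cross-validated accuracy on synthetic data, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = DecisionTreeClassifier(random_state=0)
train_acc = model.fit(X, y).score(X, y)
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.2f}")       # typically 1.00
print(f"cross-validated accuracy: {cv_acc:.2f}")   # noticeably lower
# The gap between the two numbers is exactly the kind of thing one needs
# to check before trusting the results of an analysis.
```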

You can register for the course on the course website. If you need more information or have questions, write an email to contact@datascience-berlin.de. We also have a number of 10% discount coupons which we’ll hand out to the first five people asking for them at the email address above, so be quick if you want that discount ;)