MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Big Data and Market Research

About a month ago I was invited to speak at a meeting of the Bundesverband der Markt- und Sozialforscher (German association of market and social researchers) on Big Data and Social Media. It was a one-day meeting held in Munich with the aim of giving market research people an overview of the Big Data phenomenon and discussing its potential.

Social-media-wise, there were quite a few prolific German Twitterers at the meeting, including Stephan Noller, the CEO of nugg.ad, Jörg Blumtritt of MediaCom, and Benedikt Köhler. Actually, there was such a buzz that by midday, the meeting’s hashtag was trending on German Twitter. This was also one of those meetings where my phone lit up with notifications after my talk because people had been live-tweeting pictures from it.

For me as a data science guy, it was quite interesting to get some insights into what big data and social media mean for marketing people. For example, while it is pretty clear that there’s a plethora of data on Facebook and Twitter, market research people don’t quite know how to make the best use of it because there is such a strong, unknown bias in the sampling.

Normally market research people use all kinds of techniques to make sure that their sample is representative. This allows them to make quite accurate predictions for the whole population based on a relatively small set of data. On the other hand, data is abundant on social networks, but it is completely unclear how these numbers relate to the whole population.
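To make the sampling problem a bit more concrete, here is a minimal sketch with completely made-up numbers. It assumes a population with two hypothetical groups that differ both in how likely they are to be on a platform and in how they feel about some product. The names `STRATA`, `population_rate`, and `platform_rate` are mine, purely for illustration.

```python
# Hypothetical strata: (share of population, fraction active on the platform,
# fraction who approve of some product). All numbers are invented.
STRATA = {
    "younger": (0.30, 0.50, 0.70),
    "older": (0.70, 0.10, 0.30),
}


def population_rate(strata):
    """Approval rate in the whole population (what a representative sample estimates)."""
    return sum(share * approve for share, _, approve in strata.values())


def platform_rate(strata):
    """Approval rate among people who are active on the platform."""
    users = sum(share * active for share, active, _ in strata.values())
    approving = sum(
        share * active * approve for share, active, approve in strata.values()
    )
    return approving / users


if __name__ == "__main__":
    print("whole population:", round(population_rate(STRATA), 3))
    print("platform users:  ", round(platform_rate(STRATA), 3))
```

With these invented numbers the platform users look considerably more enthusiastic than the population as a whole. In practice the per-group activity rates are exactly what nobody knows, which is why the abundant data is so hard to interpret.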

Also, and this is probably a German phenomenon, people are still somewhat sceptical and consider social media a waste of time at best. If people don’t ask me directly why the heck I’m on Twitter, they almost always admit that they wouldn’t know what to do there. (Luckily, years of talking to people who admit that they’ve always been bad at math have prepared me for these awkward moments. In a way. At least I know the feeling.)

Anyway, the meeting closed with an interesting panel discussion where one of the conclusions was that big data will be a big topic, but probably not in connection with social media. Instead, they predicted that market research will move back into the companies, after a trend towards outsourcing in recent years. There, market research can be integrated much better with the whole company to drive decisions based on data in a closely knit fashion. Just as companies like Amazon are already doing, big data will play an important role in optimizing businesses.

News from the Buzzwordosphere: Fast Data and In-Memory Analytics

Ok, I get that Big Data is a pretty big hype right now, and I also get that humans like to give names to phenomena so they have a handle for thinking about them. However, the amount of buzzword bingo around the whole Big Data sphere is really just staggering.

ZDNet has a nicely interlinked blog post by Tony Baer on “Fast Data”, apparently the new tag for real-time, low-latency analytics. I recently attended the BerlinBuzzwords conference and can confirm that real-time analytics is in fact a pretty hot topic right now. Over and over again, people admitted that they really had no up-to-date analytics on the key performance indicators of their infrastructure, be it the number of games installed or other kinds of metrics, and they discussed different ways to run batch-processing systems like Hadoop in tiny iterations to get closer to real-time.
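The "tiny iterations" idea can be sketched roughly as follows. This is only a toy illustration in plain Python, not Hadoop: the event list, the ten-second window, and the function names `micro_batches`, `run_batch_job`, and `kpi_per_window` are all assumptions of mine.

```python
from collections import Counter

# Hypothetical event log: (timestamp in seconds, event name).
EVENTS = [
    (0, "install"), (3, "install"), (7, "uninstall"),
    (12, "install"), (14, "install"), (21, "uninstall"),
]


def micro_batches(events, window_seconds):
    """Group events into consecutive fixed-size time windows."""
    batches = {}
    for timestamp, name in events:
        window_start = (timestamp // window_seconds) * window_seconds
        batches.setdefault(window_start, []).append(name)
    return batches


def run_batch_job(names):
    """Stand-in for the batch job: count how often each event occurred."""
    return Counter(names)


def kpi_per_window(events, window_seconds=10):
    """Run the same batch job once per small window instead of once per day."""
    batches = micro_batches(events, window_seconds)
    return {start: run_batch_job(batch) for start, batch in sorted(batches.items())}


if __name__ == "__main__":
    for start, counts in kpi_per_window(EVENTS).items():
        print(start, dict(counts))
```

The smaller the window, the fresher the numbers, but each run still pays the full startup cost of the batch job, which is why this only gets you closer to real-time.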

Of course, real-time data mining is nothing new, and as I’ve already discussed elsewhere there exists a whole field of research called stream mining to deal with these topics. However, it looks like the industry is just beginning to adopt these techniques.
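As a small taste of what stream mining looks like, here is a sketch of a one-pass running mean and variance (Welford's method). I picked it because it is about the simplest algorithm that sees each item once and keeps only constant state. It is just an illustration, and the class name `RunningStats` is made up.

```python
class RunningStats:
    """One-pass mean and variance using Welford's update; constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold a single new observation into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far (0.0 for fewer than 2 items)."""
        if self.n < 2:
            return 0.0
        return self._m2 / (self.n - 1)


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # imagine these arriving one by one
        stats.update(value)
    print(stats.n, stats.mean, stats.variance())
```

The point is that nothing is ever stored or revisited, so the summary is current after every single event.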

Another insight I’ve also discussed is that disks are often too slow for real-time. Unless all of your requests can be served from the cache in memory, an access to disk is quite slow and you cannot get beyond a few hundred requests per second. Now from a machine learning point of view, having all your data in memory and running analysis methods on it isn’t anything special either. After all, all the usual frameworks like R, MATLAB, or SciPy work that way: read the data, clean the data, run an analysis, write a report. I’d say ML (and also data science or computational statistics for that matter) is so memory-centric that most of my colleagues view a database as just another storage format for data exchange.
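For readers from the database side, the read, clean, analyze, report workflow looks roughly like this. It is a minimal sketch using only the Python standard library; the tiny CSV snippet and the names `load`, `clean`, `analyze`, and `report` are invented for the example.

```python
import csv
import io
import statistics

# Hypothetical raw data, including a missing and a malformed value.
RAW = """user,installs
a,10
b,
c,14
d,not_a_number
e,12
"""


def load(text):
    """Read the whole data set into memory as a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Keep only rows whose 'installs' field parses as an integer."""
    values = []
    for row in rows:
        try:
            values.append(int(row["installs"]))
        except (TypeError, ValueError):
            continue
    return values


def analyze(values):
    """Run the analysis directly on the in-memory values."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def report(summary):
    """Write the report, here just a one-line string."""
    return "n={n} mean={mean} median={median}".format(**summary)


if __name__ == "__main__":
    print(report(analyze(clean(load(RAW)))))
```

Everything after `load` happens in memory; the file (or database) is only touched once at the start.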

However, the idea of using your memory for something other than disk caches seems to be so mind-bogglingly new to database guys that they invented a completely new term for it: “in-memory analytics.” Apparently in an effort to jump on the Big Data Buzzword Bandwagon, companies like SAP, Oracle, or SAS have started to offer “in-memory analytics” products and solutions which are basically just the way you normally process data, at least in my world ;)

I think we still need a few more buzzwords, so here are a few more suggestions:

  • Big Data Science: Neither Big Data nor Data Science is sufficient, we need Big Data Science!

  • Real-Time Big Data: IMHO, sounds better than Fast Data and also has Big Data in it, a clear win.

  • Small Data: That way, we can bring all existing algorithms which don’t quite scale back into the Big Data world. The main selling point here is that these methods are often exact and not just approximations, leading to much more accurate results!

I’m only half-joking here. A friend of mine who works at Teradata has told me that classical database vendors have started to interpret NoSQL as “Not Only SQL”.

Talk: Scalability Challenges in Big Data Science


Yesterday I gave a talk on scalability and machine learning at the BerlinBuzzwords conference. I gave an overview of different ways to scale data analysis and machine learning methods. I covered MapReduce (of course) and large-scale training of SVMs via stochastic gradient descent, but also stream mining and real-time (as you know, “you don’t just scale into real-time”).
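For those who have not seen it before, training a linear SVM with stochastic gradient descent boils down to something like the sketch below. This is not the code from the talk, just a minimal illustration under my own assumptions: a fixed learning rate, an L2 penalty, a toy data set, and the made-up names `train_linear_svm_sgd`, `predict`, and `TOY_DATA`.

```python
import random

# Hypothetical, linearly separable toy data: (features, label in {-1, +1}).
TOY_DATA = [
    ((2.0, 2.0), 1), ((3.0, 1.0), 1), ((1.0, 3.0), 1), ((2.0, 3.0), 1),
    ((-2.0, -2.0), -1), ((-3.0, -1.0), -1), ((-1.0, -3.0), -1), ((-2.0, -1.0), -1),
]


def train_linear_svm_sgd(data, epochs=20, eta=0.1, lam=0.01, seed=0):
    """SGD on the L2-regularized hinge loss, one example at a time."""
    rng = random.Random(seed)
    examples = list(data)
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Gradient step for the regularizer: shrink the weights a little.
            w = [wi * (1.0 - eta * lam) for wi in w]
            # Gradient step for the hinge loss, only if the margin is violated.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def predict(w, b, x):
    """Classify a point by the sign of the decision function."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    w, b = train_linear_svm_sgd(TOY_DATA)
    print("weights:", w, "bias:", b)
    print("predictions:", [predict(w, b, x) for x, _ in TOY_DATA])
```

Each update touches a single example and costs time proportional to the number of features, which is what makes this approach attractive for large data sets.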

The conference continues today. You can follow it on Twitter under the #bbuzz hashtag.

Update: On scribd, the hyperlinks are somehow lost, so here is the list:

Scalable Databases

Multithreading and Messaging Frameworks

MapReduce

Large Scale Classifier Training

Other frameworks

Stream processing

TWIMPACT: