MARGINALLY INTERESTING


MACHINE LEARNING, COMPUTER SCIENCE, JAZZ, AND ALL THAT

Data Analysis: The Hard Parts

I don’t know whether this word exists, but mainstreamification is what’s happening to data analysis right now. Projects like Pandas or scikit-learn are open source, free, and allow anyone with some Python skills to do some serious data analysis. Projects like MLbase or Apache Mahout work to make data analysis scalable, so that you can tackle those terabytes of old log data right away. Events like the Urban Data Hack, which just took place in London, show how easy it has become to do some pretty impressive stuff with data.

The general message is: Data analysis has become super easy. But has it? I think people want it to be, because they have understood what data analysis can do for them, but there is a real shortage of people who are good at it. So the usual technological solution is to write tools which empower more people to do it. And for many problems, I agree that this is how it works. You don’t need to know TCP/IP to fetch some data from the Internet because there are libraries for that, right?

For a number of reasons, I don’t think you can “toolify” data analysis that easily. I wish it were that simple, but from hard-won experience with my own work and from teaching people this stuff, I’d say it takes a lot of experience to do properly, and you need to know what you’re doing. Otherwise you will build stuff which breaks horribly once put into action on real data.

And I don’t write this because I don’t like the projects which exist, but because I think it is important to understand that you can’t just hand a few coders new tools and expect them to produce something which works. And depending on how you want to use data analysis in your company, this might make or break your company.

So my top four reasons are:

  1. data analysis is so easy to get wrong
  2. it’s too easy to lie to yourself about it working
  3. it’s very hard to tell whether it could work if it doesn’t
  4. there is no free lunch

Let’s take these one at a time.

Data Analysis is so easy to get wrong

If you use a library to fetch some data from the Internet, it will give you all kinds of error messages when you do something wrong. It will tell you if the host doesn’t exist or if you called the methods in the wrong order. The same is not true for most data analysis methods, because these are numerical algorithms which will produce some output even if the input data doesn’t make sense.

In a sense, Garbage In Garbage Out is even more true for data analysis. And there are so many ways to get this wrong, like discarding important information in a preprocessing step, or accidentally working on the wrong variables. The algorithms don’t care, they’ll give you a result anyway.

The main problem here is that you’ll probably not even notice it, apart from the fact that the performance of your algorithms isn’t what you expect it to be. In particular, when you work with many input features, there is really no way to look at the data. You are basically just working with large tables.

This is not just hypothetical. I have experienced many situations where exactly that happened: people accidentally permuting all their data because they messed up reading it from the files, or applying some other non-obvious preprocessing step which destroyed all the information in the data.
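To make this concrete, here is a minimal sketch (made-up data, using scikit-learn since the projects above are Python ones) of how an accidental label permutation fails silently: the misaligned version raises no error at all, it just produces chance-level accuracy.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)                          # five input features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)        # labels clearly depend on the features

    # correct pairing: accuracy well above chance
    print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())

    # labels accidentally permuted (e.g. files read in a different order):
    # no error, no warning, just chance-level accuracy around 0.5
    y_permuted = rng.permutation(y)
    print(cross_val_score(LogisticRegression(), X, y_permuted, cv=5).mean())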

So you always need to be aware of what you are doing and mentally trace the steps to form a well-informed expectation of what should happen. It’s debugging without an error message; often you just have a gut feeling that something is quite wrong.

Sometimes the problem isn’t even that the performance is bad, but that it is suspiciously good. Let’s come to that next.

It’s too easy to lie to yourself about it working

The goal in data analysis is always good performance on future, unseen data. This is quite a challenge. Usually you start working from collected data, which you hope is representative of the future data. But it is so easy to fool yourself into thinking it works.

The most important rule is that only performance measured on data which you haven’t used in any way during training is a reliable indicator of future performance. However, this rule can be violated in many, sometimes subtle, ways.

The classical novice mistake is to take the whole data set, train an SVM or some other algorithm, and look at the performance on the very data you used for training. Obviously, it will be quite good. In fact, you can achieve perfect predictions by just outputting the values you saw during training (at least if they are unambiguous), without any real learning taking place at all.
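Here is a sketch of that effect on pure noise data, so there is genuinely nothing to learn: a flexible model memorizes the training set, and evaluating on that same set looks great, while a held-out test set reveals chance-level performance.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.randn(300, 10)
    y = rng.randint(0, 2, 300)         # random labels: the true accuracy is 50%

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    clf = SVC(gamma=2.0).fit(X_train, y_train)   # narrow kernel, memorizes the training set
    print(clf.score(X_train, y_train))           # close to 1.0: looks great
    print(clf.score(X_test, y_test))             # around 0.5: chance level on unseen data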

But even if you split your data right, people often make the mistake of using information from the test data during preprocessing (for example for centering, or for building dictionaries, etc.). You then do the actual training only on the training data, but through the preprocessing, information about the test data has silently crept into the model as well, giving results which are much better than what you can realistically expect on real data.
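A sketch of how this leak plays out, again on made-up noise data: selecting “informative” features on the full data set before cross-validation makes random labels look learnable, while keeping the preprocessing inside a pipeline, so that it only ever sees the training folds, removes the illusion.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.randn(100, 1000)           # pure noise features
    y = rng.randint(0, 2, 100)         # random labels: nothing to learn

    # Wrong: the feature selection has seen the test folds, the estimate is wildly optimistic
    X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(), X_selected, y, cv=5).mean())

    # Right: the preprocessing is re-fit on each training fold only, back to about 0.5
    model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    print(cross_val_score(model, X, y, cv=5).mean())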

Finally, even if you do proper testing and evaluation of your method, your estimates of future performance will become optimistic as you try out many different approaches, because you implicitly optimize for the test set as well. This is called multiple testing, and it is something one has to be aware of, too.
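One safeguard is nested cross-validation, sketched here on a synthetic data set: the hyperparameter search runs inside an inner cross-validation, and an outer one estimates the performance of the whole selection procedure, so trying many settings does not leak into the final number.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # the inner CV picks the parameters, the outer CV evaluates the whole procedure
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=3)
    print(cross_val_score(search, X, y, cv=5).mean())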

One can be trained to do all this properly, but if you are under pressure to produce results, you have to resist the temptation to just run with the first thing that gives good numbers. And it helps if you’ve gone down that route once and failed miserably.

And even if you evaluate according to all the tricks of the trade, the question remains whether the data you worked on was really representative of future data.

It’s very hard to tell whether it could work if it doesn’t

A different problem is that it is fundamentally difficult to know whether you can do better when your current approach doesn’t work well. The first thing you try will most likely not work, and probably neither will the next, and at that point you need someone with experience to tell you whether there is a chance or not.

There is really no way to automatically tell whether a certain approach works or not. The algorithms just extract whatever information fits their model and the representation of the data, but there are many, many ways to do this differently, and that’s when you need a human expert.

Over time you develop a feeling for whether a certain piece of information is contained in the data or not, and ways to make that information more prominent through some form of preprocessing.

Tools only provide you with possibilities, but you need to know how to use them.

There is no free lunch

Now you might think “but can’t we build all that into the tools?” Self-healing tools which tell you when you make mistakes and automatically find the right preprocessing? I’m not saying that it’s impossible, but these are problems which are still hard and unsolved in research.

Also, there is no universally optimal learning algorithm, as shown by the No Free Lunch Theorem: there is no algorithm which is better than all the rest for all kinds of data.

No way around learning data analysis skills

So in essence, there is no way around properly learning data analysis skills. Just like you wouldn’t hand a blowtorch to just anyone, you need proper training so that you know what you’re doing and can produce robust and reliable results which deliver in the real world. Unfortunately, this training is hard, as it requires familiarity with at least linear algebra and the concepts of statistics and probability theory, stuff which classical coders are often not that well trained in.

Still, it’s pretty awesome to have those tools around; back when I started my Ph.D., everyone had their own private stack of code. Which is OK if you’re in research and need to implement methods from scratch anyway. So we’re definitely better off nowadays.

What is going on with DeepMind and Google?

As you might have heard, Google has acquired DeepMind, a London-based artificial intelligence startup, for an undisclosed sum, although rumor has it that the sum was somewhere close to $500M. That is a lot of money for a company which hasn’t released a product or service yet and has practically been in stealth mode since its beginning. I had heard about them before, but only because someone asked me whether I knew them and what they were up to, since they seemed to have a lot of money.

So what is going on? Is this the next bubble, the AI bubble? One cannot deny that there is a lot of interest in certain kinds of learning algorithms, in particular deep learning algorithms. I don’t want to argue whether this is AI or not, but such algorithms have proven to work well when dealing with data like images or sound, and “understanding” this kind of data to get better search and discovery is quite important to companies like Google or Facebook.

Companies have already invested quite heavily in this area (although not in this price range). In March 2013, Google bought DNNresearch, a company involving neural network veteran Geoff Hinton (who co-invented the backprop training algorithm, among other things). Hinton later joined Google part time. In December 2013, Facebook announced at the annual NIPS conference that Yann LeCun, another neural network veteran, would join Facebook to head their research center. Amazon has set up a new research lab headed by Ralf Herbrich, who formerly worked at Microsoft Research as well as Facebook, with offices in Seattle, Berlin, and Bangalore, attracting senior machine learning people as well.

Others on Twitter were also asking themselves what was going on, and over the past day and a half we have been putting together some pieces of the puzzle.

First of all, DeepMind really has a very strong talent pool. I haven’t checked all of them, but many are senior researchers with an excellent standing in the machine learning community. Probably not for being super-applied, but they are nonetheless very bright people. Shane Legg, one of the co-founders, has worked in an area of ML which employs certain complexity-based measures to construct “universal” learning machines with nice theoretical properties but very little practical impact. In fact, most of the underlying measures aren’t even computable. (Yes, that’s right: you can prove that you cannot write a program which computes them.)

Others who are known to work for DeepMind include Alex Graves, who also worked with Geoff Hinton and has worked on recurrent neural networks, which are well suited to time-series data, in particular audio. Apparently, he holds the state of the art on the TIMIT corpus for speech recognition. Then there is Koray Kavukcuoglu, who has also worked on deep learning, in particular for vision. He is a co-author of the Torch7 machine learning library, and in the discussion we asked half-seriously whether Google wanted to make sure that Facebook didn’t get the whole Torch team, as the other two authors of Torch have close ties to Yann LeCun and might therefore prefer to go to Facebook if they decided to leave their current positions in academia.

Apparently, DeepMind had a pretty impressive demo at the deep learning workshop at last year’s NIPS (yes, the one which Mark Zuckerberg attended), where they trained a computer to play Pong using reinforcement learning, a learning setting where there is only very little and indirect feedback for chosen actions. For input, the raw pixels on the screen were used, so the learning algorithm indeed had to do quite a lot of inference to somehow learn concepts such as what the ball is, what the bat is, and what the rules of the game are.

Martin Riedmiller, who also appears in the author list with a deepmind.com email address, is a well-known professor from Freiburg, Germany, who already has some experience applying reinforcement learning to real-world problems. In 2010, he gave a talk at Dagstuhl about controlling slot cars using only a video feed of the track.

So from all this, I think that DeepMind managed to attract a significant number of top-notch researchers from the field of machine learning. However, the demo, while technically impressive, is IMHO still close to the published state of the art. So saying that DeepMind has somehow come closer to “solving AI” than the rest of the community seems like a long shot. (I should probably add that I don’t think the current state of the art in machine learning is anywhere close to real AI, but that is another post’s worth of thoughts.)

People attending the deep learning workshop also reported that there was quite a bit of interest from both Facebook and Google towards DeepMind, but the talks with Facebook apparently led nowhere, maybe because Facebook was more interested in hiring a few of the people than in buying the whole company.

Which still leaves the question whether $500M was justified. According to recode.net, the company had raised $50M, which sets a certain lower limit from the investor side. I admit that my initial thought was “What The F…”, but now I think it’s probably justified if you consider that people who master this technology at this level are very rare, probably numbering in the low hundreds worldwide, and if an acquisition of DeepMind secures you 50 of them, that can be quite important strategically.

So what is going to happen? According to recode, DeepMind will join Google in the “Search” division, which already contains such well-reputed people as Samy Bengio, who joined Google a while ago and was principally responsible for improving their image search. So at least DeepMind won’t die the horrible death of never connecting with Google’s infrastructure, as there are already people there who understand very well how to turn academia-level research into something that works.

On the other hand, in the 15 years or so that I’ve followed the machine learning community, there has been a recurring pattern of companies hiring many bright people, usually under the promise that they can just work on interesting stuff. The first such company was WhizBang! labs in the late 90s; then came BIOwulf in the early 00s. One of those years, I heard stories about a promotional video they showed to potential recruits at NIPS, promising to build the next “mathtopia” inside BIOwulf.

In a way, that is the dream of every researcher: just do interesting and cool stuff, unencumbered by the traditions and bureaucracies of academia. So far, this has often led to companies which closed down a few years later because people took the promise a bit too seriously and did just that: work on interesting and cool stuff while neglecting the business side of it.

DeepMind would probably have met the same fate. Or not. Having raised such an enormous amount of money and closed this deal suggests that they are quite good at selling their idea and their company. But the current interest and arms race around deep learning technology certainly made it easier for them to have this fabulous exit.

Of course, now they have to deal with the bureaucracy and politics inside Google. I hope they will succeed, because worse than the occasional mind-bogglingly expensive acquisition would be the revelation that it’s not really worth it.

Thanks to beaucronin, johnmyleswhite, ogrisel, syhw, and quesada for their contributions and the interesting conversation!

Apache Spark: The Next Big Data Thing?

Apache Spark is generating quite some buzz right now. Databricks, the company founded to support Spark, raised $14M from Andreessen Horowitz, Cloudera has decided to fully support Spark, and others chime in that it’s the next big thing. So I thought it was high time I took a look to understand what all the buzz is about.

I played around with the Scala API (Spark is written in Scala), and to be honest, at first I was pretty underwhelmed, because Spark looked, well, so small. The basic abstraction is the Resilient Distributed Dataset (RDD), basically a distributed immutable collection, which can be defined based on local files or files stored on Hadoop via HDFS, and which provides the usual Scala-style collection operations like map, foreach, and so on.

My first reaction was “wait, is this basically just distributed collections?” Hadoop, in comparison, seemed to be so much more: a distributed file system, obviously MapReduce, with support for all kinds of data formats, data sources, unit testing, clustering variants, and so on.

Others quickly pointed out that there is more to it. Spark also provides more complex operations like joins, group-by, or reduce-by operations, so that you can model quite complex data flows (without iterations, though).

Over time it dawned on me that the perceived simplicity of Spark actually said a lot more about Hadoop’s Java API than about Spark. Even simple examples in Hadoop usually come with a lot of boilerplate code. But conceptually speaking, Hadoop is quite simple, as it only provides two basic operations: a parallel map and a reduce operation. If expressed in the same way on something resembling distributed collections, one would in fact have an even smaller interface (some projects like Scalding actually build such things, and the code looks pretty similar to Spark’s).

So after convincing myself that Spark actually provides a non-trivial set of operations (really hard to tell just from the ubiquitous word-count example), I dug deeper and read this paper which describes the general architecture. RDDs are the basic building blocks of Spark and really are something like distributed immutable collections. They define operations like map or foreach, which are easily parallelized, but also join operations, which take two RDDs and collect entries based on a common key, and reduce-by operations, which aggregate entries using a user-specified function based on a given key. In the word-count example, you map a text to all its words, each with a count of one, and then reduce them by key, using the word as the key and summing up the counts to get the word counts. RDDs can be read from disk but are then held in memory for improved speed, and they can also be cached so you don’t have to reread them every time. That alone adds a lot of speed compared to Hadoop, which is mostly disk-based.
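I played with the Scala API, but the same word count can be sketched just as well with the Python API; here is a minimal version (the input path is made up):

    from pyspark import SparkContext

    sc = SparkContext("local", "wordcount-sketch")

    lines = sc.textFile("hdfs:///some/input.txt")         # hypothetical path, one RDD entry per line
    counts = (lines.flatMap(lambda line: line.split())    # one entry per word
                   .map(lambda word: (word, 1))           # map each word to a count of one
                   .reduceByKey(lambda a, b: a + b))      # sum up the counts per word

    counts.cache()                                        # keep the result in memory for further steps
    print(counts.take(10))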

Now what’s interesting is Spark’s approach to fault tolerance. Instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations which led to a certain data set. So when a node fails, Spark reconstructs the data set based on the stored information. They argue that this is actually not that bad because the other nodes will help in the reconstruction.

So in essence, compared to bare Hadoop, Spark has a smaller interface (which might still become similarly bloated in the future), but there are many projects on top of Hadoop (like Twitter’s Scalding, for example) which achieve a similar level of expressiveness. The other main difference is that Spark keeps data in memory by default, which naturally leads to a large improvement in performance and even allows you to run iterative algorithms. Spark has no built-in support for iterations, though; it’s just that they claim it’s so fast that you can run iterations if you want to.
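A small sketch of what that looks like with the Python API on some toy data I made up: the data set is cached once and then reused from memory in every pass of a simple gradient descent loop.

    import random
    from pyspark import SparkContext

    sc = SparkContext("local", "iteration-sketch")

    # toy data: (x, y) pairs with y roughly 3 * x, cached in memory after the first pass
    points = sc.parallelize([(i / 1000.0, 3.0 * i / 1000.0 + random.gauss(0, 0.1))
                             for i in range(1000)]).cache()
    n = points.count()

    w = 0.0                                # fit y = w * x by gradient descent
    for step in range(20):
        # each pass re-reads the cached RDD from memory instead of from disk
        grad = points.map(lambda p, w=w: (w * p[0] - p[1]) * p[0]).sum() / n
        w -= 1.5 * grad
    print(w)                               # ends up close to 3.0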

Spark Streaming - return of the micro-batch

Spark also comes with a streaming data processing model, which of course got me quite interested. There is again a paper which summarizes the design quite nicely. Spark follows an interesting approach that differs from frameworks like Twitter’s Storm. Storm is basically a pipeline into which you push individual events, which then get processed in a distributed fashion. Instead, Spark follows a model where events are collected and then processed at short time intervals (say, every 5 seconds) in a batch manner. The collected data becomes an RDD of its own, which is then processed using the usual set of Spark operations.
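Here is a minimal sketch of that micro-batch model with the Python streaming API (host and port are made up): events arriving on a socket are grouped into 5-second batches, and each batch is then processed with the same kind of operations you would use on a normal RDD.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "streaming-sketch")
    ssc = StreamingContext(sc, 5)                      # collect incoming events into 5-second batches

    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical event source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))   # word counts per batch
    counts.pprint()                                    # print the result of each batch

    ssc.start()
    ssc.awaitTermination()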

The authors claim that this model is more robust against slow nodes and failures, and that the 5-second interval is usually fast enough for most applications. I’m not so sure about this, as distributed computing is always pretty complex and I don’t think you can easily say that some approaches are generally better than others. It is certainly true, though, that this approach nicely unifies the streaming and non-streaming parts.

Final thoughts

What I saw looked pretty promising, and given the support and attention Spark receives, I’m pretty sure it will mature and become a strong player in the field. It’s not well suited for everything, though. As the authors themselves admit, it’s not really well suited to operations which require changing only a few entries in the data set at a time, due to the immutable nature of the RDDs. In principle, you have to make a copy of the whole data set even if you just want to change one entry. This can be nicely parallelized, but it is of course costly. More efficient implementations based on copy-on-write schemes might also work here, but are not implemented yet, if I’m not mistaken.

Stratosphere is a research project at TU Berlin which has similar goals, but takes the approach even further by including more complex operations like iterations, and by not only storing the sequence of operations for fault tolerance, but also using it for global optimization of scheduling and parallelization.

Immutability is quite in vogue right now, as it is easier to reason about, but I’d like to point you to this excellent article by Baron Schwartz on how you’ll always end up with mixed strategies (mutable and immutable data) to make things work in the real world.