First posted on July 30, 2016 on medium. This version contains minor corrections and a few links.
When I enrolled in Computer Science in 1995, Data Science didn’t exist yet, but a lot of the algorithms we are still using already did. And this is not just because of the return of the neural networks, but also because probably not that much has fundamentally changed since back then. At least it feels to me this way. Which is funny considering that starting this year or so AI seems to finally have gone mainstream.
1995 sounds like an awful long time ago, before we had cloud computing, smartphones, or chatbots. But as I have learned these past years, it only feels like a long time ago if you haven’t been there yourself. There is something about the continuation of the self which pastes everything together and although a lot has changed, the world didn’t feel fundamentally different than it does today.
Not even Computer Science was nowhere as mainstream as it was today, that came later, with the first dot com bubble around the year 2000. Some people even questioned my choice to study computer science at all, because apparently programming computers was supposed to become so easy no specialists are required anymore.
Actually, artificial intelligence was one of the main reasons for me to study computer science. The idea to use it as an constructive approach to understanding the human mind seemed intriguing to me. I went through the first two years of training, made sure I picked up enough math for whatever would lie ahead, and finally arrived in my first AI lectured held by Joachim Buhmann, back then professor at the University of Bonn (where Sebastian Thrun was just about to leave for the US).
I would have to look up where in his lecture cycle I joined but he had two lectures on computer vision, one on pattern recognition (mostly from the old editions of the Duda & Hart book), and one in information theory (following closely the book by Cover & Thomas). The material was interesting enough, but also somewhat disappointing. As I now know, people stopped working on symbolic AI and instead stuck to more statistical approaches to learning, where learning essentially was reduced to the problem of picking the right function based on a finite amount of observations.
The computer vision lecture was even less about learning and relied more on explicit physical modelling to derive the right estimators, for example, to reconstruct motion from a video. The approach back then was much more biologically and physically motivated than nowadays. Neural networks existed, but everybody was pretty clear that they were just “another kind of function approximators.”
Everyone with the exception of Rolf Eckmiller, another professor where I worked as a student. Eckmiller had built his whole lab around the premise that “neural computation” was somehow inherently better than “conventional computation”. This was back in the days when NIPS had full tracks devoted to studying the physiology and working mechanisms of neurons, and there were people who believed there is something fundamentally different happening in our brains, maybe on a quantum level, that gives rise to the human mind, and that this difference is a blocker for having truly intelligent machines.
While Eckmiller was really good at selling his vision, most of his staff was thankfully much more down to earth. Maybe it is a very German thing, but everybody was pretty matter of fact about what these computational models could or couldn’t do, and that has stuck with me throughout my studies.
I graduated in October 2000 with a pretty farfetched master thesis trying to make a connection between learning and hard optimization problems, then started on my PhD thesis and stuck around in this area of research till 2015.
While there had always been attempts to prove industry relevance, it was a pretty academic endeavor for a long while, and the community was pretty closed up. There were individual success stories, for example around handwritten character recognition, but many of the companies around machine learning failed. One of these companies I remember was called Biowulf Technologies and one NIPS they went around recruiting people with a video which promised it to be the next “mathtopia”. In essence, this was the story of DeepMind, recruiting a bunch of excellent researchers and then hoping it will take off.
The whole community also revolved around one fashion to the next. One odd thing about machine learning as a whole is that there exist only a handful of fundamentally different problems like classification, regression, clustering, and so on, but a whole zoo of approaches. It is not like in physics (I assume) or mathematics where some generally agreed upon unsolved hard problems exist whose solution would advance the state of the art. This means that progress is often done laterally, by replacing existing approaches with a new one, still solving the same problem in a different way. For example, first there were neural networks. Then support vector machines came, claiming to be better because the associated optimization problem is convex. Then there was boosting, random forests, and so on, till the return of neural networks. I remember that Chinese Restaurant Processes were “hot” for two years, no idea what their significance is now.
Then there came Big Data and Data Science. Being still in academia at the time, it always felt to me as if this was definitely coming from the outside, possibly from companies like Google who had to actually deal with enormous amounts of data. Large scale learning always existed, for example for genomic data in bioinformatics, but one usually tried to solve problems by finding more efficient algorithms and approximations, not by parallelizing brute force.
Companies like Google finally proved that you can do something with massive amounts of data, and that finally changed the mainstream perception. Technologies like Hadoop and NoSQL also seemed very cool, skillfully marketing themselves as approaches so new, they wouldn’t suffer from the technological limitations of existing systems.
But where did this leave the machine learning researchers? My impression always was that they were happy that they finally got some recognition, but they were also not happy about the way this happened. To understand this, one has to be aware that most ML researchers aren’t computer scientists or very good or interested in coding. Many come from physics, mathematics or other sciences, where their rigorous mathematical training was an excellent fit for the algorithm and modeling heavy approach central to machine learning.
Hadoop on the other hand was extremely technical. Written in Java, a language perceived as being excessively enterprise-y at the time, it felt awkward and clunky compared to the fluency and interactiveness of first Matlab and then Python. Even those who did code usually did so in C++, and to them Java felt slow and heavy, especially for numerical calculations and simulations.
Still, there was no way around it, so they rebranded everything they did as Big Data, or began to stress, that Big Data only provides the infrastructure for large scale computations, but you need someone who “knows what he is doing” to make sense of the data.
Which is probably also not entirely wrong. In a way, I think this divide is still there. Python is definitely one if the languages of choice for doing data analysis, and technologies like Spark try to tap into that by providing Python bindings, whether it makes sense from a performance point of view or not.
Even before DeepDream, neural networks began making their return. Some people like Yann LeCun have always stuck to this approach, but maybe ten years ago, there where a few works which showed how to use layerwise pretraining and other tricks to train “deep” networks, that is larger networks than one previously thought possible.
The thing is, in order to train neural networks, you evaluate it on your training examples and then adjust all of the weights to make the error a bit smaller. If one writes the gradient across all weights down, it naturally occurs that one starts in the last layer and then propagate the error back. Somehow, the understanding was that the information about the error got smaller and smaller from layer to layer and that made it hard to train networks with many layers.
I’m not sure that is still true, as far as I know, many people are just using backprop nowadays. What has definitely changed is the amount of available data, as well as the availability of tools and raw computing power.
So first there were a few papers sparking the interest in neural networks, then people started using them again, and successively achieved excellent results for a number of application areas. First in computer vision, then also for speech processing, and so on.
I think the appeal here definitely is that you can have one approach for all. Why the hassle of understanding all those different approaches, which come from so many different backgrounds, when you can understand just one method and you are good to go. Also, neural networks have a nice modular structure, you can pick and put together different kinds of layers and architectures to adapt them to all kinds of problems.
Then Google published that ingenious deep dream paper where they let a learned network generate some data, and we humans with our immediate readiness to read structure and attribute intelligence picked up quickly on this.
I personally think they were surprised by how viral this went, but then decided the time is finally right to go all in on AI. So now Google is an “AI first” company and AI is gonna save the world, yes.
Many academics I have talked to are unhappy about the dominance of deep learning right now, because it is an approach which works well, maybe even too well, but doesn’t bring us much closer to really understand how the human mind works.
I also think the fundamental problem remains unsolved. How do we understand the world? How do we create new concepts? Deep learning stays an imitation on a behavioral level and while that may be enough for some, it isn’t for me.
Also, I think it is dangerous to attribute too much intelligence to these systems. In raw numbers, they might work well enough, but when they fail they do so in ways that clearly show they operate in an entirely different fashion.
While Google translate lets you skim the content of website in a foreign language, it is still abundantly clear that the system has no idea what it is doing.
Sometimes I feel like nobody cares, also because nobody gets hurt, right? But maybe it is still my German cultural background that would rather prefer we see things as they are, and take it from there.
So this was definitely a long radio silence! Since I blogged last time, a lot has happened.
I’ve quit my PostDoc job and joined Zalando, a big (the biggest?) European fashion retailer (revenue in 2015: about three billion Euros). I’m a “delivery lead”, kind of a technical lead role for two teams. One is running the recommendation service for all of Zalando, the other team is creating a new search backend service. It definitely is a management position, so I don’t code much (except for sometimes), but I find leadership extremely interesting, and challenging, and Zalando with its agile culture seems like the perfect place to be for me right now.
One of my last projects at the university was a lecture series on Scalable Machine Learning. I had originally planned to spend about the same amount of time on Big Data technology as on the theoretical underpinnings, but after interventions from other professors who were concerned about people earning “double credits” it became the most mathematical piece of teaching I ever did. This was quite an experience. I had planned to prepare a few weeks worth in advance, but in the end, I was often spending all of Tuesday and Wednesday to churn out those 40-50 slides per week for the lecture on Thursday.
I attended StrataHadoop in London last year, where Ben Lorica talked me into doing a video on Scalable Machine Learning for O’Reilly (finally, also being allowed to talk about the technical side). We recorded the video in Amsterdam during OSCON in October in a single day (also, quite an experience). It is geared at people who already have a good understanding of Data Science and ML, but are not yet familiar with large scale learning, or Big Data technology, and is intended as a starting point into technologies like Spark.
So my interests have shifted a bit, away from pure machine learning towards topics like facilitating collaboration between data scientists and engineers. Yesterday I gave a talk titled Hardcore Data Science in Practice to present my current state of insights into the matter, and I put my slides on the Internet. Zalando has heavily invested in data scientists and it’s very interesting to see how that works out day to day. A friend of mine has remarked that I’m the only person he knows who uses the term “Data Scientist” unironically. ;)
I also greatly enjoy working with developers on a daily basis. After all, I’m a computer scientist by training, but having worked in machine learning for so long where people come from many backgrounds like physics, mathematics, and so on, I almost forgot about this.
One blog post that just didn’t happened was a lengthy treatement of my reasons for leaving academia. Personally it makes total sense to me now, and I think there is a lot left to improve in academia, but I don’t see the point in talking about it right now.
There is obviously a lot of things happening right now in the hypespace. Internet of Things, chat bots, and suddenly AI is back and there are a few upcoming things I just have to say about that, too ;)
So thanks for sticking around, hopefully the next post will not take another year to write.
In case you haven’t heard yet, Data Science is all the craze. Courses, posts, and schools are springing up everywhere. However, every time I take a look at one of those offerings, I see that a lot of emphasis is put on specific learning algorithms. Of course, understanding how logistic regression or deep learning works is cool, but once you start working with data, you find out that there are other things equally important, or maybe even more.
I can’t really blame these courses. I’ve done years of teaching machine learning at universities, and these lectures always focus very much on specific algorithms. You learn everything about support vector machines, Gaussian mixture models, k-Means clustering, and so on, but only when you work on your master thesis do you learn how to properly work with data.
So what does properly mean anyway? Don’t the ends justify the means? Isn’t everything ok as long as I get good predictive performance? That is certainly true, but the key is to make sure that you actually get good performance on future data. As I’ve written elsewhere, it’s just too simple to fool yourself into believing your method works when all you are looking at are results on training data.
So here are my three main insights you won’t easily find in books.
The main goal in data analysis/machine learning/data science (or however you want to call is), is to build a system which will perform well on future data. The distinction between supervised (like classification) and unsupervised learning (like clustering) makes it hard to talk about what this means in general, but in any case you will usually have some data set collected on which you build and design your method. But eventually you want to apply the method to future data, and you want to be sure that the method works well and produces the same kind of results you have seen on your original data set.
A mistake often done by beginners is to just look at the performance on the available data and then assume that it will work just as well on future data. Unfortunately that is seldom the case. Let’s just talk about supervised learning for now, where the task is to predict some outputs based on your inputs, for example, classify emails into spam and non-spam.
If you only consider the training data, then it’s very easy for a machine to return perfect predictions just by memorizing everything (unless the data is contradictory). Actually, this isn’t that uncommon even for humans. Remember when you were memorizing words in a foreign language and you had to made sure that you were testing the words out of order, because otherwise your brain would just memorize the words based on their order?
Machines with their massive capacity for storing and retrieving large amounts of data can do the same thing easily. This leads to overfitting, and lack of generalization.
So the proper way to evaluate is to simulate the effect that you have future data by splitting the data, training on one part and then predicting on the other part. Usually, the training part is larger, and this procedure is also iterated several times in order to get a few numbers to see how stable the method is. The resulting procedure is called cross-validation.
Still, a lot can go wrong, especially when the data is non-stationary, that is, the underlying distribution of the data is changing over time. Which often happens when you are looking at data measured in the real world. Sales figures will look quite different in January than in June.
Or there is a lot of correlation between the data points, meaning that if you know one data point you already know a lot about another data point. For example, if you take stock prices, they usually don’t jump around a lot from one day to the other, so that doing the training/test split randomly by day leads to training and test data sets which are highly correlated.
Whenever that happens, you will get performance numbers which are overly optimistic, and your method will not work well on true future data. In the worst case, you’ve finally convinced people to try out your method in the wild, and then it stops working, so learning how to properly evaluate is key!
Learning about a new method is exciting and all, but the truth is that most complex method essentially perform the same, and that the real difference is made by the way in which raw data is turned into features used in learning.
Modern learning methods are pretty powerful, easily dealing with tens of thousand of features and hundreds of thousand of data points, but the truth is that in the end, these methods are pretty dumb. Especially methods that learn a linear model (like logistic regression, or linear support vector machines) are essentially as dumb as your calculator.
They are really good at identifying the informative features given enough data, but if the information isn’t in there, or not representable by a linear combination of input features, there is little they can do. The are also not able to do this kind of data reduction themselves by having “insights” about the data.
Put differently, you can massively reduce the amount of data you need by finding the right features. Hypothetically speaking, if you reduced all the features to the function you want to predict, there is nothing left to learn, right? That is how powerful feature extraction is!
This means two things: First of all, you should make sure that you master one of those nearly equivalent methods, but then you can stick with them. So you don’t really need logistic regression and linear SVMs, you can just pick one. This involves also understanding which methods are nearly the same, where the key point lies in the underlying model. So deep learning is something different, but linear models are mostly the same in terms of expressive power. Still, training time, sparsity of the solution, etc. may differ, but you will get the same predictive performance in most cases.
Second of all, you should learn all about feature engineering. Unfortunately, this is more of an art, and almost not covered in any of the textbooks because there is so little theory to it. Normalization will go a long way. Sometimes, features need to be taken the logarithm of. Whenever you can eliminate some degree of freedom, that is, get rid of one way in which the data can change which is irrelevant to the prediction task, you have significantly lowered the amount of data you need to train well.
Sometimes it is very easy to spot these kinds of transformations. For example, if you are doing handwritten character recognition, it is pretty clear that colors don’t matter as long as you have a background and a foreground.
I know that textbooks often sell methods as being so powerful that you can just throw data against them and they will do the rest. Which is maybe also true from a theoretical viewpoint and an infinite source of data. But in reality, data and our time is finite, so finding informative features is absolutely essential.
Now this is something you don’t want to say too loudly in the age of Big Data, but most data sets will perfectly fit into your main memory. And your methods will probably also not take too long to run on the data. But you will spend a lot of time extracting features from the raw data and running cross-validation to compare different feature extraction pipelines and parameters for your learning method.
The problem is all in the combinatorial explosion. Let’s say you have just two parameters, and it takes about a minute to train your model and get a performance estimate on the hold out data set (properly evaluated as explained above). If you have five candidate values for each of the parameters, and you perform 5-fold cross-validation (splitting the data set into five parts and running the test five times, using a different part for testing in each iteration), this means that you will already do 125 runs to find out which method works well, and instead of one minute you wait about two hours.
The good message here is that this is easily parallelizable, because the different runs are entirely independent of one another. The same holds for feature extraction where you usually apply the same operation (parsing, extraction, conversion, etc.) to each data set independently, leading to something which is called “embarrasingly parallel” (yes, that’s a technical term).
The bad message here is mostly for the Big Data guys, because all of this means that there is seldom the need for scalable implementations of complex methods, but already running the same undistributed algorithm on data in memory in parallel would be very helpful in most cases.
Of course, there exist applications like learning global models from terabytes of log data for ad optimization, or recommendation for million of users, but bread-and-butter use cases are often of the type described here.
Finally, having lots of data by itself does not mean that you really need all the data, either. The questions is much more about the complexity of the underlying learning problem. If the problem can be solved by a simple model, you don’t need that much data to infer the parameters of your model. In that case, taking a random subset of the data might already help a lot. And as I said above, sometimes, the right feature representation can also help tremendously in bringing down the number of data points needed.
In summary, knowing how to evaluate properly can help a lot to reduce the risk that the method won’t perform on future data. Getting the feature extraction right is maybe the most effective lever to pull to get good results, and finally, it doesn’t always to have Big Data, although distributed computation can help to bring down training times.
I’m contemplating of putting together an ebook with articles like this one and some hands on stuff to get you started with data science. If you want to show your support, you can sign up here to get notified when the book is published.