Scalability has become one of those core concept slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast.
The terms “scalable” and “large scale” have been used in machine learning circles long before there was Big Data. There had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. So finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question.
Interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the Big Data kind of sense. Part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today.
Instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently.
To illustrate this, let’s search for NIPS papers (the annual Advances in Neural Information Processing Systems, short NIPS, conference is one of the big ML community meetings) for papers which have the term “scalable” in the title.
Here are some examples:
Scalable Inference for Logistic-Normal Topic Models
… This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation …
Partially collapsed Gibbs sampling is a kind of estimation algorithm for certain graphical models.
A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
… With […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours …
Stochastic variational inference algorithm is both an approximation and an estimation algorithm.
Scalable kernels for graphs with continuous attributes
… In this paper, we present a class of path kernels with computational complexity $O(n^2(m + \delta^2 ))$ …
And this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could.
Usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. Still, the actual “scalability” comes from the algorithmic side.
So how do scalable ML algorithms look like? A typical example are the stochastic gradient descent (SGD) class of algorithms. These algorithms can be used, for example, to train classifiers like linear SVMs or logistic regression. One data point is considered at each iteration. The prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller.
Vowpal Wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning:
There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation.
So “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. For SGD type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. The main problem to speed this kind of computation up is how to stream the data by fast enough.
To put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily.
I know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of Spark, but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. While this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set.
If you want to know more, this large scale learning challenge Sören Sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets.
Of course, there are things which can be easily scaled well using Hadoop or Spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. Another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. Again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine.
I’ve just scratched the surface of this, but I hope you got the idea that scalability can mean quite different things. In Big Data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. In machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. Often, this allows you to speed up your computations significantly by decoupling computations. Parallelization is important, too, but alone it won’t get you very far.
Luckily, there are projects like Spark and Stratosphere/Flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.
The DIMA group at TU Berlin have a very interesting project which at first looks pretty similar to Apache Spark, the quite hyped Map-Reduce successor, but underneath the hood is has some interesting new ideas. Alexander Alexandrov recently gave an interesting overview talk at the TU Berlin, which I’ll try to summarize in the following.
The project, which originally conceived as Stratosphere, but recently added as an incubator project to Apache and undergoing a renaming process towards Flink (German for nimble, swift, speedy), is an in-memory scalable data processing framework. Like Spark, the exposed API looks like a function collection API. That is, your data is essentially sequences and list of any kind of data, and you have methods like map, groupBy, or reduce to perform computations on these data sets which will then automatically be scaled out to the cluster.
The main difference to Spark is that Flink takes an approach known from databases to optimize queries in several ways. Flink calls this a declarative approach to data analysis, meaning that you don’t have to write down in painstaking detail how the data is to be processed, but instead describe in a higher level fashion what you want to compute. It’s a bit like when you’re kid says “I’m thirsty”, instead of asking you politely to hand over the juice.
This idea itself is anything but knew but is a cornerstone of relational database technology. There, an SQL query describes in a high-level fashion what you want to compute, and then the database itself decides how to exactly perform that query, using the availability of secondary indices, statistics about relationships, size of involved tables, and so on, to find a solution which executes in the shortest amount of time.
In databases, this optimization step is usually guided by estimating the cost of performing a certain sequence of operations. Then, one can optimize the query in a number of ways:
Operations can be changed using equivalent expressions or by reordering operations. For example, if there is some filtering or selection operation involved, one can try to perform this filtering early on.
Different alternatives exist to perform the same operation, depending on the size of the data, the existance of secondary indices and so on. Databases often use statistics to estimate the cost of something.
So far so good. Now the goal of the Flink project is to extend this approach also to scalable data processing. There, even more dimensions of optimization potential exist, in particular to try minimizing the amount of data shuffling. Sometimes, you can get faster execution times if you presort your data on your cluster, so that subsequent operations can be performed locally. However, even more time could be saved if the query can be formulated such that you don’t need to move the data as well.
This means that when you write down a data processing pipeline in Flink, it will actually turn that into an intermediate representation and optimize the query further, using reorderings of operations, and selecting the appropriate algorithms to achieve the best performance.
In addition, Flink also provides higher level abstractions for iterations (unlike Spark so far). This makes sense, because a more expressive query language means you have more potential for global optimization.
Well, at least in principle. Already for databases, these optimization problems are very hard, and heuristics must be used to achieve good results. In a way, the challenge here is to have a powerful query language without having a fully general purpose programming language at which point one is dealing with very general questions of code optimization.
But there is another area where I think that Flink is quite interesting: Such query optimization systems are always based on a specific data modle. For databases, these are typically the kinds of collections of typed tuples everyone knows, but this is another area where you can in principle extend the system. One such project which is still in its infancy, unfortunately, is formulating a matrix library based on Flink which could then benefit from same query optimization techniques, for example, to reorder matrix operations to minimize computation time. Combined with the scalability of Flink, this promises some powerful potential for optimization such that the data scientist can focus on what to compute without spending too much time on how to do it.
Find out more on their project page: Stratosphere
Last Tuesday, I gave a talk at this year’s Berlin Buzzword conference on using stream mining algorithms to efficiently store information extracted from user behavior to perform personalization and recommendation effectively already using a single computer, which is of course key behind streamdrill.
If you’ve been following my talks, you’ll probably recognize a lot of stuff I’ve talked about before, but what is new in this talk is that I tried to take the next step from simply talking about Heavy Hitters and Count- Min Sketches to using these data structures as an approximate storage for all kinds of analytics related data like counts, profiles, or even sparse matrices, as they occur recommendations algorithms.
I think reformulating our approach as basically an efficient approximate data structure also helped to steer the discussion away from comparing streamdrill to other big data frameworks (“Can’t you just do that in Storm?” — “define ‘just’”). As I said in the talk, the question is not whether you can do it in Big Data Framework X, because you probably could. I have started look at it from the other direction: we did not use any Big Data framework and were still able to achieve some serious performance numbers.
In any case, below are the slides and the video from the talk.