Marginally Interesting: The future of Big Data (according to Stratosphere/Flink)

The DIMA group at TU Berlin have a very interesting project which at first looks pretty similar to Apache Spark, the quite hyped Map-Reduce successor, but underneath the hood is has some interesting new ideas. Alexander Alexandrov recently gave an interesting overview talk at the TU Berlin, which I’ll try to summarize in the following.

The project, which originally conceived as Stratosphere, but recently added as an incubator project to Apache and undergoing a renaming process towards Flink (German for nimble, swift, speedy), is an in-memory scalable data processing framework. Like Spark, the exposed API looks like a function collection API. That is, your data is essentially sequences and list of any kind of data, and you have methods like map, groupBy, or reduce to perform computations on these data sets which will then automatically be scaled out to the cluster.

The main difference to Spark is that Flink takes an approach known from databases to optimize queries in several ways. Flink calls this a declarative approach to data analysis, meaning that you don’t have to write down in painstaking detail how the data is to be processed, but instead describe in a higher level fashion what you want to compute. It’s a bit like when you’re kid says “I’m thirsty”, instead of asking you politely to hand over the juice.

This idea itself is anything but knew but is a cornerstone of relational database technology. There, an SQL query describes in a high-level fashion what you want to compute, and then the database itself decides how to exactly perform that query, using the availability of secondary indices, statistics about relationships, size of involved tables, and so on, to find a solution which executes in the shortest amount of time.

In databases, this optimization step is usually guided by estimating the cost of performing a certain sequence of operations. Then, one can optimize the query in a number of ways:

Operations can be changed using equivalent expressions or by reordering operations. For example, if there is some filtering or selection operation involved, one can try to perform this filtering early on.
Different alternatives exist to perform the same operation, depending on the size of the data, the existance of secondary indices and so on. Databases often use statistics to estimate the cost of something.

So far so good. Now the goal of the Flink project is to extend this approach also to scalable data processing. There, even more dimensions of optimization potential exist, in particular to try minimizing the amount of data shuffling. Sometimes, you can get faster execution times if you presort your data on your cluster, so that subsequent operations can be performed locally. However, even more time could be saved if the query can be formulated such that you don’t need to move the data as well.

This means that when you write down a data processing pipeline in Flink, it will actually turn that into an intermediate representation and optimize the query further, using reorderings of operations, and selecting the appropriate algorithms to achieve the best performance.

In addition, Flink also provides higher level abstractions for iterations (unlike Spark so far). This makes sense, because a more expressive query language means you have more potential for global optimization.

Well, at least in principle. Already for databases, these optimization problems are very hard, and heuristics must be used to achieve good results. In a way, the challenge here is to have a powerful query language without having a fully general purpose programming language at which point one is dealing with very general questions of code optimization.

But there is another area where I think that Flink is quite interesting: Such query optimization systems are always based on a specific data modle. For databases, these are typically the kinds of collections of typed tuples everyone knows, but this is another area where you can in principle extend the system. One such project which is still in its infancy, unfortunately, is formulating a matrix library based on Flink which could then benefit from same query optimization techniques, for example, to reorder matrix operations to minimize computation time. Combined with the scalability of Flink, this promises some powerful potential for optimization such that the data scientist can focus on what to compute without spending too much time on how to do it.

Find out more on their project page: Stratosphere

Posted by Mikio L. Braun at 2014-06-10 17:05:00 +0000

Realtime personalization and recommendation with stream mining

What is Scalable Machine Learning?