The future of Big Data (according to Stratosphere/Flink)
The DIMA group at TU Berlin has a very interesting project which at first glance looks pretty similar to Apache Spark, the much-hyped MapReduce successor, but under the hood it has some interesting new ideas. Alexander Alexandrov recently gave an overview talk at TU Berlin, which I'll try to summarize in the following.
The project, which was originally conceived as Stratosphere but recently became an Apache incubator project and is undergoing a renaming to Flink (German for nimble, swift, speedy), is an in-memory, scalable data processing framework. Like Spark, the exposed API looks like a functional collection API. That is, your data essentially consists of sequences and lists of any kind of data, and you have methods like map, groupBy, or reduce to perform computations on these data sets, which are then automatically scaled out to the cluster.
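To make this concrete, here is a rough sketch of what such a pipeline looks like in Flink's Scala API (a word count written from memory, so take the exact method names with a grain of salt):

```scala
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // A data set is just a (distributed) collection of elements.
    val lines = env.fromElements("to be or not to be", "that is the question")

    // map/groupBy/reduce-style operations describe *what* to compute;
    // how it is executed on the cluster is left to the system.
    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    counts.print()
  }
}
```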
The main difference to Spark is that Flink takes an approach known from databases to optimize queries in several ways. Flink calls this a declarative approach to data analysis, meaning that you don't have to write down in painstaking detail how the data is to be processed, but instead describe in a higher-level fashion what you want to compute. It's a bit like when your kid says "I'm thirsty" instead of politely asking you to hand over the juice.
This idea is anything but new, but is a cornerstone of relational database technology. There, an SQL query describes in a high-level fashion what you want to compute, and the database itself then decides how exactly to perform that query, using the availability of secondary indices, statistics about relationships, the sizes of the tables involved, and so on, to find a plan which executes in the shortest amount of time.
In databases, this optimization step is usually guided by estimating the cost of performing a certain sequence of operations. Then, one can optimize the query in a number of ways:
- Operations can be changed using equivalent expressions or by reordering them. For example, if some filtering or selection operation is involved, one can try to perform this filtering early on (see the sketch after this list).
- Different alternatives exist to perform the same operation, depending on the size of the data, the existence of secondary indices, and so on. Databases often use statistics to estimate the cost of each alternative.
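To illustrate the first point, here is a minimal Scala sketch (plain collections, nothing Flink-specific) of pushing a filter below a join. Both variants compute the same result, but the second one feeds far fewer records into the expensive join:

```scala
case class Order(id: Int, customerId: Int, amount: Double)
case class Customer(id: Int, country: String)

// Naive plan: join everything first, filter afterwards.
def joinThenFilter(orders: Seq[Order], customers: Seq[Customer]): Seq[(Order, Customer)] =
  orders
    .flatMap(o => customers.filter(_.id == o.customerId).map(c => (o, c)))
    .filter { case (o, _) => o.amount > 100.0 }

// Equivalent plan: filter early, so the join sees less data.
def filterThenJoin(orders: Seq[Order], customers: Seq[Customer]): Seq[(Order, Customer)] =
  orders
    .filter(_.amount > 100.0)
    .flatMap(o => customers.filter(_.id == o.customerId).map(c => (o, c)))
```

A query optimizer performs exactly this kind of rewriting automatically, based on its cost estimates.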
So far so good. The goal of the Flink project is to extend this approach to scalable data processing. There, even more dimensions of optimization potential exist, in particular minimizing the amount of data shuffling. Sometimes you can get faster execution times if you presort your data on the cluster so that subsequent operations can be performed locally. However, even more time can be saved if the query can be reformulated such that you don't need to move the data at all.
This means that when you write down a data processing pipeline in Flink, it actually turns it into an intermediate representation and optimizes the query further, reordering operations and selecting the appropriate algorithms to achieve the best performance.
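You can actually peek at what the optimizer comes up with. If I remember the API correctly, the execution environment can dump the chosen plan as JSON before you run the job:

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

env.fromElements("to be or not to be")
  .flatMap(_.split(" "))
  .map((_, 1))
  .groupBy(0)
  .sum(1)
  .writeAsText("/tmp/word-counts")   // a sink, so the plan is complete

// Dump the plan the optimizer has chosen (operator order, local strategies,
// shipping strategies) as JSON -- this is what actually gets executed,
// not a literal replay of the program as written above.
println(env.getExecutionPlan())
```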
In addition, Flink also provides higher-level abstractions for iterations (unlike Spark, so far). This makes sense, because a more expressive query language means you have more potential for global optimization.
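As a rough illustration (again written from memory against the Scala API), a bulk iteration looks like this; the whole loop is part of the data flow the optimizer sees, rather than being driven from outside the system:

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy example: halve a value ten times. The step function is a normal
// data flow, and the iteration as a whole is visible to the optimizer.
val initial = env.fromElements(1.0)
val result = initial.iterate(10) { current =>
  current.map(_ / 2.0)
}

result.print()
```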
Well, at least in principle. Even for databases, these optimization problems are very hard, and heuristics must be used to achieve good results. In a way, the challenge here is to have a powerful query language without ending up with a fully general-purpose programming language, at which point one is dealing with the very general problem of code optimization.
But there is another area where I think Flink is quite interesting: such query optimization systems are always based on a specific data model. For databases, this is typically the familiar collection of typed tuples, but this is another area where you can in principle extend the system. One such project, which is unfortunately still in its infancy, is a matrix library built on Flink, which could then benefit from the same query optimization techniques, for example to reorder matrix operations to minimize computation time. Combined with the scalability of Flink, this promises some powerful optimization potential, such that the data scientist can focus on what to compute without spending too much time on how to do it.
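As a back-of-the-envelope example of why such reordering matters, here is a small Scala snippet comparing the multiplication costs of the two ways to parenthesize A·B·C (shapes chosen for illustration):

```scala
// Multiplying an (m x n) matrix by an (n x p) matrix costs roughly
// m * n * p scalar multiplications.
def cost(m: Long, n: Long, p: Long): Long = m * n * p

// Shapes: A is 1000x10, B is 10x1000, C is 1000x10.
val leftToRight = cost(1000, 10, 1000) + cost(1000, 1000, 10) // (A*B)*C = 20,000,000
val rightToLeft = cost(10, 1000, 10)   + cost(1000, 10, 10)   // A*(B*C) =    200,000

println(s"(A*B)*C: $leftToRight multiplications, A*(B*C): $rightToLeft multiplications")
```

Both orders compute the same matrix, but the second is a hundred times cheaper; an optimizer with a cost model can pick it automatically.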
Find out more on their project page: Stratosphere
Comments (3)
Mikio, Spark comes with Spark SQL, which internally has an optimizer (Catalyst) that produces logical and physical plans. This is the same approach most DBMSs take. It can indeed take a high-level declarative query (specified in the Spark SQL DSL or in SQL), optimize it, and then execute it. So the differences you point to don't seem that obvious to me.
Hi Michael,
I was referring to the actions on the RDDs like map, grouping, joins, and so on. There, and correct me if I'm wrong, Spark is executing them in the way they are described, whereas Flink is already doing "query" optimization on that level.
I'm expecting a lot of convergence in this area. I don't think there is a technical reason why Spark couldn't add that level of optimization later on. I've talked quite a bit with the guys behind Stratosphere, and the idea to do that amount of optimization is at the very foundation of what they are trying to accomplish, so I think it makes a difference here.
Hope this clears it up a bit!
M
Spark does optimizations on maps (e.g. map fusion). But more importantly, Spark SQL has a DSL that lets you do things like where(), select(), etc., and it creates SchemaRDDs, which are then optimized by Catalyst (which has logical plans as well as physical plans):
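Something along these lines (a sketch from memory of the Spark 1.x SchemaRDD API, so the exact incantations may be off):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setAppName("dsl-example"))
val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit conversion RDD -> SchemaRDD

val people = sc.parallelize(Seq(Person("Ada", 36), Person("Bob", 9)))

// The DSL builds a logical plan; Catalyst optimizes it into a physical plan
// before anything is executed.
val adults = people.where('age > 18).select('name)
adults.collect().foreach(println)
```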