The future of Big Data (according to Stratosphere/Flink)

The DIMA group at TU Berlin has a very interesting project which at first looks pretty similar to Apache Spark, the much-hyped MapReduce successor, but under the hood it has some new ideas of its own. Alexander Alexandrov recently gave an overview talk at TU Berlin, which I’ll try to summarize in the following.

The project, which was originally conceived as Stratosphere but was recently accepted as an incubator project at Apache and is being renamed to Flink (German for nimble, swift, speedy), is a scalable in-memory data processing framework. Like Spark, the exposed API looks like a functional collection API. That is, your data is essentially sequences and lists of any kind of data, and you have methods like map, groupBy, or reduce to perform computations on these data sets, which will then automatically be scaled out to the cluster.

The main difference to Spark is that Flink takes an approach known from databases to optimize queries in several ways. Flink calls this a declarative approach to data analysis, meaning that you don’t have to write down in painstaking detail how the data is to be processed, but instead describe in a higher-level fashion what you want to compute. It’s a bit like when your kid says “I’m thirsty”, instead of asking you politely to hand over the juice.

This idea itself is anything but new; it is a cornerstone of relational database technology. There, an SQL query describes in a high-level fashion what you want to compute, and then the database itself decides exactly how to perform that query, using the availability of secondary indices, statistics about relationships, the size of the involved tables, and so on, to find a solution which executes in the shortest amount of time.

In databases, this optimization step is usually guided by estimating the cost of performing a certain sequence of operations. Then, one can optimize the query in a number of ways:

  • Operations can be changed using equivalent expressions or by reordering operations. For example, if there is some filtering or selection operation involved, one can try to perform this filtering early on.

  • Different alternatives exist to perform the same operation, depending on the size of the data, the existence of secondary indices, and so on. Databases often use statistics to estimate the cost of each alternative.
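The first kind of rewrite can be sketched in a few lines of plain Python. This is only a toy illustration of pushing a filter below a join, with made-up tables and a naive hash join; it is not how Flink or any particular database implements it.

```python
def join(left, right, key):
    """Naive hash join of two lists of dicts on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Made-up data: 1000 users (every tenth one from "DE") and 5000 clicks
users = [{"user": u, "country": "DE" if u % 10 == 0 else "US"} for u in range(1000)]
clicks = [{"user": i % 1000, "page": i} for i in range(5000)]

# Plan A: join everything first, filter afterwards
joined_a = join(clicks, users, "user")
result_a = [row for row in joined_a if row["country"] == "DE"]

# Plan B: push the filter below the join, so the join sees fewer rows
german_users = [u for u in users if u["country"] == "DE"]
result_b = join(clicks, german_users, "user")

print(len(joined_a), len(result_b), result_a == result_b)  # prints: 5000 500 True
```

Both plans return the same rows, but plan A materializes 5000 joined rows before throwing most of them away, while plan B never produces more than the 500 it needs. A cost-based optimizer tries to pick plan B without you having to ask for it.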

So far so good. Now the goal of the Flink project is to extend this approach to scalable data processing. There, even more dimensions of optimization potential exist, in particular minimizing the amount of data shuffling. Sometimes, you can get faster execution times if you presort your data on your cluster, so that subsequent operations can be performed locally. However, even more time could be saved if the query can be formulated such that you don’t need to move the data at all.

This means that when you write down a data processing pipeline in Flink, it will actually turn that into an intermediate representation and optimize the query further, using reorderings of operations, and selecting the appropriate algorithms to achieve the best performance.

In addition, Flink also provides higher level abstractions for iterations (unlike Spark so far). This makes sense, because a more expressive query language means you have more potential for global optimization.

Well, at least in principle. Already for databases, these optimization problems are very hard, and heuristics must be used to achieve good results. In a way, the challenge here is to have a powerful query language without ending up with a fully general-purpose programming language, at which point one is dealing with very general questions of code optimization.

But there is another area where I think that Flink is quite interesting: such query optimization systems are always based on a specific data model. For databases, these are typically the kinds of collections of typed tuples everyone knows, but this is another area where you can in principle extend the system. One such project, which is unfortunately still in its infancy, is formulating a matrix library based on Flink which could then benefit from the same query optimization techniques, for example, to reorder matrix operations to minimize computation time. Combined with the scalability of Flink, this promises some powerful potential for optimization such that the data scientist can focus on what to compute without spending too much time on how to do it.
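To see why reordering matrix operations matters, here is a small cost calculation in Python. It assumes the textbook cost model where multiplying an m x k matrix by a k x n matrix takes m*k*n scalar multiplications; the function name and the shapes are my own choices for illustration, not anything taken from the Flink matrix library.

```python
def matmul_cost(left_shape, right_shape):
    """Return (scalar multiplications, result shape) for a naive matrix product."""
    m, k = left_shape
    k2, n = right_shape
    assert k == k2, "inner dimensions must match"
    return m * k * n, (m, n)


# Shapes only: two square matrices and a column vector
A, B, v = (1000, 1000), (1000, 1000), (1000, 1)

# Evaluate (A B) v: the expensive matrix-matrix product comes first
cost_ab, shape_ab = matmul_cost(A, B)
cost_ab_v, _ = matmul_cost(shape_ab, v)
left_first = cost_ab + cost_ab_v

# Evaluate A (B v): two cheap matrix-vector products
cost_bv, shape_bv = matmul_cost(B, v)
cost_a_bv, _ = matmul_cost(A, shape_bv)
right_first = cost_bv + cost_a_bv

print(left_first, right_first)  # prints: 1001000000 2000000
```

Both orders give the same result mathematically, but under this cost model the second one is about 500 times cheaper. This is exactly the kind of decision an optimizer could take off your hands.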

Find out more on their project page: Stratosphere

Realtime personalization and recommendation with stream mining

My talk at Berlin Buzzwords

Last Tuesday, I gave a talk at this year’s Berlin Buzzwords conference on using stream mining algorithms to efficiently store information extracted from user behavior, so that personalization and recommendation can be done effectively on just a single computer, which is of course the key idea behind streamdrill.

If you’ve been following my talks, you’ll probably recognize a lot of stuff I’ve talked about before, but what is new in this talk is that I tried to take the next step from simply talking about Heavy Hitters and Count-Min Sketches to using these data structures as approximate storage for all kinds of analytics-related data like counts, profiles, or even sparse matrices, as they occur in recommendation algorithms.
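For readers who haven’t seen one, a Count-Min Sketch can be written down in a few lines. The following is a minimal Python sketch of the general technique, not streamdrill’s implementation; the default width and depth and the SHA-256 based hashing are arbitrary choices I made to keep the example deterministic.

```python
import hashlib


class CountMinSketch:
    """Approximate counter table using a fixed number of cells."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a row-specific hash of the item
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
for user in ["alice"] * 100 + ["bob"] * 20 + ["carol"] * 3:
    sketch.add(user)

print(sketch.estimate("alice") >= 100)  # prints: True
```

The memory use is fixed at width times depth counters no matter how many distinct items you see. The estimate can be too high when items collide, but it is never below the true count, which is what makes it usable as an approximate store for counts.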

I think reformulating our approach as basically an efficient approximate data structure also helped to steer the discussion away from comparing streamdrill to other big data frameworks (“Can’t you just do that in Storm?” “Define ‘just’.”). As I said in the talk, the question is not whether you can do it in Big Data Framework X, because you probably could. I have started to look at it from the other direction: we did not use any Big Data framework and were still able to achieve some serious performance numbers.

In any case, below are the slides and the video from the talk.

Attending AWSSummit 2014 in Berlin

And thoughts on infrastructure vs data science

Last week I attended the AWS Summit in Berlin. I honestly hadn’t heard about these events before, but a friend was going and asked me whether I would join, so I said “yes”, in particular since it was free.

I was pretty surprised by the size of the whole event: probably more than a thousand people were listening to Werner Vogels’s keynote and four tracks of talks on all aspects of Amazon’s web services. The location itself (the Postbahnhof in the eastern part of Berlin) was actually pretty bad. Seating capacity was insufficient, people barely fit into the keynote, and later on people often had to be turned away because rooms were filled to capacity. Initially, they were also checking all the badges with a very low-throughput handheld QR code reader, and even later people were often stuck in the narrow corridors of the building. So, ironically, there were a lot of bandwidth problems in the real world, and little of the elastic scaling AWS prides itself on.

The event kicked off with a nice keynote by Werner Vogels, CTO of Amazon. What I found interesting, though, was that they were still trying very hard to sell the benefits of moving to the cloud. By now I think that it’s pretty clear to everyone what the advantages are, like being able to scale resources up and down quickly, or not having to worry about buying, hosting, and maintaining physical servers. Other issues like privacy were stressed as well (very obviously to address concerns about the NSA or other people spying on cloud infrastructure). Then again, I think in reality the issues are not as clear-cut and there sometimes are good reasons why you don’t want to move all your stuff into the cloud, so one has to make a balanced assessment.

There were also egregious claims like AWS being a key factor in lowering the failure rate of software projects. I don’t think buying too many or too few servers is really the single reason for failure. What about misspecification, miscommunication, and underestimated complexity? At another point, Vogels explained how scale effects allowed Amazon to lower prices continually (you lower prices, you get more customers, you expand your hardware, you get economies of scale, you can lower prices, etc.), whereas I think that advances in hardware efficiency also play a key role here.

I was particularly interested in learning about Amazon Kinesis. Based on the documentation (“real-time this, real-time that”) I was under the impression that it was a Storm-like stream processing system, but then I learned that it is mostly infrastructure for ingesting huge amounts of event data in a scalable fashion into a buffer which holds the data for later analysis. So it’s really more a scalable, persistent, robust transport layer than anything else. You can have multiple workers consuming the Kinesis stream, for example, by hooking it up to a Storm topology, but at its core, it’s only about transport. The unit of scale is a shard, where a shard is able to handle 1000 transactions per second with 1MB/s of incoming and 2MB/s of outgoing data, which I thought wasn’t really that much.

Just to put this into perspective: for one of our projects with streamdrill (you knew this would be coming, sorry about that, but it’s really something where I can talk from my own practical experience), we’re easily consuming up to 10k events per second, with events being up to about 1kB, on a single machine, which is roughly ten times the throughput of a single shard in the clustered solution. You can very clearly see the cost of scaling out: first of all, you have to accept a performance hit which comes from the whole network communication and coordination overhead.
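The back-of-the-envelope calculation behind that comparison looks like this in Python. It only uses the per-shard limits quoted above (1000 transactions per second, 1MB/s incoming); the function name is made up, and I am assuming 1kB = 1000 bytes and 1MB = 1,000,000 bytes to keep the arithmetic simple.

```python
# Per-shard ingest limits as quoted in the post
SHARD_RECORDS_PER_SEC = 1000
SHARD_INGEST_BYTES_PER_SEC = 1_000_000  # assumption: 1MB taken as 10^6 bytes


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def shards_needed(events_per_sec, event_bytes):
    """Shards required to ingest a stream, limited by record rate and bandwidth."""
    by_records = ceil_div(events_per_sec, SHARD_RECORDS_PER_SEC)
    by_bandwidth = ceil_div(events_per_sec * event_bytes, SHARD_INGEST_BYTES_PER_SEC)
    return max(by_records, by_bandwidth)


# 10k events per second at about 1kB each, as in the streamdrill project above
print(shards_needed(10_000, 1_000))  # prints: 10
```

So under these assumptions you would need about ten shards just to transport what the single machine consumes, and that is before any analysis has happened.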

What AWS and many others are doing is constructing building blocks for infrastructure. Then you can put Kinesis, Storm, and S3 together to get a scalable analysis system.

On the other hand, an integrated solution can often be much faster, as in our case with streamdrill, which combines data management, analysis, and storage backend (in-memory). If you rely on existing services, you may end up in a situation where you have lost the opportunity to do important optimizations across modules.

In a way, modularization is the standard game in programming: you try to isolate services or routines you need often, building abstractions in order to decouple parts of your program. If done right, you have something with high reuse value. I think all the standard computer science algorithms and data structures fall into this category. Cloud computing, on the other hand, is a pretty new topic, and people are basically making up abstractions and services as they go along, so you don’t always end up with a set of services which will lead to maximal performance. In a way, these services give you a toolbox, but if all you have are pipes, there are things you cannot build because you also need other building blocks, like filters.

Interestingly, when it comes to data analysis, I think that there are other problems with this approach. As I’ve discussed elsewhere, we’re not yet at the point where you can just pick a data science algorithm and use it without knowing what you are doing. Machine learning and data science are not yet just about building infrastructure and abstractions, but also still about finding out how to properly solve the problems at hand.