A year ago, I wrote a post on the real-time big data landscape, identifying different approaches to deal with real-time big data. As I saw it back then, there was sort of an evolution from database-based approaches (put all your data in, run queries), up to stream processing (one event at a time), and finally algorithmic approaches relying on stream mining algorithms, together with all kinds of performance “hacks” like parallelization, or using memory instead of disks.
In principle, this picture is still adequate in terms of the underlying mode of data processing, that is, where you store your data, whether you process it as it comes in or in a more batch-oriented fashion later on, and so on. But there is always the question of how to build systems around these approaches. And given the amount of money currently being poured into the whole Big Data company landscape, quite a lot is happening in that area.
Currently, there is a lot of convergence happening. One such example is the lambda architecture, which combines batch-oriented processing with stream processing to get both low-latency results (potentially inaccurate and incomplete) and results on the full data sets. Instead of scaling batch processing to a point where the latency is small enough, a parallel stream processing layer processes events as they come along, with both routes piping results into a shared database to provide the results for visualization or other kinds of presentation.
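To make the two routes a bit more concrete, here is a minimal sketch of the idea in Python, assuming the analysis is a simple count per key. The class `LambdaCounter` and its method names are made up for illustration; a real setup would use something like Hadoop and Storm for the two layers and an actual database for the shared views.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: a batch view plus a real-time view of the same events."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # exact, but only as fresh as the last batch run
        self.speed_view = Counter()  # covers events seen since the last batch run

    def on_event(self, key):
        # Stream route: keep the raw event and update the low-latency view right away.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch route: recompute from the full log, then drop the real-time view,
        # because everything in it is now covered by the batch view.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Serving side: merge both views to answer a query.
        return self.batch_view[key] + self.speed_view[key]


system = LambdaCounter()
for tag in ["#bigdata", "#spark", "#bigdata"]:
    system.on_event(tag)
before = system.query("#bigdata")  # answered from the speed view alone

system.run_batch()
system.on_event("#bigdata")
after = system.query("#bigdata")   # batch view plus one fresh event
```

In this toy version the batch run is instantaneous; in a real system it takes a while, which is exactly why the speed layer exists.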
Some point out that one problem with this approach is that you potentially need to have all your analytics logic in two distinct, and conceptually quite different systems. But there are systems like Apache Spark, which can run the same code in a batch fashion or near-streaming in micro-batches, or Twitter’s Scalding, which can take the same code to run on Hadoop or Storm.
Others, like LinkedIn’s Jay Kreps, ask why you can’t also use stream processing to recompute stuff in batch. Such systems can be implemented by combining a stream processing system with a system like Apache Kafka, a distributed publish/subscribe event transport layer which doubles as a database for log data by retaining data for a predefined amount of time.
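As a rough sketch of what recomputing via replay could look like: the `RetainedLog` class below is a hypothetical in-memory stand-in for a retained log, not the Kafka API, and `run_job` is an equally made-up helper. The point is only that the same streaming job can be rerun from offset 0 with new logic.

```python
class RetainedLog:
    """In-memory stand-in for a retained event log (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset=0):
        # Replay all retained events starting at the given offset.
        yield from self._events[offset:]


def run_job(log, update, offset=0):
    """Run a streaming job: feed events one at a time into an update function."""
    state = {}
    for event in log.read_from(offset):
        update(state, event)
    return state


def count_events(state, event):
    user, _amount = event
    state[user] = state.get(user, 0) + 1


def sum_amounts(state, event):
    user, amount = event
    state[user] = state.get(user, 0) + amount


log = RetainedLog()
for event in [("alice", 5), ("bob", 2), ("alice", 1)]:
    log.append(event)

counts = run_job(log, count_events)         # the original analysis
sums = run_job(log, sum_amounts)            # new logic, recomputed by replaying from offset 0
tail = run_job(log, sum_amounts, offset=2)  # or resume from a later offset
```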
These kinds of approaches make you wonder just how interchangeable streaming and map-reduce style processing really are, and whether they allow you to do the same set of operations. If you think about it, map-reduce is already very stream oriented. In classical Hadoop, both the input and the output of the map and reduce stages are presented via iterators and output pipes, so that you could in principle also stream the data through. In fact, Scalding seems to be taking advantage of exactly that.
Generally, these “functional collection” style APIs seem to be becoming quite popular, as Spark and also systems like Apache Flink use that kind of approach. If you haven’t seen this before, the syntax is very close to the set of operations you have in functional languages like Scala. The basic data type is a collection of objects and you formulate your computations in terms of operations like map, filter, groupby, reduce, but also joins.
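If you want a feeling for the style without setting up a cluster, here is a small sketch using only Python’s standard library. The event records are invented for illustration, and this is plain Python, not the Spark or Flink API, but the shape of the computation is the same.

```python
from functools import reduce
from itertools import groupby

# Made-up click-stream records, purely for illustration.
events = [
    {"user": "alice", "type": "click", "ms": 120},
    {"user": "bob", "type": "view", "ms": 300},
    {"user": "alice", "type": "click", "ms": 80},
    {"user": "bob", "type": "click", "ms": 50},
]

# filter: keep only the click events
clicks = filter(lambda e: e["type"] == "click", events)

# map: project each event to a (user, milliseconds) pair
pairs = map(lambda e: (e["user"], e["ms"]), clicks)

# groupby: itertools.groupby only groups adjacent items, so sort by key first
by_user = groupby(sorted(pairs, key=lambda p: p[0]), key=lambda p: p[0])

# reduce: sum the milliseconds within each group
total_ms = {
    user: reduce(lambda acc, p: acc + p[1], group, 0)
    for user, group in by_user
}
```

Nothing here says how or where the work is executed, which is what leaves room for the underlying system to parallelize it.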
This raises the question of what exactly streaming analytics is. For some, streaming is any kind of approach which allows you to process data in one go, without the need to go back, and also with more or less bounded resource requirements. Interestingly, this seems to naturally lead to functional collection style APIs, as illustrated in the toolz Python library, although one issue for me here is always that the functional collection style APIs imply that the computation ends at some point, when in reality, it does not.
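Here is a minimal sketch of that one-pass, bounded-memory style using plain Python generators (not toolz itself). The function `running_mean` is my own example; the input is an endless counter standing in for a real event stream.

```python
from itertools import count, islice


def running_mean(values):
    """Yield the mean of everything seen so far, keeping only two numbers of state."""
    n = 0
    mean = 0.0
    for x in values:
        n += 1
        mean += (x - mean) / n  # incremental update, no need to revisit old data
        yield mean


stream = count(1)  # endless stream 1, 2, 3, ... standing in for incoming events

# The pipeline itself never finishes; the consumer decides how much to pull.
first_five = list(islice(running_mean(stream), 5))
```

Note that calling `list(running_mean(stream))` without the `islice` would simply never return, which is the mismatch between the API style and the never-ending nature of the data.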
The other family of APIs uses a more actor-based approach. Stream processing systems like Apache Storm, Apache Samza, or even Akka use that kind of approach, where you basically define worker nodes which take in a stream of data and output another one, and you construct systems by explicitly sending messages asynchronously between those nodes. In this setting, the on-line nature of the computation is much more explicit.
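For comparison, here is a rough sketch of the actor style using threads and queues from Python’s standard library. The `Actor` class and the `STOP` sentinel are assumptions made for this example and not how Storm, Samza, or Akka actually look; they only show the explicit wiring of nodes and messages.

```python
import queue
import threading

STOP = object()  # sentinel message telling an actor to shut down


class Actor(threading.Thread):
    """A worker with an inbox; it handles one message at a time and forwards results."""

    def __init__(self, handler, downstream=None):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.handler = handler
        self.downstream = downstream

    def send(self, message):
        # Asynchronous: the sender does not wait for the message to be processed.
        self.inbox.put(message)

    def run(self):
        while True:
            message = self.inbox.get()
            if message is STOP:
                if self.downstream is not None:
                    self.downstream.send(STOP)  # propagate shutdown along the pipeline
                break
            for out in self.handler(message):
                if self.downstream is not None:
                    self.downstream.send(out)


counts = {}


def split_line(line):
    return line.lower().split()


def count_word(word):
    counts[word] = counts.get(word, 0) + 1
    return []  # last stage, nothing to forward


# Wire up the topology explicitly: splitter -> counter.
counter = Actor(count_word)
splitter = Actor(split_line, downstream=counter)
counter.start()
splitter.start()

for line in ["Storm and Samza", "Samza and Akka"]:
    splitter.send(line)
splitter.send(STOP)

splitter.join()
counter.join()
```

Even for a word count you have to decide up front where to cut the computation into nodes.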
I personally find actor-based approaches always a bit hard to work with mentally, because you have to slice up operations into different actors just to parallelize when conceptually it’s just one step. The functional collection style approach works much better here. However, you then have to rely on the underlying system being able to parallelize your computations well. Systems like Flink take ideas from query optimization in databases to attack this problem, which I think is a very promising approach.
In general, what I personally would like to see is even more convergence between the functional collection and actor-based approaches. I haven’t found too much on that but, to me, that seems like something which is bound to happen.
Concerning data input and output, I find it interesting that all of these approaches don’t deal with the question of how to get at the results of your analysis. One of the key features of real-time is that you need to get results as the data comes in, so results have to be continuously updated. This is IMHO also not modelled well in the functional collection style APIs, which imply that the function call returns once the result is computed, which never happens when you process data in an online fashion.
The answer to that problem seems to be to use your highly parallelized, low-latency computation to deal with all the data, but then periodically write out results to some fast, distributed storage layer like a Redis database and use that to query the results. It’s generally not possible to access a running stream processing system “from the side” to get at the state which is distributed somewhere in this system. While this approach is possible, it seems to me that it requires you to set up yet another distributed system just to store results.
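A minimal sketch of this pattern, with a plain dict standing in for the external store: the class `SnapshottingCounter` is hypothetical, and I flush after a fixed number of events instead of on a timer to keep the example deterministic.

```python
class SnapshottingCounter:
    """Streaming count whose internal state is only visible through periodic snapshots."""

    def __init__(self, store, every):
        self.store = store  # stand-in for an external store such as Redis
        self.every = every  # write a snapshot after this many events
        self._state = {}    # internal state, not reachable by readers
        self._seen = 0

    def on_event(self, key):
        self._state[key] = self._state.get(key, 0) + 1
        self._seen += 1
        if self._seen % self.every == 0:
            # Copy the current counts out to the store for others to query.
            self.store.update(self._state)


store = {}
job = SnapshottingCounter(store, every=2)
for key in ["a", "b", "a", "a", "b"]:
    job.on_event(key)

visible = dict(store)  # what a reader sees: the state as of the fourth event
```

Readers only ever see the last snapshot, so the results lag behind the computation by up to one snapshot interval.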
Concerning data input, there’s of course the usual coverage of all possible kinds of input, from REST, UDP packets, messaging frameworks, log files, and so on. I currently find Kafka quite interesting, because it seems like a good abstraction that combines a bunch of log files with a log database. You get a distributed set of log data together with the ability to go back in time and replay data. In a way, this is exactly what we had been doing with TWIMPACT when analyzing Twitter data.
Which brings me back to streamdrill (you knew this was coming, right?), less because I need to tell you just how great it is, but because it sort of defines where I stand in this landscape myself.
So far, we’ve mainly focussed on the core processing engine. The question of getting the data out has been answered quite differently from the other approaches, as you can directly access the results of your computation by querying the internal state via a REST interface. For getting historical data, you still need to push the data to a storage backend, though. Directly exposing the internal state of the computation is such a big departure from other approaches that I don’t see how you could easily retrofit streamdrill on top of Spark or Akka, even though it would be great to get scaling capabilities that way.
I think the most potential for improvement with streamdrill is the part where you encode the actual computation. So far, streamdrill is written and deployed as a more or less classical Jersey webapp, which means that everything is very event-driven. We’re trying to separate functional modules from the REST endpoint code, but still it would need a fair understanding of Java webapps to write anything yourself (and I honestly don’t see data scientists doing that). Here, a more high-level, “functional collection”-style approach would definitely be better.
Posted by Mikio L. Braun at 2014-08-11 13:51:00 +0200