Attending AWS Summit 2014 in Berlin

And thoughts on infrastructure vs. data science

Tuesday, May 20, 2014

Last week I attended the AWS Summit in Berlin. I honestly hadn’t heard about these events before, but a friend was going and asked me whether I would join, so I said “yes”, in particular since it was free.

I was pretty surprised by the size of the whole event: probably more than a thousand people were listening to Werner Vogels’ keynote and to four tracks of talks on all aspects of Amazon’s web services. The location itself (the Postbahnhof in the eastern part of Berlin) was actually pretty bad. Seating capacity was insufficient, people barely fit into the keynote, and later on people often had to be turned away because rooms were filled to capacity. Initially, staff were also checking every badge with very low-throughput handheld QR code readers, and even later people were often stuck in the narrow corridors of the building. So, ironically, plenty of bandwidth problems in the real world, and little of the elastic scaling AWS prides itself on.

The event kicked off with a nice keynote by Werner Vogels, CTO of Amazon. What I found interesting, though, was that they were still trying very hard to sell the benefits of moving to the cloud. By now I think it’s pretty clear to everyone what the advantages are, like being able to scale resources up and down quickly, or not having to worry about buying, hosting, and maintaining physical servers. Other issues like privacy were stressed as well (very obviously to address concerns about the NSA or others spying on cloud infrastructure). Then again, I think in reality the issues are not as clear-cut, and there are sometimes good reasons why you don’t want to move all your stuff into the cloud, so one has to make a balanced assessment.

There were also egregious claims, like AWS being a key factor in lowering the failure rate of software projects. I don’t think buying too many or too few servers is really the main reason projects fail; what about misspecification, miscommunication, and underestimated complexity? At another point, Vogels explained how scale effects allowed Amazon to lower prices continually (you lower prices, you get more customers, you expand your hardware, you get economies of scale, you can lower prices again, and so on), whereas I think that advances in hardware efficiency also play a key role here.

I was particularly interested in learning about Amazon Kinesis. Based on the documentation (“real-time this, real-time that”) I was under the impression that it was a Storm-like stream processing system, but then I learned that it is mostly infrastructure for ingesting huge amounts of event data in a scalable fashion into a buffer that holds the data for later analysis. So it’s really more a scalable, persistent, robust transport layer than anything else. You can have multiple workers consuming the Kinesis stream, for example by hooking it up to a Storm topology, but at its core it’s only about transport. The unit of scale is a shard, where a shard can handle 1000 transactions per second, with 1 MB/s of incoming and 2 MB/s of outgoing data, which I thought wasn’t really that much.
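To make the “only transport” point concrete, here is roughly what putting an event into Kinesis looks like, as a minimal sketch using the boto3 Python client; the stream name, region, and payload are made up for illustration. Note that nothing in this call knows anything about analysis; you hand the service a blob of bytes and a partition key, and that’s it.

    import json
    import boto3

    # Minimal producer sketch: Kinesis only buffers and transports the event;
    # whatever analysis happens is up to the consumers of the stream.
    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    event = {"user": "alice", "action": "click", "ts": 1400577900}

    kinesis.put_record(
        StreamName="clickstream",                # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user"],              # decides which shard gets the record
    )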

Just to put this into perspective: for one of our projects with streamdrill (you knew this was coming, sorry about that, but it’s really something where I can talk from my own practical experience), we’re easily consuming up to 10k events per second, with events being up to about 1 kB each, on a single machine, which gives roughly ten-fold higher throughput than the clustered solution. You can see the cost of scaling out very clearly here: first of all you have to accept a performance hit that comes from all the network communication and coordination overhead.
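A quick back-of-envelope calculation with the numbers above shows how many shards you would need just to keep up with that single machine, before any coordination overhead:

    # Back-of-envelope comparison, using the numbers from the text (illustrative only).
    single_machine_events_per_s = 10000   # streamdrill on one machine
    event_size_kb = 1.0                   # events up to about 1 kB each
    shard_records_per_s = 1000            # Kinesis write limit per shard
    shard_ingress_mb_per_s = 1.0          # Kinesis ingress limit per shard

    shards_by_record_rate = single_machine_events_per_s / shard_records_per_s
    shards_by_data_volume = single_machine_events_per_s * event_size_kb / 1024 / shard_ingress_mb_per_s

    print("shards needed by record rate: %.0f" % shards_by_record_rate)   # 10
    print("shards needed by data volume: %.1f" % shards_by_data_volume)   # ~9.8

So roughly ten shards on either dimension, just to match the ingestion rate of one box.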

What AWS and many others are doing is constructing building blocks for infrastructure. You can then put Kinesis, Storm, and S3 together to get a scalable analysis system.
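On the consuming side, each worker (or a Storm spout feeding a topology) basically polls a shard iterator and hands the records on. A minimal single-shard sketch, again using boto3 with the same hypothetical names, could look like this:

    import time
    import boto3

    # Minimal single-shard consumer sketch; a real worker (or a Storm spout)
    # would handle multiple shards, resharding, and checkpointing.
    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    shard_iterator = kinesis.get_shard_iterator(
        StreamName="clickstream",                # hypothetical stream name
        ShardId="shardId-000000000000",
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    while True:
        result = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
        for record in result["Records"]:
            print(record["Data"])                # stand-in for the actual analysis step
        shard_iterator = result["NextShardIterator"]
        time.sleep(1)                            # stay well below the per-shard read limits

The point is that the interesting part, the analysis, still lives entirely outside of Kinesis; the service only gives you the pipe.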

On the other hand, an integrated solution can often be much faster, as in our case with streamdrill, which combines data management, analysis, and the (in-memory) storage backend. If you build on existing services, you may end up in a situation where you have lost the opportunity to do important optimizations across module boundaries.

In a way, modularization is the standard game in programming: you try to isolate services or routines you need often, building abstractions in order to decouple parts of your program. If done right, you get something with high reuse value. I think all the standard computer science algorithms and data structures fall into this category. Cloud computing, on the other hand, is a pretty new topic, and people are basically making up abstractions and services as they go along, so you don’t always end up with a set of services that leads to maximal performance. In a way, these services give you a toolbox, but if all you have are pipes, there are things you cannot build when you also need other building blocks, like filters.

Interestingly, when it comes to data analysis, I think there are additional problems with this approach. As I’ve discussed elsewhere, we’re not yet at the point where you can just pick a data science algorithm and use it without knowing what you’re doing. Machine learning and data science are not yet just about building infrastructure and abstractions; they are still about finding out how to properly solve the problems in the first place.

Posted by Mikio L. Braun at 2014-05-20 17:45:00 +0000
