Wednesday, September 05, 2012

Levels of Abstraction in Big Data

File under: machine room

Recently we’ve spoken to a number of people to find out how our real-time stuff could be of use to them. Those were all very interesting conversations, but sometimes they turned more into discussions of how to fit our approach into an existing infrastructure than into discussions of the merits of our approach for a given application.

When discussing software, the question of how to fit something onto an existing framework is always important, but there are two reasons why I think that’s even more the case with Big Data.

In the end, it’s a question of what kind of abstraction a piece of software provides. You might be familiar with the idea that the language used to describe something influences the way you think about a problem. That is even more true in programming, where the way a piece of software models some part of reality is not just theory but puts hard constraints on what you can do: some things are easier to express than others, some require workarounds, and other things might be downright impossible.

While this is certainly true of any type of software, I think the consequences are even more severe in the area of Big Data, because the abstractions there are often quite new and potentially imperfect, and because there is considerably more technical lock-in to a given solution.

Many of the tools like Hadoop or the NoSQL databases are quite new and are still exploring concepts and ways to describe operations well. It’s not as if their interfaces had been honed and polished for years until they converged on a sweet spot; secondary indices, for example, have been missing from Cassandra for quite some time. Likewise, whether a feature gets added is often driven more by whether it is technically feasible than by whether it actually makes sense. This often means that you are forced to model your problems in ways which are inflexible and not suited to the problem at hand. (Of course, this is not special to Big Data. Implementing neural networks on a SQL database might be feasible, but it is probably not the most practical way to do it.)
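To make that parenthetical concrete, here is a minimal sketch (my own illustration, not taken from any actual system) of what it means to push the core operation of a neural network, a matrix product, into a relational abstraction: the matrices are stored as (row, col, value) triples and the product becomes a join plus a group-by. It works, but it is clearly not the natural way to think about the computation.

```python
# Sketch: a matrix product forced into a relational abstraction.
# Matrices are stored as sparse (row, col, value) triples in SQL tables;
# the product C = A * B becomes a join on the inner index and a SUM per cell.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (i INTEGER, j INTEGER, v REAL)")
cur.execute("CREATE TABLE b (i INTEGER, j INTEGER, v REAL)")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] stored as triples.
cur.executemany("INSERT INTO a VALUES (?, ?, ?)",
                [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
cur.executemany("INSERT INTO b VALUES (?, ?, ?)",
                [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])

cur.execute("""
    SELECT a.i, b.j, SUM(a.v * b.v)
    FROM a JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
""")
print(cur.fetchall())  # [(0, 0, 19.0), (0, 1, 22.0), (1, 0, 43.0), (1, 1, 50.0)]
```

A few lines of SQL do the job, but every further operation (nonlinearities, weight updates during training) would need the same kind of contortion, which is exactly the mismatch between problem and abstraction I mean here.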

In particular for Big Data, exchanging one solution for another can be quite hard. You’ve already invested in your infrastructure, all your data is in there, your monitoring is tailored to your compute cluster, and so on. And few have the resources to run two different clusters in parallel. Also, since the field is quite new and there is little standardization, switching between frameworks is impossible without rewriting core functionality.

You might ask why you should care. After all, the technical properties of the software you are using are a reality you have to deal with. But I think being too focused on how your current solution views the world might become a problem when you try to explore other approaches to doing interesting stuff with your data, and other ways to scale your computations.

So what can we do about this? Not much, but we can try to keep in mind that there are different ways to think about algorithms and to represent them: as map-reduce jobs over key-value pairs, as queries over relational tables, or as operations on vectors and matrices, to name a few classical examples.

In the end, I think you should choose the abstraction which lets you express the algorithm in the way that is simplest to think about, instead of letting your existing installation tell you how to do it. A nice thing about software is that you can actually map between abstractions by writing a bit of interface code. It might not always be painless, and certain operations might be more expensive than you’d expect them to be, but what you gain is flexibility in thinking about your algorithms.
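As an illustration of what such a bit of interface code could look like (a toy sketch of my own, with made-up names like map_reduce and matvec), here is a matrix-vector product, most naturally a linear-algebra operation, expressed on top of a tiny in-memory map-reduce:

```python
# Sketch: mapping the linear-algebra abstraction onto a map-reduce abstraction.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group the emitted (key, value) pairs by key,
    and reduce each group. A stand-in for a real map-reduce framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

def matvec(matrix_triples, x):
    """Compute y = A x, with A given as (row, col, value) triples and x as a dict."""
    def mapper(triple):
        row, col, value = triple
        return [(row, value * x[col])]
    return map_reduce(matrix_triples, mapper, sum)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
x = {0: 1.0, 1: 1.0}
print(matvec(A, x))  # {0: 3.0, 1: 7.0}
```

The adapter is short, and on a real cluster the grouping step would of course not be free, but the algorithm itself can still be written down in the vocabulary of matrices and vectors.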

Posted by Mikio L. Braun at 2012-09-05 21:31:00 +0200
