If you’ve ever listened to a NoSQL talk, you’ve probably come across the CAP theorem. The argument usually goes like this:
To be honest, to me this always looked like a poor excuse to avoid actually discussing the design decisions of some NoSQL database. It’s probably just me, but I much prefer at least an attempt at an unbiased analysis of the pros and cons so that I can make an informed decision whether it fits my needs or not. But pulling this theorem out of the hat is like saying “we don’t even need to discuss this, because this theorem says it’s impossible, ok!”
While searching for discussions of the CAP theorem, I found this excellent (but lengthy) article by Eric Brewer, one of the original authors of the CAP theorem: CAP Twelve Years Later: How the “Rules” Have Changed.
Here is my summary:
First of all, the interpretation that the CAP theorem says “you can only have 2 out of 3” is misleading. It’s not like the original proof discussed all possible choices and showed that you can have only 2 out of 3.
Instead, the original proof discusses the following situation: Say you have a distributed system which is in a consistent state (whatever that means), and now there is a __P__artition of the system, either an actual network failure, or some other way in which machines cannot talk to one another anymore.
Now consider what options you have when there is a write request. You could wait for the partition to end in order to make sure that your system stays in a consistent state (thereby sacrificing __A__vailability), or you could do the update partially (thereby sacrificing __C__onsistency). So in case of P you can have either C or A, but not both.
Note that there is really no way in which you could “choose P”, it was always about how to handle partitions (which are often not really partitions, but timeouts), and that includes how to detect partitions, how to behave when you are in a “partition state”, and how to bring the system back to a consistent state after a partition.
The article stresses that these are not binary decisions, but that there is rather a whole spectrum of possible actions and strategies to choose from. It’s not about saying “I can’t have consistency and availability, so I’ll just forget about consistency”, it’s about saying “in case of a failure, availability is more important to me, therefore I will accept temporary inconsistencies, and implement strategies to clean up afterwards”.
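Just to make that concrete, here is a toy sketch in Python of what that decision looks like inside a write handler. Everything here (the replica class, the `prefer` switch) is made up for illustration; the point is only that “sacrificing consistency” really means accepting the partial write and remembering what has to be repaired later.

```python
class PartitionError(Exception):
    """Raised when a replica cannot be reached (simulating a partition)."""

class DictReplica:
    """In-memory stand-in for a replica node (made up for this example)."""
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}
    def write(self, key, value):
        if not self.reachable:
            raise PartitionError("replica unreachable")
        self.data[key] = value
    def rollback(self, key):
        self.data.pop(key, None)

def handle_write(key, value, replicas, prefer="availability"):
    """Apply a write to all replicas and decide what to do if some are cut off."""
    acked, unreachable = [], []
    for replica in replicas:
        try:
            replica.write(key, value)
            acked.append(replica)
        except PartitionError:
            unreachable.append(replica)

    if not unreachable:
        return {"status": "committed"}

    if prefer == "consistency":
        # Sacrifice availability: undo and refuse the write until the partition heals.
        for replica in acked:
            replica.rollback(key)
        raise PartitionError("write rejected to preserve consistency")

    # Sacrifice consistency: accept the partial write now and remember
    # which replicas need to be repaired once the partition is over.
    return {"status": "accepted", "needs_repair": unreachable}

# One of three replicas is cut off by a partition:
replicas = [DictReplica(), DictReplica(), DictReplica(reachable=False)]
print(handle_write("x", 42, replicas, prefer="availability"))
```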
When you look at it that way, you get a much clearer picture of how a database like Cassandra fits into this, and how its read repair and hinted handoff features work to regain consistency, although in a very lax (and eventual) way.
But it also becomes clear that it’s simply not true that you cannot have distributed databases which are highly available and still come with consistency guarantees. The article goes on to discuss recent research results which try to achieve exactly that: strategies to minimize the impact of a partition on availability and consistency, and ways to re-establish consistency after a partition (also in the broader database sense of having consistent cross-references between tables and satisfying other invariants).
So the next time someone tells you he doesn’t care about consistency because of the CAP theorem, ask him how he chooses P, and how he deals with the detection, handling, and cleanup of partitions.
As you’ve probably heard by now, Google is shutting down Google Reader on July 1, 2013. Reactions are mixed, ranging from people who say that RSS is dead anyway to people who get all their info from other services like prismatic these days.
RSS was dead years ago #getoverit
— Nick Halstead (@nik) March 14, 2013
Google Reader retiring is of no impact 2 me. I moved to @prismatic quite some time back. It's time for @prismatic 2 pull up 2 the next level
— Debasish Ghosh (@debasishg) March 14, 2013
Meanwhile, web sites are compiling lists of Google Reader alternatives like feedly, flipboard, or newsblur.
So obviously, thinkberg, who is a big fan of Reader, and I had a lot to discuss over lunch. So what is our take on this?
I agree that RSS is probably much too technical for the average user, but what people seem to forget about RSS is that it’s half of the open, decentralized social network framework guys like App.net or diaspora are trying to build. You can host your own content, you can subscribe to content no matter where it’s hosted, and you can assemble your own timeline. Also, RSS is in much wider use than you’d probably think: practically all news sites and blogs support RSS feeds. The obvious exceptions are the big social network sites like Google+ and Twitter (although you can export real-time searches as RSS feeds, which is pretty nice), because they want you to spend time on their site while they show you some ads.
It’s not perfect, of course: there are no comments or discussions (but tumblr is also much less interactive than facebook, for example), and it’s pull, not push. But the openness and extensibility (in the sense that you could hook up your coffee machine if you want to) are indisputable.
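To make the “assemble your own timeline” point a bit more concrete: pulling a handful of RSS 2.0 feeds and merging them into one chronological stream can be sketched in a few lines of standard-library Python. The feed URLs below are placeholders, and real code would need error handling and caching, but that’s roughly all there is to the pull model.

```python
# Pull-based aggregation: fetch a few RSS 2.0 feeds and merge their items
# into one timeline, newest first. The URLs are placeholders.
from urllib.request import urlopen
from xml.etree import ElementTree
from email.utils import parsedate_to_datetime

FEEDS = [
    "https://example.com/blog/feed.xml",   # placeholder feed URLs
    "https://example.org/news/rss",
]

def fetch_items(url):
    with urlopen(url, timeout=10) as response:
        root = ElementTree.parse(response).getroot()
    for item in root.iter("item"):          # RSS 2.0: <channel> contains <item>s
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # assumes each item carries a pubDate, as most feeds do
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        }

def timeline(feeds):
    items = [entry for url in feeds for entry in fetch_items(url)]
    return sorted(items, key=lambda e: e["published"], reverse=True)

for entry in timeline(FEEDS)[:20]:
    print(entry["published"], entry["title"], entry["link"])
```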
I think Google shutting down the Reader is mostly another example of how Google is relentlessly redesigning their services to meet some grand plan without caring much about whether the user will be happy with the end result or not. It’s almost as if they are taking an engineering approach and have a hard time taking into account anything which is not a technical service requirement.
I guess it’s one thing to optimize Google’s infrastructure internally, where your dependencies are relatively few and clearly defined and you have other Google people at the other end of the line who know that these changes are necessary. Contrast this with a service used by millions of people, most of whom don’t really want anything to change.
For example, I think people didn’t really care whether Picasa got integrated with Google+, or whether Google Docs became part of Google Drive. So it might have made sense for Google’s long-term goal to have a single social platform which incorporates all these services somehow, but most people are lost somewhere along the line.
It’s not as if Google isn’t trying to get social right, however. Brian Shih, who once was product manager for Google Reader, says on Quora:
Ironically, I think the reason Google always wanted to pull the Reader team off to build these other social products was that the Reader team actually understood social (and tried a lot of experiments over the years that informed the larger social features at the company). Reader’s social features also evolved very organically in response to users, instead of being designed top-down like some of Google’s other efforts.
So what about the alternatives? What about Google+?
As thinkberg nicely pointed out, the difference is that Google Reader is (was) a productivity tool, while Google+ is a social network site. Google Reader is optimized to help you sift through many news items quickly, while Google+ wants you to linger (possibly to show you as many ads as possible). In the compressed layout, Reader uses exactly one line per item, showing you dozens on a single screen. In contrast, on Google+, you seldom get more than three items per screen.
Of course, you can share items on Google+, but the old Reader was excellent in that it presented shared items in the same compact view, grouped by user; on Google+, shared items are simply lost in your timeline stream.
Finally, and I’m still amazed that this hasn’t changed since Google+ was launched, there is no easy bookmark solution in Google+. In Reader, you could star an item to bookmark it for later reading, but there is no way to see all the items you +1’d on Google+. The closest thing I’ve come across is to have a circle called “Bookmarked” and reshare posts to that circle to bookmark them.
thinkberg said that this makes perfect sense for Google because they’re only interested in the +1s to improve their search results, not to provide value to the users. Of course, he was just joking, but there is some truth to this. After all, Google+ is a free service and to make sense to Google they need to leverage the information collected there in some other ways, for example in personalized search.
As thinkberg tweeted earlier
We will have to rethink the culture of free products (ad supported incl.). It’s most always a loss for the customer in the long run.
— Matthias L. Jugel (@thinkberg) March 14, 2013
the actual problem might also be that these are all free services. Such services are always hard because companies must find ways to exploit the data in some way to still make money, and customers or users have a hard time complaining if the service is changed in fundamental ways.
Or put differently, if I’d paid for Google Reader, I’d be pretty pissed now.
Having paying customers, on the other hand, makes everything so much easier. The company knows why it has the product, and it can also support small user bases. The problem, especially with new technology, is to get people to realize that a service is worth real money, but that is another topic.
So what about the other alternatives? Well, the problem is that Google Reader pretty much killed the market for classical RSS feed readers like bloglines. Most of the other companies mentioned in the various lists have taken a different approach, trying to create some nice-looking, newspaper-like experience from your feeds. Unfortunately, as I said above, the strength of Google Reader was that it let you quickly sift through huge amounts of information, and that’s something these newspaper-like apps cannot accomplish.
I personally like prismatic a lot, but that’s again something different. It’s pretty good at finding stuff which is interesting, but you cannot track hand selected news sources that way.
So just to compile this for posterity, here are the features I’d like to see:
Basically, someone give me the old Reader, and you’ll have me as a customer instantly.
I recently reviewed a number of Google Big Data papers. Today I’ll talk a bit about Megastore and Spanner, two distributed databases I wasn’t aware of before. Both are distributed databases with interesting features like full consistency and transactions, something most NoSQL proponents will tell you is impossible to scale.
Now I should mention that I’m not a distributed systems guy by training, and the papers are sufficiently vague about the exact details of how everything works (although they otherwise do an excellent job of describing all the essential components of such a complex system), but anyway, here is what I understood ;)
While these papers are probably more Big Data Base than Big Data Science, I still think they make for an interesting read, because they tell you a lot about what it takes to build a scalable and robust system.
You might not have heard of Megastore, but if you’ve used any of Google’s products besides search, you will have interacted with it: things like GMail, Calendar, Picasa (now morphed into Google+), Android Market (now “Google Play”), and AppEngine run on Megastore.
Megastore is built upon BigTable, Google’s key-value store which inspired countless open-source NoSQL databases like Apache Cassandra. However, Megastore isn’t schema-free but lets you define tables like in a standard SQL database. The mapping to BigTable columns is straightforward, and you can also specify that dependent tables should be co-located in the same BigTable for better performance.
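Here is a rough sketch of what such a mapping might look like. The row-key layout (entity-group key, then table name, then primary key) is my own simplification to illustrate the idea, not Megastore’s actual encoding.

```python
# Toy illustration of mapping schema'd tables onto a key-value store like
# BigTable. The row-key layout is a simplification, not Megastore's encoding.

store = {}  # stands in for a BigTable: row key -> {column: value}

def put_row(entity_group, table, primary_key, **columns):
    row_key = f"{entity_group}/{table}/{primary_key}"
    store.setdefault(row_key, {}).update(columns)

def get_row(entity_group, table, primary_key):
    return store.get(f"{entity_group}/{table}/{primary_key}", {})

# Rows from two "tables" that share the same entity-group prefix end up close
# together in the key space, which is what co-locating dependent tables buys you.
put_row("user:42", "Users", "42", name="Ada", email="ada@example.com")
put_row("user:42", "Photos", "7", title="Lovelace portrait")
print(get_row("user:42", "Users", "42"))
```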
Probably the most interesting feature is that Megastore is globally consistent (or ACID compliant, to be more exact). The authors argue that looser consistency is nice for scaling, but for the application developer, full consistency is so much easier to work with. And actually, I think they’re right. We’ve all heard the “eventual consistency is enough” talk a few times and have come to believe that often, it doesn’t really matter. But the truth is, knowing that the value you wrote is actually there, or that a group of related updates is rolled back if your program crashes, is very valuable.
So how does Megastore achieve distributed consistency (and by the way, from how I’m using the term consistency you can already see I’m not a distributed systems guy ;))?
The main idea seems to be that Megastore manages a distributed commit log. So-called write-ahead logs are pretty common in databases to guard against failures: every write action is first recorded in a log so that you can pick up after a crash and reapply the write operations in the log. In addition, the log also gives you a time ordering for the transactions.
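A write-ahead log is easy to picture in code. Here is a minimal single-node sketch (no fsync, no checksums, no compaction), just to show the append-first, apply-second pattern and the replay after a crash:

```python
import json, os

class WriteAheadLog:
    """Minimal single-node write-ahead log: append first, apply second,
    replay on restart. Real logs add fsync, checksums and compaction."""

    def __init__(self, path="wal.log"):
        self.path = path
        self.state = {}
        self._replay()

    def _replay(self):
        # After a crash or restart, reapply every logged operation in order.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                op = json.loads(line)
                self.state[op["key"]] = op["value"]

    def put(self, key, value):
        # 1. Record the intent durably (append to the log)...
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. ...and only then apply it to the in-memory state.
        self.state[key] = value

db = WriteAheadLog()
db.put("x", 1)
db.put("y", 2)
print(db.state)  # survives a restart because _replay() rebuilds it from the log
```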
Contrast this with the write path in, for example, Cassandra: there, writes can be initiated by any node, which only makes sure that a certain number of replicas have acknowledged the write. The end effect is that different nodes might see different orderings for the updates, or miss some of them entirely. Cassandra has other mechanisms like read repair to make sure all nodes eventually have the same data, but there is no guarantee.
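Roughly, such a quorum write looks like the toy sketch below. The replica objects are made up for the example, and real Cassandra does a lot more (hinted handoff for the stragglers, for instance); the point is only that success is declared as soon as enough replicas have answered.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

class SlowReplica:
    """In-memory stand-in for a replica; `delay_s` simulates a slow or cut-off node."""
    def __init__(self, delay_s=0.0):
        self.delay_s = delay_s
        self.data = {}
    def write(self, key, value):
        time.sleep(self.delay_s)
        self.data[key] = value

def quorum_write(key, value, replicas, write_quorum=2, timeout_s=1.0):
    """Send the write to every replica in parallel and report success as soon
    as `write_quorum` of them have acknowledged it; the rest catch up later."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.write, key, value) for r in replicas]
    acks = 0
    try:
        for future in as_completed(futures, timeout=timeout_s):
            if future.exception() is None:
                acks += 1
                if acks >= write_quorum:
                    return True
    except TimeoutError:
        pass                         # not enough replicas answered in time
    finally:
        pool.shutdown(wait=False)    # don't block on stragglers
    return acks >= write_quorum

# Two fast replicas are enough for a quorum; the slow one is simply left behind.
replicas = [SlowReplica(), SlowReplica(), SlowReplica(delay_s=2.0)]
print(quorum_write("x", 1, replicas, write_quorum=2))
```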
So how does Megastore achieve a distributed commit log? The standard way for distributed commits seems to be two-phase commit, which however requires a master node. Instead, Google used the Paxos protocol, which is basically a voting scheme to ensure that a majority of agents agree on a proposed value. Just to clarify, it is not about voting between a number of alternatives, it’s really about agreeing on a given value in a robust manner, ensuring that at least half of the agents have noticed and agreed to the value.
This algorithm was originally published by Leslie Lamport (yes, the LaTeX Leslie Lamport) in a paper written as if it reported on some historical Greek community (it even includes a bunch of Greek symbols), but if you’re interested, I recommend the “Paxos made simple” paper, which explains it in plain English.
So just to give you an idea of how this algorithm works: it proceeds in numbered rounds. In each round, the round number is first announced to all nodes, and the nodes return a promise to ignore all lower-numbered rounds. If a promise was received from a majority of nodes, another attempt is made to get a majority to accept the value. If that works, the value is broadcast to all who are interested. There are ways to make this all robust, including the election of a leader and choosing round numbers such that they are always larger than all previous ones; just check the original paper for the details ;).
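If you prefer code to Greek history, here is a compact sketch of a single Paxos decision, with the acceptor state and the two phases spelled out. It’s a toy version (no networking, no failures, no leader election), but it shows why a later round can only re-propose a value that was already accepted.

```python
class Acceptor:
    """One Paxos acceptor: remembers the highest round it has promised
    and the last value it has accepted."""
    def __init__(self):
        self.promised = -1          # highest round number promised so far
        self.accepted_round = -1
        self.accepted_value = None

    def prepare(self, round_no):
        # Phase 1: promise to ignore anything from lower-numbered rounds.
        if round_no > self.promised:
            self.promised = round_no
            return True, self.accepted_round, self.accepted_value
        return False, None, None

    def accept(self, round_no, value):
        # Phase 2: accept unless a higher-numbered round was promised meanwhile.
        if round_no >= self.promised:
            self.promised = round_no
            self.accepted_round = round_no
            self.accepted_value = value
            return True
        return False


def propose(acceptors, round_no, my_value):
    """One proposer attempt: returns the chosen value, or None if the round failed."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from a majority.
    promises = [a.prepare(round_no) for a in acceptors]
    granted = [(r, v) for ok, r, v in promises if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, we must re-propose that value
    # (the one from the highest-numbered round), not our own.
    prior = [(r, v) for r, v in granted if r >= 0]
    value = max(prior)[1] if prior else my_value

    # Phase 2: ask the acceptors to accept the value.
    acks = sum(a.accept(round_no, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, round_no=1, my_value="append txn 17"))   # -> 'append txn 17'
print(propose(acceptors, round_no=2, my_value="something else"))  # still 'append txn 17'
```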
Now voting on a value sounds pretty abstract, but the trick used in Megastore is to use Paxos to reach agreement on what will be appended to the commit log next. Basically, the initiating node says “OK, I’d like to commit this transaction next”, and if a majority agrees, it is ensured that a majority of nodes have synchronized commit logs.
So at each transaction start, Megastore first identifies the last committed transaction and a node which is up-to-date (or brings a node up-to-date). Then, Megastore reads the values at the transaction’s timestamp (BigTable stores multiple timestamped versions of each value). To commit the transaction, Paxos is used to get agreement on appending the transaction to the commit log.
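Put together, my (much-simplified) reading of that flow looks something like the sketch below. Every name here is made up, and `paxos_agree` is just a stand-in for a real Paxos round over the next log position, like the one sketched above.

```python
# Much-simplified reading of the Megastore-style transaction flow; all names
# are made up for illustration.

class Node:
    """Toy replica: a commit log whose entries are {key: value} write sets."""
    def __init__(self):
        self.log = []
    def last_committed_position(self):
        return len(self.log) - 1
    def read_at(self, key, position):
        # Replay the log up to `position` to get the value as of that point.
        value = None
        for entry in self.log[:position + 1]:
            value = entry.get(key, value)
        return value

def paxos_agree(nodes, position, value):
    # Stand-in for a real Paxos round: here we just append on every node.
    for n in nodes:
        n.log.append(value)
    return True

def run_transaction(nodes, read_keys, compute_writes):
    # 1. Find the highest committed log position and a node that has it.
    last = max(n.last_committed_position() for n in nodes)
    node = next(n for n in nodes if n.last_committed_position() == last)
    # 2. Snapshot-read the values as of that position (BigTable keeps
    #    timestamped versions, so reads don't block writers).
    snapshot = {k: node.read_at(k, last) for k in read_keys}
    # 3. Compute the writes and try to win the next log position via Paxos;
    #    if another transaction wins it first, the caller retries.
    writes = compute_writes(snapshot)
    return "committed" if paxos_agree(nodes, last + 1, writes) else "retry"

nodes = [Node(), Node(), Node()]
print(run_transaction(nodes, ["balance"],
                      lambda snap: {"balance": (snap["balance"] or 0) + 10}))
```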
These are the main ideas; there is a host of other features, like “dirty reads”, “entity groups” which partition the data (and also restrict the consistency guarantees to within entity groups), a messaging system, two-phase commit for transactions across entity groups, optimizations over plain Paxos to get fewer rounds of communication, management servers to track which nodes are up-to-date, and so on.
I’ll just briefly mention Spanner, yet another distributed database. The main improvements over Megastore are a SQL-like query language, better write performance (Megastore writes seem to usually take a few hundred milliseconds), and global transactions and consistency (not just within entity groups as in Megastore). So basically, Spanner is a distributed database which doesn’t feel different from a “normal” SQL database. Spanner was developed to be the backend for F1, the RDBMS behind Google’s online ad business, where it has replaced a farm of sharded MySQL servers.
Spanner again heavily uses Paxos for various forms of coordination, but also classical two-phase commit. The main feature seems to be that Spanner is able to assign globally meaningful timestamps to commits through the use of GPS and atomic clocks. Yes, I’m not kidding: apparently, you can get rack-mount atomic clocks for a few thousand dollars.
Using such clocks, Spanner globally agrees on a commit timestamp using Paxos, giving you global consistency: on a read, you get the timestamp of the last commit and retrieve the data at that timestamp. That way, Spanner also realizes lock-free read-only transactions.
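Here is a toy version of that idea, loosely modeled on the TrueTime description in the paper: clocks report an uncertainty interval, commits wait out that uncertainty before becoming visible, and reads just pick the latest version at or before their timestamp. The epsilon value and the API below are simplified stand-ins, not Spanner’s actual numbers or interface.

```python
import time

CLOCK_UNCERTAINTY_S = 0.005   # stand-in for the clock uncertainty (a few ms)

def tt_now():
    """TrueTime-style interval: the true time is guaranteed to lie inside it."""
    t = time.time()
    return (t - CLOCK_UNCERTAINTY_S, t + CLOCK_UNCERTAINTY_S)

def commit(store, key, value):
    # Pick the commit timestamp at the top of the current uncertainty interval...
    _, commit_ts = tt_now()
    # ...and wait out the uncertainty ("commit wait") so that no clock anywhere
    # can still read a time earlier than commit_ts once the write is visible.
    while tt_now()[0] <= commit_ts:
        time.sleep(CLOCK_UNCERTAINTY_S / 10)
    store.setdefault(key, []).append((commit_ts, value))
    return commit_ts

def snapshot_read(store, key, read_ts):
    """Lock-free read: return the latest version committed at or before read_ts."""
    versions = [(ts, v) for ts, v in store.get(key, []) if ts <= read_ts]
    return max(versions)[1] if versions else None

store = {}
ts = commit(store, "x", 42)
print(snapshot_read(store, "x", read_ts=ts))        # -> 42
print(snapshot_read(store, "x", read_ts=ts - 1.0))  # -> None (before the commit)
```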
So what can we learn from these examples? What I sort of admire is Google’s courage to take on such complex problems. You often hear that global consistency is impossible to scale or that distributed transactions are a nightmare. While this is certainly massively complex technology, it is possible nevertheless. You might want to invest in some atomic clocks, though ;)
I’ve been bashing Cassandra a bit, but one also has to keep in mind that global consistency comes at a cost. Megastore writes are slow. It’s ok if you’re in an interactive application and only need to perform one transaction to complete an action, but it’s probably the wrong backend for data crunching. Likewise, Spanner’s latency is reported to be 8ms for reads on average, and 72-100ms for commits on average, which is incredibly fast given what they accomplish, but still slower than the performance you get out of Cassandra.
So it might be hard, but it’s possible. All too often, people tell us something is impossible when they’re really just defending their favorite tool’s feature set.