Marginally Interesting: More Google Big Data papers: Megastore and Spanner

I recently reviewed a number of Google Big Data papers. Today I’ll talk a bit about Megastore and Spanner, two distributed databases I wasn’t aware of before. Both are distributed databases with interesting features like full consistency and transactions, something most NoSQL proponents will tell you is impossible to scale.

Now I should mention that I’m not a distributed systems guy by training, and the papers are sufficiently vague in the exact details of how they work (although they otherwise do an excellent job of describing all the essential components of such a complex system), but anyway, here is what I understood ;)

While these papers are probably more Big Data Base than Big Data Science, I still think these make an interesting read because they tell you a lot about what it takes to build a scalable and robust system.

Megastore: BigTable + transactions + schema

You might not have heard of Megastore, but if you’ve used any of Google’s products beside search, you will have interacted with it as things like GMail, Calendar, Picasa (now morphed into Google+), Android Market (now “Google Play”), or AppEngine run on Megastore.

Megastore is build upon BigTable, Google’s key-value store which inspired countless open-source NoSQL databases like Apache Cassandra. However, Megastore isn’t schema-free but lets you define tables like in a standard SQL database. The mapping to BigTable columns is straightforward and you can also specify to collocate dependent tables in the same BigTable for better performance.

Probably the most interesting feature is that Megastore is globally consistent (or ACID compliant, to be more exact). The authors argue that looser consistency is nice for scaling, but for the application developer, consistency is so much easier to work with. And actually, I think they’re right. We’ve all heard the “eventually consistency is enough” talk a few times and have come to believe that often, it doesn’t really matter. But the truth is, to know that the value you wrote is actually there, or to know that a group of related updates is rolled back if your program crashes, is very valuable.

Consistency through distributed commit log

So how does Megastore achieve distributed consistency (and by the way I’m using the term consistency you already see I’m not a distributed systems guy ;))?

The main idea seems to be that Megastore manages a distributed commit log. So-called write ahead logs are pretty common in databases to guard against failures. Every write action is first recorded in a log so that you can pick up after a crash and reapply the write operations in the log. In addition, the log also gives you a time ordering for the transactions.

Contrast this with the write actions in, for example, Cassandra: There, writes can be initiated by any node and make sure a certain number of replicates have acknowledged the write. The end effect is that different nodes might see different orderings for the updates, or some not at all. Cassandra has other mechanisms like read-repair to make sure all nodes eventually have the same data, but there is no guarantee.

So how does Megastore achieve a distributed commit log? The standard way for distributed commits seems to be two-phase commit, which however requires a master node. Instead, Google used the Paxos protocol, which is a basically a voting scheme to ensure that a majority of agents agree on a proposed value. Just to clarify, it is not about voting between a number of alternatives, it’s really about agreeing on a given value in a robust manner to ensure that at least half of the agents have noticed and agreed to the number.

The Paxos algorithm

This algorithm was originally published by Leslie Lamport (yes, the LaTeX Leslie Lamport) in a paper which was written as if it reported on some historical Greek community (also includes a bunch of Greek symbols), but if you’re interested, I recommend the “Paxos made simple” paper which explains it in plain English.

So just to give you an idea how this algorithm works, the algorithm goes through numbered rounds. In each round, the number of the round is first announced to all nodes, and the nodes return a promise to forget about all previous rounds. If a promise was received from the majority of nodes, another attempt is made to get a majority to accept the value. If that works, the value is broadcast to all who are interested. It seems there are ways to make this all robust, including the election of a leader and the identification of the round number such that they are always larger than all previous rounds, just check the original paper for the details ;).

Now voting on a value sounds pretty abstract, but the trick used in Megastore is to use Paxos for reaching agreement on what will be appended to the commit log. Basically, the initiating node says “OK, I’d like to work to commit on this transaction here next”, and if all agree, it is ensured that the majority of nodes has synchronized commit logs.

Wrap-up: Read and write in Megastore

So on each transaction start, Megastore first identifies the last transaction committed and identifies a node which is up-to-date, or brings a node up-to- date. Then, Megastore reads the values of the transaction’s timestamp (BigTable stores multiple versions with timestamp of each value). To commit the transaction, Paxos is used to get agreement on appending the transaction to the commit log.

These are the main ideas, there is a host of other features, like “dirty reads”, “entity groups” which partition the data (and also restrict the consistency to within entity groups), a messaging system, two-phase commit for transactions across entity groups, optimizations over plain Paxos to get fewer rounds of communications, management servers to track which nodes are up-to-date, and so on.

Spanner: one atomic clock per data center

I’ll just briefly mention Spanner, yet another distributed database. The main improvements over Megastore are a SQL-like query language, better write performance (Megastore writes seem to usually take a few hundred milliseconds), and global transactions and consistency (not just within entitiy groups as in Megastore). So basically, Spanner is a distributed database which doesn’t feel different from a “normal” SQL database. Spanner has been developed to be the backend for F1, yet another RDBMS behind Google’s online ad business, and it has replaced a farm of sharded MySQL servers.

Spanner again heavily uses Paxos for various forms of coordination, but also classical two-phase commits. The main feature seems to be that Spanner is able to assign globally meaningful timestamps to commits through the use of GPS and atomic clocks. Yes, I’m not kidding, apparently, you can have rack mount atomic clocks for a few thousand dollars.

Using such clocks, Spanner globally agrees on a commit timestamp using Paxos, giving you global consistency: On a read, you get the timestamp of the last commit and retrieve the date at that timestamp. That way, Spanner also realises lock-free read-only transactions.

Distributed Consistency is hard, but not impossible

So what can we learn from these examples? What I sort of admire is Google’s courage to take on such complex problems. You often hear that global consistency is impossible to scale or that distributed transactions are a nightmare. While this is certainly massively complex technology, it is possible nevertheless. You might want to invest in some atomic clocks, though ;)

I’ve been bashing Cassandra a bit, but one also has to keep in mind that global consistency comes at a cost. Megastore writes are slow. It’s ok if you’re in an interactive application and only need to perform one transaction to complete an action, but it’s probably the wrong backend for data crunching. Likewise, Spanner’s latency is reported to be 8ms for reads on average, and 72-100ms for commits on average, which is incredibly fast given what they accomplish, but still slower than the performance you get out of Cassandra.

So it might be hard, but it’s possible. All too often, people tell us it’s impossible but sometimes they’re just defending their favorite tools feature set.

Related posts:

Big Data beyond MapReduce: Google’s Big Data papers

Posted by Mikio L. Braun at 2013-03-11 15:35:00 +0000

Stream Processing has no Query Layer

Lunch talk: Google Reader