Also see this follow-up post.
For the last few months we have been working on migrating the main database for twimpact to Cassandra. When we first started using Cassandra with the default configuration, we quickly ran into all kinds of performance and stability issues. For example, Cassandra would consume huge amounts of memory and at some point do nothing but garbage collection, or requests would time out again and again.
It turns out that Cassandra, like any sufficiently complex piece of software, requires proper configuration and usage. In this post, I try to summarize some basic facts about Cassandra and our experience with configuring it properly. The post is a bit long, so here are the main points first:

- Give Cassandra a lot of RAM; its memory consumption is much higher than the Memtable thresholds alone would suggest.
- Access Cassandra from several threads at once, and adjust ConcurrentReads and ConcurrentWrites to match.
- Don’t store several hundred thousand objects under a single key, or reads and writes on that row will take seconds.
- Don’t set request timeouts too low (30 seconds or more), otherwise slow reads will simply time out on every retry.
OK, here is the absolute minimum you need to know to understand what I’m talking about. Note that I’ll only discuss single node configurations. With clustering, it becomes somewhat more complex.
Cassandra’s basic unit of storage is the column family, which is like a hash of hashes. Put differently, you get a keyed index into objects which are themselves stored as hashes.
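As a mental model, you can think of a column family as nested sorted maps. The following sketch is purely illustrative (the row key and column names are made up), not how Cassandra stores anything internally:

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ColumnFamilyModel {
        public static void main(String[] args) {
            // row key -> (column name -> column value), columns kept in sorted order
            SortedMap<String, SortedMap<String, byte[]>> cf = new TreeMap<>();

            // writing a column: look up (or create) the row, then set the column
            cf.computeIfAbsent("user:1234", k -> new TreeMap<>())
              .put("name", "mikio".getBytes());

            // reading it back
            byte[] value = cf.get("user:1234").get("name");
            System.out.println(new String(value));
        }
    }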
Cassandra writes first go to an on-disk commit log (in case the server crashes) and are then applied to in-memory Memtables. When these get full, they are flushed to on-disk SSTables, which are immutable. Background threads then perform garbage collection and compaction on these SSTables.
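Conceptually, the write path looks something like the sketch below. This is only an illustration of the idea; the class and method names are invented and it is nothing like Cassandra’s actual code:

    import java.util.SortedMap;
    import java.util.TreeMap;

    // Hypothetical sketch of the write path: append to the commit log first,
    // then update the in-memory Memtable, and flush to an immutable SSTable
    // once a threshold is reached.
    public class WritePathSketch {
        private final StringBuilder commitLog = new StringBuilder();  // stands in for the on-disk log
        private SortedMap<String, String> memtable = new TreeMap<>(); // in-memory, sorted
        private final int flushThreshold = 1000;

        public void write(String key, String value) {
            commitLog.append(key).append('=').append(value).append('\n'); // 1. durable append
            memtable.put(key, value);                                     // 2. in-memory update
            if (memtable.size() >= flushThreshold) {
                flushToSSTable(memtable);                                 // 3. write an immutable SSTable
                memtable = new TreeMap<>();                               // start a fresh Memtable
            }
        }

        private void flushToSSTable(SortedMap<String, String> full) {
            // in Cassandra this writes a sorted, immutable file; compaction later merges such files
        }
    }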
Cassandra is written following the SEDA principle. This means that Cassandra is basically a collection of independent modules called stages which communicate through message queues. These modules can dynamically adapt their workload by either adjusting the number of worker threads or by throttling their operation. The most important stages are the ROW-MUTATION-STAGE (which takes care of writes) and the ROW-READ-STAGE (which does reads). See also this blog post for more detailed information.
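A stage is essentially a named worker pool with its own queue, which is also what the “active” and “pending” numbers in the nodetool output further down reflect. Here is a minimal sketch of that idea (invented names, not Cassandra’s actual classes):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of a SEDA-style stage: a fixed number of worker
    // threads pulling tasks from a queue.
    public class Stage {
        private final String name;
        private final ThreadPoolExecutor executor;

        public Stage(String name, int workers) {
            this.name = name;
            this.executor = new ThreadPoolExecutor(
                    workers, workers, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>());
        }

        public void submit(Runnable task) {
            executor.execute(task); // queued if all worker threads are busy
        }

        public int active()  { return executor.getActiveCount(); } // "Active" in tpstats terms
        public int pending() { return executor.getQueue().size(); } // "Pending" in tpstats terms
    }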
Cassandra is configured through the file conf/storage-conf.xml, relative to where Cassandra is started.
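For orientation, everything sits below a single <Storage> root element. Here is a heavily abridged sketch (the cluster, keyspace and column family names are made up, and a real 0.6 file contains quite a few more elements):

    <Storage>
      <ClusterName>twimpact</ClusterName>
      <Keyspaces>
        <Keyspace Name="twimpact">
          <ColumnFamily Name="Retweets" CompareWith="UTF8Type"/>
        </Keyspace>
      </Keyspaces>
      <!-- the Memtable thresholds, ConcurrentReads/ConcurrentWrites and
           RpcTimeoutInMillis discussed below also go in here -->
    </Storage>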
The standard tools for monitoring Java processes also apply to Cassandra, including jvisualvm and jconsole, which are included in Sun’s^H^H^H^HOracle’s JDK.
Cassandra also exposes a number of management beans (see the tab “MBeans” in jconsole). The interesting ones are:

- org.apache.cassandra.service.StorageService has statistics about read and write latencies.
- In org.apache.cassandra.db.ColumnFamilyStores you find beans for each column family, where you can see how many bytes they are using and how many SSTables (the on-disk storage) exist.
- Under org.apache.cassandra.concurrent, you can see the individual stages in the Cassandra server.
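To browse these beans, you can simply point jconsole at Cassandra’s JMX port or attach to the process directly. The port below is an assumption (8080 was the stock setting in 0.6, if I remember correctly; check the com.sun.management.jmxremote.port option in bin/cassandra.in.sh):

    # attach to a locally running Cassandra via its JMX port
    jconsole 127.0.0.1:8080

    # or attach directly to the process
    jconsole $(pgrep -f CassandraDaemon)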
You can also get a quick overview of the stages with the nodetool command. Let’s say Cassandra is running on 127.0.0.1; then you type

    bin/nodetool -host 127.0.0.1 tpstats

and get an overview of all stages and how many active and pending jobs you have.
If everything is running okay, you will normally see a 0 or 1 in the Active and Pending columns, and a really huge number in the Completed column. If one of the stages (in particular the ROW-READ-STAGE) has a large number of pending jobs, then something is amiss (see below).
*Update*: As some commenters pointed out, the following paragraph is misleading with respect to the different kinds of garbage collection. The parallel garbage collector is actually a stop-the-world garbage collector working on the so-called “young generation”. The term “parallel” refers to it being implemented in a multi-threaded fashion for better speed. The ConcurrentMarkSweep GC is a “Full GC”, but apart from the initial mark and the final cleanup phase, it is really concurrent, meaning that it runs in parallel to the normal process.
What remains true is that Cassandra probably needs a lot more memory than a back-of-the-envelope computation based on the Memtable sizes would suggest. But it is perfectly possible to have it run in a stable manner. For more information, see the follow-up blog post.
Probably the most important configuration options are the thresholds which control when Memtables are flushed. Here is a list of the options (the Cassandra wiki is not up to date with the names: MemtableSizeInMB became MemtableThroughputInMB in 0.6):
MemtableThroughputInMB       | How many megabytes of data to store before flushing
BinaryMemtableThroughputInMB | The same for a special kind of Memtable (haven’t met those yet)
MemtableOperationsInMillions | How many operations on a Memtable before flushing
MemtableFlushAfterMinutes    | How many minutes to wait before flushing
All these parameters apply per column family.
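In 0.6 these thresholds are elements in conf/storage-conf.xml; each column family gets its own Memtable that is limited by them. As a rough example (the values are only illustrative, not a recommendation):

    <MemtableThroughputInMB>16</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.1</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>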
Now you might think that if you have ten column families and MemtableThroughputInMB is 128 (the default!), you will need about 1.3 GB of RAM. However, for reasons not totally clear to me, Cassandra will use up a lot more RAM. In our twimpact installation we currently have 12 column families. If I set the limit to 1 MB for each column family, Cassandra needs at least 512 MB. On our main database, I’ve currently set the limits to 16 MB, and after some burn-in time, Cassandra happily expands first to 10 GB and then to the whole 16 GB allotted.
You should closely monitor the memory consumption using a tool like jconsole. Cassandra’s performance will be much worse if you don’t give it enough RAM as it will at some point start to do mostly garbage collection. In particular, you have to watch out for the ConcurrentMarkSweep GCs. Cassandra is configured to use a parallel garbage collector which has two kinds of collection: ParNew runs totally in parallel to other threads and collects small amounts of data. If that does not suffice, the JVM performs a full GC with ConcurrentMarkSweep which takes much longer.
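The heap size and the choice of garbage collector are set via the JVM options in bin/cassandra.in.sh (at least that is where they live in 0.6; the variable name and defaults may differ in your version). Something along these lines, where the 8 GB heap is just an example for a dedicated machine:

    # excerpt in the spirit of bin/cassandra.in.sh (0.6) -- heap values are only an example
    JVM_OPTS=" \
            -Xms8G \
            -Xmx8G \
            -XX:+UseParNewGC \
            -XX:+UseConcMarkSweepGC \
            -XX:+CMSParallelRemarkEnabled"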
Here is a screenshot of our Cassandra process after running for about a day. This behavior should be considered “normal”. In particular, the JVM hasn’t felt forced to do a full GC so far.
I’ve searched the code a bit and done some profiling with jvisualvm, but I haven’t been able to really understand what is going on. Apparently, all the data is contained in Memtables, but normally this data should be freed once a Memtable is flushed. Cassandra uses soft references to trigger the deletion of some files, and I’ve seen soft references lead to drastically different memory behavior (read: more memory), but I’m not sure this is really the case here. There is also a bug report which seems to address a similar problem. Restarts also help, but are certainly out of the question in production.
In summary, for now give Cassandra a lot of RAM and everything should be OK.
Cassandra really loves concurrent access. You can get at least a fourfold increase in throughput if you access Cassandra from four to eight threads at the same time.
For twimpact, it was actually a bit hard to make the analysis multi-threaded because we need to update a lot of statistics and Cassandra has no provision for locking, but it certainly was worth it.
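As a sketch, fanning work out over a small thread pool looks roughly like this; fetchTweets and analyzeTweet are placeholders for whatever your client actually does against Cassandra, not a real API:

    import java.util.Collections;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ConcurrentAnalysis {
        public static void main(String[] args) throws InterruptedException {
            // four to eight threads gave us the biggest win; adjust to taste
            ExecutorService pool = Executors.newFixedThreadPool(8);

            for (final String tweet : fetchTweets()) {
                pool.execute(new Runnable() {
                    public void run() {
                        analyzeTweet(tweet); // placeholder for the actual Cassandra reads and writes
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static Iterable<String> fetchTweets() { return Collections.emptyList(); } // stand-in for the tweet source
        static void analyzeTweet(String tweet) { /* roughly 15-20 Cassandra operations per tweet */ }
    }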
One thing to remember, though, is to adjust the two configuration options ConcurrentReads and ConcurrentWrites accordingly. These two directly set the number of worker threads for the ROW-READ-STAGE and the ROW-MUTATION-STAGE, meaning that they aren’t adjusted automatically. It probably makes sense to have at least as many threads as you expect to use. However, as we’ll discuss next, reads can take a long time, so it’s probably safe to have enough threads there.
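These two also live in conf/storage-conf.xml, for example (the numbers are purely illustrative):

    <ConcurrentReads>16</ConcurrentReads>
    <ConcurrentWrites>32</ConcurrentWrites>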
One operation where Cassandra isn’t particularly fast is storing all your data in a single row (i.e. under a single key). Since the columns are automatically sorted, single rows are useful for storing indices or other kinds of sorted lists. But you should be careful not to store several hundred thousand objects in there, otherwise reads and writes will get pretty slow (I’m talking about several seconds here, not milliseconds).
For us, tweetmeme was such a case, since they apparently have many different messages retweeted (even more than justinbieber, our other gold standard when it comes to excessive numbers of retweets ;)).
Depending on the client you are using, there might be some timeouts on the sockets (which is generally a good thing). In addition, there is also the configuration parameter RpcTimeoutInMillis, which is an internal timeout for requests.
Be careful not to set those parameters too low; I recommend 30 seconds or longer. Reads can take up to several seconds (see above) if you accidentally have really large rows. The problem is that, unlike with network problems, the request will time out on each retry, and your application will basically hang.
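The internal timeout is again a conf/storage-conf.xml setting; following the recommendation above, something like:

    <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>

And set whatever socket timeout your client library exposes at least as high.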
In summary, I think Cassandra is a very interesting project. They certainly have the right combination of technology in there: Dynamo’s fully distributed design, Bigtable’s main storage principles, robustness to failures. In our second reanalysis run, we’re still getting about 290 tweets/s analyzed with already 50 million tweets in the database (remember that each tweet analysis involves about 15-20 operations with a read/write ratio of about 1:2). On the other hand, getting there probably requires a bit more configuration and monitoring than for more mature databases.
Posted by Mikio L. Braun at 2010-08-11 16:56:00 +0000