Tuesday, August 24, 2010

Cassandra Garbage Collection Tuning

File under: Machine Room

Last week, I discussed our experiences with configuring Cassandra. As some of the commenters pointed out, my conclusions about the garbage collection were not fully accurate. I did some more experiments with different kinds of GC settings, which I will discuss in this post. The bottom line is that it is not that hard to get Cassandra to run in a stable manner, but you still need a lot more heap space than a back-of-the-envelope computation based on the Memtable sizes would suggest.

This post got a bit technical and long as well. The take-home message is that I obtained much more stable GC behavior by lowering the threshold at which garbage collections of the old generation are triggered, leaving a bit more headroom. With that, I managed to run Cassandra in a stable manner with 8GB of heap space.

Short overview of Java garbage collection

(For the full details, see the article Memory Management in the Java HotSpot® Virtual Machine)

Java (or more specifically, the HotSpot JVM) organizes the heap into two generations, the “young” and the “old” generation. The young generation furthermore has a “survivor” section where objects which survive their first garbage collection are stored. As objects survive several collections, they eventually migrate to the old generation.

The JVM runs two different kinds of garbage collections on the two generations: more frequent ones on the young generation, and less frequent ones on the old generation. The JVM implements different strategies for these collections; the ones relevant for this post are the parallel collector for the young generation (ParNew), the concurrent mark-and-sweep (CMS) collector for the old generation, and the incremental variant of CMS.

There are many different “-XX” flags to tune the garbage collection. For example, you can control the size of the young generation, the size of the survivor space, when objects will be promoted to the old generation, when the CMS is triggered, and so on. I’ll try to give an overview of the most important options in the following section.

Tuning GC for Cassandra

GC parameters are contained in the bin/cassandra.in.sh file in the JAVA_OPTS variable. One tool for monitoring is the jconsole program I already mentioned, which comes with the standard JDK in the bin directory. With it, you can also look at the different generations (“young eden”, “young survivor”, and “old”) separately. Note, however, that jconsole does not seem to count CMS GCs correctly (the counter is always 0). It is better to enable GC logging with

JAVA_OPTS=" \
  ...
-XX:+PrintGCTimeStamps \
-XX:+PrintGCDetails \
-Xloggc:<file to log gc outputs to>"

As I said in the previous post, the most important part is to give yourself sufficient heap space. Still, there are other options you might wish to tweak.

Control when CMS kicks in

I’ve noticed that often a CMS GC would only start after the heap is almost fully populated. Since the CMS process takes some time to finish, the problem is that the JVM runs out of space before the collection completes. You can set the percentage of used old generation space which triggers the CMS with the following options:
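
  -XX:CMSInitiatingOccupancyFraction=<percentage>
  -XX:+UseCMSInitiatingOccupancyOnly

(These are the same two flags that appear in the experiments below, where I used a value of 80%.)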

Without these options, the JVM tries to estimate the optimal percentage itself, which sometimes triggers the CMS cycle too late.

Incremental CMS

I found recommended options for incremental CMS in this article:

-XX:+UseConcMarkSweepGC \
-XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing \
-XX:CMSIncrementalDutyCycleMin=0 \
-XX:CMSIncrementalDutyCycle=10

With these options, the mark and sweep runs in small slices, leading to a lower CPU load. Interestingly, incremental CMS also leads to somewhat different heap behavior (see below).

Tuning young generation size and migration

The following quote comes from this very good tutorial on GC tuning: “You should try to maximize the number of objects reclaimed in the young generation. This is probably the most important piece of advice when sizing a heap and/or tuning the young generation.”

There are basically two ways to control the number of objects reclaimed in the young generation: the size of the young generation, and the rules for when objects get promoted to the old generation.

Here are the main options for tuning the size of the young generation.
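
For the standard HotSpot JVM, the relevant flags are the following (only -Xmn appears explicitly later in this post; the others are listed for completeness):

  -Xmn<size>                                   (fixes both the initial and the maximum young generation size)
  -XX:NewSize=<size> / -XX:MaxNewSize=<size>   (set the initial and maximum size separately)
  -XX:NewRatio=<n>                             (size of the old generation as a multiple of the young generation)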

The size of the survivor spaces and the promotion of objects to the old generation are controlled by the following options:
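
  -XX:SurvivorRatio=<n>          (ratio of the eden size to the size of a single survivor space)
  -XX:MaxTenuringThreshold=<n>   (how many young generation collections an object may survive before it is promoted)

The experiments below use -XX:SurvivorRatio=8 and -XX:MaxTenuringThreshold=1.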

There are still more options; check out the talk above for more information.

Some experiments

In order to test the different settings, I ran the following benchmark: a somewhat memory-constrained Cassandra instance with only 1MB per Memtable, and 400MB initial and maximum heap, against which I analyzed about 700,000 tweets. Such a run took about one hour on my laptop (2GB RAM, Intel Core2 @ 2GHz, 500GB Seagate Momentus 7200.4 hard drive). I did some light coding during the tests.

I compared the following settings:

Cassandra's default GC settings (parallel young generation collection plus CMS):

  -XX:SurvivorRatio=8
  -XX:MaxTenuringThreshold=1
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled

The same CMS setup, but with an explicit initiating occupancy threshold of 80%:

  -XX:SurvivorRatio=8
  -XX:MaxTenuringThreshold=1
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=80
  -XX:+UseCMSInitiatingOccupancyOnly

Incremental CMS (on a single CMS thread):

  -XX:+UseConcMarkSweepGC
  -XX:ParallelCMSThreads=1
  -XX:+CMSIncrementalMode
  -XX:+CMSIncrementalPacing
  -XX:CMSIncrementalDutyCycleMin=0
  -XX:CMSIncrementalDutyCycle=10

The following plot shows the size of the used old generation heap, both as a smoothed mean (running mean over 5 minutes) and as the cumulative maximum. Note that the raw data is much more jagged, basically covering the space between the cms-incremental line and the respective maximum.

In principle, all settings worked quite well, although each got a bit of stress from the garbage collector as the heap ran full. This sometimes resulted in timeouts as requests took a bit too long.

From this plot we can conclude that explicitly triggering the CMS at a lower threshold leaves more headroom in the old generation than the JVM's own estimate, and that incremental CMS leads to a somewhat different heap usage pattern.

Conclusion

In summary, a bit of garbage collection tuning can help to make Cassandra run in a stable manner. In particular, you should set the CMS threshold a bit lower, and probably also experiment with incremental CMS if you have enough cores. Setting the CMS threshold to 75%, I got Cassandra to run well in 8GB without any GC-induced glitches, which is big progress compared to the previous post.
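
For reference, here is a minimal sketch of what such a JAVA_OPTS could look like, assuming an 8GB heap and combining the flags discussed above (the exact heap sizes and the log file path are illustrative, not a drop-in configuration):

JAVA_OPTS=" \
  -Xms8G \
  -Xmx8G \
  -XX:SurvivorRatio=8 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintGCDetails \
  -Xloggc:<file to log gc outputs to>"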

Credits: Cassandra monitoring with JRuby and jmx4r, plotting with matplotlib.

Update

Based on Oleg Anastasyev's suggestion, I re-ran the experiments and also logged the CPU usage. I ran the new experiments on one of our 8-core cluster computers, which had no other workload, because my laptop has only two cores.

Since the server is a 64-bit machine, I had to adjust the minimal heap from 400MB up to 500MB, as pointers need a bit more space and Cassandra ran out of memory with 400MB. I also fixed the size of the young generation at 100MB with the -Xmn option (above, I hadn’t specified it, which led to different young generation sizes for the default and the CMS garbage collectors). In addition, I used the option -XX:ParallelCMSThreads=4 to restrict the number of threads used in the incremental CMS to 4 (the server had 8 available cores).
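
In terms of flags, the changes just described amount to roughly the following additions on top of the settings listed above:

  -Xms500M -Xmx500M          (larger minimum heap for the 64-bit JVM)
  -Xmn100M                   (fixed young generation size)
  -XX:ParallelCMSThreads=4   (only for the incremental CMS setting)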

Here are the results for the heap (mean averaged over 60 seconds):

And the results for the CPU load (running average over 30 seconds):

As you can see, the incremental version does not perform as well as in the original case above. (Apart from the 32-bit vs. 64-bit distinction, I used the same version of the JVM; however, the server was about twice as fast, which probably led to different overall timings.)

Fixing the size of the young generation also reveals that the JVM default GC does not perform as well as the others. Still, the setting with the lower CMS threshold performs best. Also, if you look at the CPU charts, the incremental CMS leads to only a slightly higher CPU load, and the difference doesn’t really look statistically significant.

One final remark on the CMS threshold: you should of course make sure that the CMS threshold is larger than the normal heap usage. As the amount of used heap increases when you add more data (index tables growing, etc.), you should check from time to time whether you still have enough memory. Otherwise, you will get continuous triggering of the CMS collection.
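
One simple way to keep an eye on this (assuming you have shell access to the node and the JDK tools available) is jstat, which reports the old generation occupancy as a percentage in the O column, so you can compare it against the CMSInitiatingOccupancyFraction you configured:

  jstat -gcutil <cassandra pid> 10000

The last argument is the sampling interval in milliseconds.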

Posted by Mikio L. Braun at Tue Aug 24 14:30:00 +0200 2010
