Cassandra/Tuning

Compaction

List current compactions: nodetool compactionstats

Check the compaction status of a particular keyspace:

nodetool -h restbase1005.eqiad.wmnet cfstats local_group_wikipedia_T_parsoid_html.data
(...)
SSTables in each level: [23/4, 52/10, 190/100, 1269/1000, 2817, 0, 0, 0, 0]

This means that level 0 (new data, potentially overlapping) has 23 sstables when it should only have four (the /4 is only shown when the actual count exceeds the threshold), i.e. compactions for this keyspace are behind. Since level-0 sstables can overlap, a query may have to read a large number of sstables, which is bad for latency. See this Stack Exchange post for more background on leveled compaction.

Concurrency

Set concurrent_compactors to 10 to avoid compaction being CPU-bound. With leveled compaction there can be many concurrent compactions on the same keyspace; we observed 10 concurrent compactions for a single, large keyspace in prod.
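For reference, the corresponding cassandra.yaml setting would look roughly like this (a sketch; 10 is the value discussed above):

# cassandra.yaml: allow up to 10 parallel compaction tasks
concurrent_compactors: 10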

Throughput

Compaction throughput affects GC pressure and iowait. Generally, compaction throughput should leave enough headroom for request processing without backing up. This means that iowait should average at most 5% when compacting at full concurrency. In production (3x Samsung SSD RAID0), this is achieved with compaction_throughput_mb_per_sec: 120. A good method to check this is to dynamically set a throughput using

for i in `seq 1 6`; do
    nodetool -h restbase100$i.eqiad.wmnet setcompactionthroughput 120
done

The throughput setting seems to apply only to new compactions, so wait until all compactions listed in nodetool compactionstats at the time of the change have finished and been replaced by new ones before starting to watch vmstat for iowait.
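A minimal way to watch iowait (a sketch; the 5-second interval is arbitrary):

# print system stats every 5 seconds; the "wa" column is the iowait percentage,
# which should average at most ~5% while compacting at full concurrency
vmstat 5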

Read and write concurrency

Limiting the concurrency of read and write request processing is a major load-limiting and load-shedding knob in Cassandra.

A moderate concurrent_reads (e.g. 18) limits the heap pressure induced by read bursts. At 96, we saw nodes spiral into OOM death under heavy read load, which in turn caused writes to back up.

Similarly, a concurrent_writes of 18 or below limits memory used during write bursts. Additionally, Cassandra relies on relatively low (2s) write timeouts to avoid OOM. Setting that timeout to 5s significantly increased OOM probability under load, at least in combination with a high concurrent_writes setting of 96. It might be worth testing a low concurrent_writes of 16-18 in combination with a longer write timeout.
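For reference, the relevant cassandra.yaml settings would look roughly like this (a sketch using the conservative values discussed above; the write timeout is specified in milliseconds):

# cassandra.yaml: conservative request concurrency
concurrent_reads: 18
concurrent_writes: 18
# keep the write timeout relatively low (2s) so overload is shed before the heap fills up
write_request_timeout_in_ms: 2000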

SSTable indexing interval

Cassandra maintains index offsets per partition to speed up lookups in the case of key cache misses (see the Cassandra read path overview). By default it samples a subset of keys, somewhat similar to a skip list. The sampling interval is configurable with the min_index_interval and max_index_interval CQL schema attributes (see DESCRIBE TABLE). For relatively large blobs like HTML pages we seem to get better read latencies by lowering the sampling interval from 128 min / 2048 max to 64 min / 512 max. For large tables like Parsoid HTML, with ~500G of load per node, this change adds a modest ~25 MB of off-heap memory.
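A sketch of applying this change with cqlsh (assuming cqlsh is available on the node; the keyspace and table are the ones from the cfstats example above):

# lower the index sampling interval for the Parsoid HTML data table
cqlsh -e "ALTER TABLE \"local_group_wikipedia_T_parsoid_html\".data
  WITH min_index_interval = 64 AND max_index_interval = 512;"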