List current compactions:
Check the compaction status of a particular keyspace:
nodetool -h restbase1005.eqiad.wmnet cfstats local_group_wikipedia_T_parsoid_html.data (...) SSTables in each level: [23/4, 52/10, 190/100, 1269/1000, 2817, 0, 0, 0, 0]
This means that level 0 (new data, potentially overlapping) has 23 sstables, when it really should only have four (the /4 is only shown if the actual number is above the threshold). This means that compactions for this keyspace are behind. Since level 0 sstables can overlap, this can cause a high number of sstables to be read for a query, which is bad for latency. See this stack exchange post for more background on leveled compaction.
concurrent_compactors to 10 to avoid compaction being CPU-bound. With leveled compaction there can be many concurrent compactions on the same keyspace; we observed 10 concurrent compactions for a single, large keyspace in prod.
Compaction throughput affects GC pressure and iowait. Generally, compaction throughput should leave enough headroom for request processing without backing up. This means that iowait should average at most 5% when compacting at full concurrency. In production (3x Samsung SSD RAID0), this is achieved with
compaction_throughput_mb_per_sec: 120. A good method to check this is to dynamically set a throughput using
for i in `seq 1 6`; do
nodetool -h restbase100$i.eqiad.wmnet setcompactionthroughput 120
The throughput seems to only apply to new compactions, so wait until all ongoing compactions listed in
nodetool compactionstats are done & replaced with new ones before starting to watch vmstat for iowait.
Read and write concurrency
Limiting the concurrency of read and write request processing is a major load limiting and -shedding knob in Cassandra.
concurrent_reads (ex: 18) has a limiting effect on heap pressure induced by read bursts. At 96, we saw nodes spiraling into OOM death after heavy reads, which in turn led to writes backing up.
concurrent_writes of 18 or below limits memory used during write bursts. Additionally, Cassandra relies on relatively low (2s) write timeouts to avoid OOM. Setting that timeout to 5s significantly increased OOM probability under load, at least in combination with a high
concurrent_writes setting of 96. It might be worth testing a low
concurrent_writes of 16-18 in combination with a longer write timeout.
SSTable indexing interval
Cassandra maintains index offsets per partition to speed up the lookup process in the case of key cache misses (see cassandra read path overview). By default it samples a subset of keys, somewhat similar to a skip list. The sampling interval is configurable with
max_index_interval CQL schema attributes (see
describe table). For relatively large blobs like HTML pages we seem to get better read latencies by lowering the sampling interval from 128 min / 2048 max to 64 min / 512 max. For large tables like parsoid HTML with ~500G load per node this change adds a modest ~25mb off-heap memory.