Cassandra/Hardware

This information is outdated.

Instance sizing

Cassandra instance size is limited due to JVM GC and some architectural limitations^[1]. With the default CMS collector, the recommendation is to use heap sizes no larger than 8G in order to maintain reasonable GC pause times. With G1GC, we have been able to push it 16G without falling over, but the metrics are showing occasional timeout spikes coinciding with larger (5s) GC spikes.

The Cassandra documentation recommends: "Most workloads work best with a capacity under 500GB to 1TB per node depending on I/O. Maximum recommended capacity for Cassandra 1.2 and later is 3 to 5TB per node for uncompressed data." With our current compression levels of around 17%, this translates to no more than 500-850G of compressed storage per node.

This fits with our observations in practice. We started to see some (~a dozen per day) request timeouts at instance sizes of about 600G (compressed). At around 1.2-1.5T compressed storage per node (7-9T raw data), we additionally observed

mean latencies growing by an order of magnitude
timeout rates occasionally bursting into the double-digit percent of requests
nodes crashing from OOM
compaction falling behind, leading to large read latencies, retry rates and larger storage size
failure to bootstrap new nodes
metrics reporting stopping after a few hours of operation

Mapping of instances to hardware

Our current hardware is underutilized from a memory and cpu perspective, and while it may be possible to achieve higher node storage densities for our workload, it is clear that we won't be able to utilize the full 3T with a single Cassandra instance while maintaining acceptable latencies and reliability.

At a maximum instance size of about 600G compressed data, provisioning multiple Cassandra instances per hardware node (via systemd units, containers, or virtualization) will be needed for reasonable utilization.

Hardware characteristics

Chassis and hardware

ideally 8x 2.5" SSD; no hot-swap needed
extras like redundant power not necessarily needed, redundancy is achieved through replication
1G ethernet for single instance, 10G for multi-instance box
JBOD configuration; no HW RAID needed. For larger hardware nodes we could consider RAID-5 for lower recovery times after a single SSD failure.

SSDs

About 1T of raw storage per instance, with a target utilization of no more than 600G.

Cassandra writes in large sequential chunks, which means that write amplification is minimal. Write rates are moderate at ~10M/s per disk, which means that consumer SSDs with an endurance of 2000+ erase cycles will theoretically last about 6.5 years. Once we increase the cluster size, write rates will further decrease and life expectancy accordingly increase.

Candidates:

Samsung 850 Pro: Used in the current Eqiad cluster. About $490 at time of writing.
Samsung 850 Evo: Same flash as the Pro, but smaller processor. Same endurance rating. About $380 at time of writing.

CPU and memory, per instance

4+ cores
min 16G RAM

eqiad RESTBase Cluster

RESTBase was initially deployed in a single data-center configuration to 6 dual-8-core machines with 64GB of memory, and 3T of SSD storage. Cassandra is configured for a replication count of 3, and is using LeveledCompactionStrategy.

Where we are (as of May 2015)

Deployed across 3 data-center rows, (using NetworkTopologyStrategy, 2 nodes per row)
~6T total storage (compressed, includes replication overhead), growing at a rate of ~60GB/day (~1T/node)
Request latencies are nominal, ~100 request timeouts / 24 hours
Cassandra is utilizing ~18G of RAM (RSS)
CPU utilization is typically below 20%

Where we are going / what we expect^[2]

~54T total storage anticipated in the coming year
1 additional data-center, replication factor 3 (54T total compressed storage)
The capacity to maintain service-levels in the face of a data-center row (=cassandra rack) outage

References

↑ The most important architectural limitation is probably the depth of log-structured merge trees. Large data sets with leveled compaction tend to use 5 sstable levels, which means that all data is written at least five times until it is fully merged into the lowest level. Tables with less than 200G per instance tend to get by with four levels instead. Really small tables with less than ~100G per instance only need three levels. Additionally, compaction parallelism is limited by the requirement to process non-overlapping key ranges at each level. This means that overlapping compactions at different levels often block each other, which can result in only a single CPU-bound thread working on a large table, and compaction falling behind as a result.
↑ RESTBase capacity planning for 2015/16

[1] The most important architectural limitation is probably the depth of log-structured merge trees. Large data sets with leveled compaction tend to use 5 sstable levels, which means that all data is written at least five times until it is fully merged into the lowest level. Tables with less than 200G per instance tend to get by with four levels instead. Really small tables with less than ~100G per instance only need three levels. Additionally, compaction parallelism is limited by the requirement to process non-overlapping key ranges at each level. This means that overlapping compactions at different levels often block each other, which can result in only a single CPU-bound thread working on a large table, and compaction falling behind as a result.

[T97692-2] RESTBase capacity planning for 2015/16

[1]

[2]