Data Platform/Systems/AQS/Scaling/2016/Hardware Refresh

Goals

Replace the current AQS Restbase cluster with a more performant one. The aqs100[123] nodes are replaced by aqs100[567], which are more powerful and equipped with SSDs.

AQS service

The Analytics Query Service is a Restbase service, most notably hosting the PageViews service. More info: Analytics/AQS

Current cluster

aqs100[123].eqiad.wmnet
Single Cassandra instance per node.
Old rotating disks.
Peaks at ~30rps
One year of pageviews data takes about 2TB per node
Cassandra 2.1.13

Problems

Throttled by the main Restbase cluster to avoid Cassandra read timeouts (see Analytics/AQS#Throttling).
Rotating disks cope poorly with read-heavy load (AQS serves more reads than writes).
Medium/long-term (2-3 years) storage will need multiple TB per host.

Traffic analytics

77.5% of the traffic is for Pageviews per article, 22.5% is for top views per project, and the remainder is for aggregate page views per project.
90% of the requests ask for a date range of at most 50 days.
80% of the per-article Pageviews requests end up as an HTTP 200 but a Varnish cache miss, while 50X responses account for less than 0.1% (though they still trigger our alarms). The fine granularity of per-article request variations explains the high miss rate, but there should be room for improvement.
93% of the top views per project requests end up as an HTTP 200 served from the Varnish cache (cache hit), thanks to the coarser data granularity compared to the per-article request variations.

Note: Please ask the Analytics team for access to the spreadsheet with the data if you want more information.

Requirements for the new cluster

At least the last 18 months of Pageviews data (~2TB per Cassandra instance) plus extra space for Cassandra housekeeping (mostly compactions).
The Restbase cluster should serve peaks of ~300 rps (one order of magnitude more than the current cluster).

Cluster overview

Overview of the new AQS cluster. Three physical nodes and six Cassandra instances (two for each node).

This configuration has been discussed in https://phabricator.wikimedia.org/T133785 by the Analytics and Services teams:

The new nodes have eight 1.6TB SSDs and 64GB of RAM, so the suggestion was to run two Cassandra instances on each node, for a cluster of six instances.
Each Cassandra instance takes care of one sixth of the data keyspace.
Each key is replicated three times, and the rack awareness configuration ensures that the replicas live on different physical nodes.
The Cassandra version deployed is 2.2.6. This version has been tested in production by the Services team, and the migration from 2.1.13 turned out to be difficult for live nodes (SSTable format changes, etc.).
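
The placement idea described above can be illustrated with a small sketch. The instance names (for example aqs1005-a), the interleaved ring order, and the clockwise walk are assumptions made for illustration only; they imitate the general idea of rack-aware replica placement and are not Cassandra's actual token allocation or this cluster's real configuration.

```python
# Hypothetical model: three physical nodes, two Cassandra instances each.
NODES = ["aqs1005", "aqs1006", "aqs1007"]

# Assumed ring order: instances of different nodes are interleaved.
RING = [f"{node}-{inst}" for inst in ("a", "b") for node in NODES]


def replicas_for(primary_index, rf=3):
    """Walk the ring from the primary instance and pick `rf` instances,
    skipping any instance whose physical node already holds a replica."""
    chosen = []
    used_nodes = set()
    for step in range(len(RING)):
        instance = RING[(primary_index + step) % len(RING)]
        node = instance.split("-")[0]
        if node not in used_nodes:
            chosen.append(instance)
            used_nodes.add(node)
        if len(chosen) == rf:
            break
    return chosen


# Replica set for the range owned by each of the six instances.
placement = {RING[i]: replicas_for(i) for i in range(len(RING))}

for primary, replicas in placement.items():
    print(primary, "->", replicas)
```

In this toy model every range ends up with exactly one replica on each physical node, which is the property the failure scenarios further down rely on.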

Disks layout

Each Cassandra instance will use a separate mountpoint to store SSTables - /srv/cassandra-(a|b).
Each Cassandra instance mountpoint will sit on a separate RAID10 array made of four disks. The root partition uses RAID10 too and is co-located with one of the two instance arrays.
All disks have been partitioned using partman, so they share an identical configuration.
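
As a rough capacity check for this layout, the arithmetic can be sketched as follows. The disk size, the four-disk arrays and the ~2TB data estimate come from this page; the helper name is hypothetical, and the size of the root partition and any filesystem overhead are unknown here and therefore ignored.

```python
DISK_GB = 1600          # eight 1.6TB SSDs per node
DISKS_PER_ARRAY = 4     # one RAID10 array per Cassandra instance
DATA_GB = 2000          # ~2TB of Pageviews data per instance (estimate)


def raid10_usable_gb(disks, disk_gb):
    """RAID10 mirrors every disk, so usable space is half of the raw space."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID10 needs an even number of disks, at least four")
    return disks // 2 * disk_gb


usable_gb = raid10_usable_gb(DISKS_PER_ARRAY, DISK_GB)
headroom_gb = usable_gb - DATA_GB

print(f"usable per instance: {usable_gb} GB")
print(f"headroom for compactions: {headroom_gb} GB")
```

Under these assumptions each instance gets 3200 GB of usable space, leaving about 1200 GB of headroom on top of the estimated data set (less on the array that also hosts the root partition).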

Overview of the new disk layout of each AQS node.

Load testing

Analytics/AQS/Scaling/LoadTesting

Cassandra consistency

Cassandra extends the concept of eventual consistency by letting the client choose the desired consistency level (Datastax docs). For example, with a replication factor of three and WRITE consistency set to QUORUM, a write is considered done only once a majority of the replicas have acknowledged it. A READ at consistency level ONE returns the value from the first replica that answers the query. This is an important gotcha to keep in mind when evaluating the failure scenarios.

Analytics currently uses WRITE QUORUM and READ ONE consistency levels.
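
To make the gotcha concrete, here is a minimal sketch of the replica arithmetic behind these levels. The function names are hypothetical; the only rule used is the standard one that a read is guaranteed to overlap the latest acknowledged write when the number of replicas contacted for writes plus the number contacted for reads exceeds the replication factor.

```python
def replicas_required(level, rf):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_latest_write(write_level, read_level, rf):
    """True when every read is guaranteed to touch at least one replica
    that acknowledged the most recent write (W + R > RF)."""
    w = replicas_required(write_level, rf)
    r = replicas_required(read_level, rf)
    return w + r > rf


RF = 3
print("WRITE QUORUM needs", replicas_required("QUORUM", RF), "replicas")
print("READ ONE needs", replicas_required("ONE", RF), "replica")
print("overlap guaranteed:", read_overlaps_latest_write("QUORUM", "ONE", RF))
```

With WRITE QUORUM and READ ONE at a replication factor of three, 2 + 1 is not greater than 3, so a read may briefly return stale data from the one replica that has not yet received the write.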

Read repair is also worth reading about. Quoting the docs:

"When data is read to satisfy a query and return a result, all replicas are queried for the data needed. The first replica node receives a direct read request and supplies the full data. The other nodes contacted receive a digest request and return a digest, or hash of the data. A digest is requested because generally the hash is smaller than the data itself. A comparison of the digests allows the coordinator to return the most up-to-date data to the query. If the digests are the same for enough replicas to meet the consistency level, the data is returned. If the consistency level of the read query is ALL, the comparison must be completed before the results are returned; otherwise for all lower consistency levels, it is done in the background.

The coordinator compares the digests, and if a mismatch is discovered, a request for the full data is sent to the mismatched nodes. The most current data found in a full data comparison is used to reconcile any inconsistent data on other replicas."

Failure scenarios

One disk failure

Since we only use RAID10 arrays, a single disk failure will not cause any relevant issue or service degradation.

Two disk failures

Worst case scenarios:

Two disks belonging to the same RAID10 array (other than the one hosting the root partition) could bring an entire Cassandra instance offline. The cluster will keep working and will still satisfy both READ and WRITE consistency levels, but under very high load we might see some performance degradation (especially in overall throughput).
Two disks belonging to the same RAID10 array hosting the root partition would bring down an entire node, reducing the cluster to two nodes (four instances). Since rack awareness guarantees that no two replicas of the same data live on the same node, we are reasonably sure to still achieve the WRITE and READ consistency levels. Performance under load could be severely impacted, but the cluster will keep working.

Three disk failures

Three disk failures should not put the cluster in more trouble than the worst case scenarios for two disk failures.

Four disk failures

Worst case scenario:

Two disk failures in each of two different RAID10 arrays containing a root partition. This would bring four Cassandra instances down and leave the cluster completely impaired (no READ/WRITE quorum achievable).

There are other scenarios (combinations of the ones listed above) in which the cluster would keep running even with four disk failures.

Combined failures

We need to keep in mind that other kinds of failure, such as a rack going down or a complete host breakage, could add up to one of the disk failure scenarios described above. One node down means two Cassandra instances down, so a combined failure would of course reduce the resilience to disk failures.
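
The whole-node scenarios above can be checked with a small sketch. It is a simplified model, not a description of how Cassandra decides availability: it assumes a replication factor of three with exactly one replica of every key per physical node (the rack awareness property), it only models complete node outages, and the function names are hypothetical.

```python
RF = 3  # one replica of every key on each physical node (assumed)
NODES = ("aqs1005", "aqs1006", "aqs1007")


def live_replicas(nodes_down):
    """Replicas still reachable for any key when whole nodes are down."""
    return RF - len(set(nodes_down) & set(NODES))


def replicas_required(level):
    return {"ONE": 1, "QUORUM": RF // 2 + 1, "ALL": RF}[level]


def achievable(level, nodes_down):
    """True if enough replicas are alive to satisfy the consistency level."""
    return live_replicas(nodes_down) >= replicas_required(level)


scenarios = {
    "all nodes up": [],
    "one node down": ["aqs1005"],
    "two nodes down": ["aqs1005", "aqs1006"],
}

for name, down in scenarios.items():
    print(
        f"{name}: QUORUM={achievable('QUORUM', down)} "
        f"ONE={achievable('ONE', down)}"
    )
```

In this model one node down still leaves two replicas, enough for QUORUM, while two nodes down leave a single replica per key, so QUORUM is lost. A level of ONE would nominally still find a live replica in that state, but the cluster should be treated as impaired as described above.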

Alternatives

We explored the possibility of using RAID0 arrays for each Cassandra instance, keeping RAID10 only for the root partitions. This would give us a lot more disk space, but it would also lower the overall resiliency of the cluster (with two disk failures we could be in serious trouble).
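
The trade-off can be quantified with a short sketch that enumerates disk failures within one four-disk array. It assumes that the RAID10 array behaves as two mirrored pairs striped together (the pair assignment below is hypothetical) and that a RAID0 array is lost as soon as any member disk fails.

```python
from itertools import combinations

DISKS = 4
DISK_GB = 1600
MIRROR_PAIRS = [(0, 1), (2, 3)]  # assumed RAID10 mirror pairs


def raid10_survives(failed):
    """RAID10 survives as long as no mirror pair has lost both disks."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in MIRROR_PAIRS)


def raid0_survives(failed):
    """RAID0 has no redundancy: any failed disk loses the array."""
    return len(failed) == 0


def survivable_fraction(survives, n_failures):
    """Fraction of all n-disk failure combinations the array survives."""
    combos = list(combinations(range(DISKS), n_failures))
    return sum(survives(c) for c in combos) / len(combos)


print("RAID0 usable GB:", DISKS * DISK_GB)
print("RAID10 usable GB:", DISKS // 2 * DISK_GB)
for n in (1, 2):
    print(
        f"{n} failed disk(s): "
        f"RAID10 survives {survivable_fraction(raid10_survives, n):.2f}, "
        f"RAID0 survives {survivable_fraction(raid0_survives, n):.2f}"
    )
```

Under these assumptions RAID0 doubles the usable space per instance but turns every single disk failure into the loss of a Cassandra instance, whereas RAID10 survives any single failure and four of the six possible two-disk failures within an array.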