

On May 31, 2016, a Cassandra upgrade of restbase1007.eqiad.wmnet was performed (from 2.1.13 to 2.2.6).  Despite careful preparation, the upgrade did not go as planned.



All 3 instances of restbase1007 were upgraded from 2.1.13 to 2.2.6.  Afterward it became apparent that Cassandra metrics reporting had ceased: an unhandled exception from Cassandra was propagating up to cassandra-metrics-collector via JMX.

While there were no Cassandra metrics at this time, host-based metrics indicated a more than 10x increase in disk read throughput (from ~100MBps to over 1GBps), a 3x increase in network throughput, and an increase in RESTBase HTML revision request storage latency (99p) from ~280ms to at times over 2s.


The source of the unhandled exceptions was narrowed down to a histogram introduced in Cassandra 2.2, ColUpdateTimeDeltaHistogram (see CASSANDRA-11931), and a patched version of cassandra-metrics-collector was live-hacked into place, restoring Cassandra metrics.

With Cassandra metrics restored, it became obvious that columnfamily read latency had increased from ~40ms to an oddly consistent 400ms[1] (1007-c), and that the 1007 instances (1007-c in particular) seemed to be shouldering the majority of client requests.


The CQL native transport port was briefly disabled on 1007-c to gauge the effect.  During this period, the additional traffic to 1007-c shifted to 1007-{a,b}.
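For reference, a sketch of how the native transport can be toggled on a single instance, assuming the standard nodetool commands (disablebinary/enablebinary exist in this Cassandra series); the JMX host and port values below are illustrative placeholders, not the production ones:

```shell
# Stop accepting new CQL client connections on this instance
nodetool -h localhost -p 7199 disablebinary
# ... observe where client traffic shifts ...
# Re-enable the native transport
nodetool -h localhost -p 7199 enablebinary
```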


The CQL native transport was disabled on 1007-{a,b,c}.  RESTBase service failures across the cluster ensued, requiring RESTBase restarts to correct.  With client ports shut down and RESTBase restarted, client traffic rebalanced across the cluster.  Utilization on 1007 improved dramatically, with network throughput and CPU utilization normalizing.  Only disk read throughput remained aberrant, dropping somewhat from ~1.5GBps to ~1GBps (still many times the norm).

Investigations continued into the evening, but with the node somewhat stable and RESTBase 99p latencies under control, it was left in this state overnight.


Largely on a hunch, disk_access_mode was live-hacked to mmap_index_only (it defaults to auto) and all instances were restarted.  In our environment, where all data at rest is compressed, this should have the effect of disabling mmap access for decompression reads only, mmapped reads of compressed data having been newly introduced in Cassandra 2.2.  With this setting in place, disk read throughput figures normalized immediately.
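A minimal sketch of the change as it would appear in cassandra.yaml (the yaml name for this setting is disk_access_mode):

```yaml
# cassandra.yaml (excerpt)
# "auto" (the default) mmaps index and data files; in 2.2 this extends to
# reads of compressed data. mmap_index_only restricts mmap to index files,
# so compressed data reads fall back to standard buffered I/O.
disk_access_mode: mmap_index_only
```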

During the restart of the instances, the binary ports were open momentarily, long enough for some RESTBase instances to pick them up, resulting in another round of service failures (and subsequent RESTBase restarts) once the ports were disabled again.


With the instances in an otherwise normal state, the binary protocol ports were re-enabled, and the previously observed imbalance of client traffic returned.

Upon further investigation it was determined that, in the absence of an explicitly configured protocol version, the Node.js driver defaults to version 4 (the default for Cassandra 2.2.6 is v4; for 2.1.13 it is v3).  When the driver connects to its first contact point, it attempts to use this version; if that succeeds, the same version is used for all subsequent connections, and contact points that do not support it are ignored.  Since restbase1007 sorts first among our nodes, v4 became the de facto version everywhere but on 1007 itself (each node ignores itself as a contact point).  This explained both the imbalance of client traffic and the service failures when 1007 was restarted (its instances were the only ones in the connection pools).
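The failure mode above can be modeled in a few lines.  This is an illustrative sketch of the negotiation behavior, not actual driver code; the peer hostnames other than restbase1007 are hypothetical, and maxProtocol here stands in for the highest native protocol version each node accepts:

```javascript
// Model: the driver tries its default max protocol version against the
// first (alphabetically sorted) contact point; if that node accepts it,
// the version sticks, and peers that only speak an older version are
// dropped from the connection pool.
function negotiate(contactPoints, driverMax) {
  const sorted = [...contactPoints].sort((a, b) =>
    a.host.localeCompare(b.host));                 // restbase1007 sorts first
  const version = Math.min(driverMax, sorted[0].maxProtocol);
  // Only peers supporting the negotiated version stay in the pool.
  const pool = sorted
    .filter(p => p.maxProtocol >= version)
    .map(p => p.host);
  return { version, pool };
}

const peers = [
  { host: 'restbase1007-a', maxProtocol: 4 },  // upgraded to 2.2.6 (v4)
  { host: 'restbase1008-a', maxProtocol: 3 },  // hypothetical 2.1.13 peer (v3)
  { host: 'restbase1009-a', maxProtocol: 3 },  // hypothetical 2.1.13 peer (v3)
];
negotiate(peers, 4);
// yields version 4 with a pool of just ['restbase1007-a']:
// every client that sees 1007 first pins all its traffic to it
```

With driverMax capped at 3 instead, all three peers remain in the pool, which is exactly the remediation described below.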


A change was deployed to set maxVersion to 3 for Cassandra client connections, and client traffic was redistributed evenly across the cluster.
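A hedged sketch of what such a change looks like, assuming the DataStax Node.js driver's protocolOptions.maxVersion client option; the contact point below is an illustrative placeholder, not the production configuration:

```javascript
// Client options pinning the native protocol to v3 until every node
// in the cluster has been upgraded and speaks v4.
const clientOptions = {
  contactPoints: ['restbase1001.eqiad.wmnet'],  // placeholder
  protocolOptions: {
    maxVersion: 3  // cap negotiation at v3 during the mixed-version window
  }
};
// const client = new cassandra.Client(clientOptions);
```

Once the whole cluster runs 2.2.x, the cap can be lifted and the driver will negotiate v4 everywhere.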


There were three separate issues here that slipped by undetected during testing.  In the cases of the broken histogram and the aberrant disk read throughput, a production workload and/or production access patterns were required for the problems to manifest.

The issue centering on driver protocol version selection could have been caught if the same upgrade workflow had been applied to staging as was planned for production (i.e. if more complete testing had been carried out after the canary upgrade, but before the full cluster upgrade).


  • Establish staging and test methodologies that more accurately reflect the production environment (T136340)
  • Establish a document of best practices for future upgrades, including (but not limited to):
    • A matching (step-for-step) upgrade process during testing (test the upgrade process as well as the software)
    • A checklist that includes evaluating native protocol disparities, and updating driver options in advance
    • Establishing, ahead of time, rigorous go/no-go gating criteria and a rollback strategy
    • Canary upgrades to the stand-by DC, possibly combined with a controlled temporary switch over
  • Continue with investigation of aberrant disk read throughput
    • Attempt to reproduce in staging by generating queries over a larger distribution of records (ex: multiple dump script runs from different title offsets). If successful, integrate this process into standardized testing.
  • Follow up on lack of recency bias in histogram metrics (CASSANDRA-11752)