Incidents/20160608-SCB
Summary
After a planned but uncoordinated Zookeeper restart, whose purpose was to increase the maximum number of allowed connections, the Zookeeper client in the ChangeProp service reconnected at an excessively high rate. ChangeProp also processed the backlog of Kafka events at an elevated rate, which led to a user-facing outage affecting services sharing the same hardware.
After checking the logs and discussing with several ops engineers in charge of traffic, we concluded that the user-facing impact observed in the logs was minimal because: a) the cache layer served most of the queries from non-logged-in users (~90% of total traffic); b) only certain, not all, logged-in users' actions were potentially affected (visual editor edits, graphoid users, etc.); c) mobile clients either used the MW API directly or fell back to it.
Timeline
12:59 - https://gerrit.wikimedia.org/r/293298 gets merged and all nodes in the Zookeeper cluster are restarted by Puppet.
13:09 - Icinga alert regarding scb services, affecting citoid, the mobile content service, graphoid, mathoid & cxserver.
13:18 - Powercycling scb1001
13:24 - Icinga recovery on scb services
13:26 - Powercycling scb1002
13:29 - Change-prop stopped on scb1001
13:30 - Change-prop stopped on scb1002
5xx response rates from RESTBase during the outage
Conclusions
- The node-kafka driver is not very robust. While this was known, we should accelerate the search for alternatives, possibly in the form of a decent binding to librdkafka.
- We need to be more thorough about testing failure-handling code. This outage revealed several issues in how the node-kafka driver handles connection errors, which would likely have been discoverable in a test environment. Tools like ChaosMonkey can help by injecting failures in test environments & production (see the sketch after this list).
- Isolation between services on shared service clusters is not sufficient. A misbehaving service can cause an outage for other services on the same hardware by consuming excessive RAM or CPU.
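To make the failure-testing point concrete, here is a minimal sketch in TypeScript of how connection-error handling can be exercised in a test: the connect function is injected, so a test can simulate repeated connection failures and assert that the client retries with bounded, backed-off attempts instead of reconnecting in a tight loop. ReconnectingClient and flakyConnect are hypothetical names for illustration only; this is not the actual change-prop or kafka-node code.

```typescript
// Minimal sketch of fault-injection testing for connection-error handling.
// ReconnectingClient and flakyConnect are hypothetical names; this is not the
// actual change-prop or kafka-node code.

type Connect = () => Promise<void>;

class ReconnectingClient {
  constructor(
    private connect: Connect,        // injected so a test can simulate failures
    private maxAttempts = 5,
    private baseDelayMs = 100,
  ) {}

  // Retry the injected connect() with exponential backoff, giving up after
  // maxAttempts instead of hammering the server in a tight loop.
  async start(): Promise<number> {
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        await this.connect();
        return attempt;              // connected on this attempt
      } catch (err) {
        if (attempt === this.maxAttempts) {
          throw new Error(`giving up after ${attempt} attempts: ${String(err)}`);
        }
        const delayMs = this.baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
    throw new Error('unreachable');
  }
}

// "Test": inject a connect() that fails three times before succeeding, then
// check that the client recovers without a reconnect storm.
async function testRecoversAfterTransientFailures(): Promise<void> {
  let calls = 0;
  const flakyConnect: Connect = async () => {
    calls++;
    if (calls <= 3) {
      throw new Error('connection refused');
    }
  };

  const client = new ReconnectingClient(flakyConnect, 5, 10);
  const attempts = await client.start();
  console.assert(attempts === 4, `expected success on attempt 4, got ${attempts}`);
  console.assert(calls === 4, `expected 4 connect calls, got ${calls}`);
  console.log('ok: client recovered after transient connection failures');
}

testRecoversAfterTransientFailures().catch((err) => console.error('test failed:', err));
```

The same idea scales up to ChaosMonkey-style fault injection against a staging cluster; the point is that retry behaviour becomes observable and assertable before it matters in production.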
Actionables
- Kafka-node: Rate-limit Zookeeper reconnect attempts in the current kafka client (Status: Done; see the connection sketch after this list)
- Zookeeper conf: limit the number of client connections to 1024 (Status: Done)
- Change Prop: limit message concurrency to 30, down from 100 (Status: Done; see the concurrency sketch after this list)
- Change Prop: start only 8 workers per node (Status: Done)
- Kafka-node: Avoid opening new Kafka connections on each Zookeeper connection retry (Status: Done; see the connection sketch after this list)
- Puppet: Avoid uncoordinated Zookeeper restarts (Status: Done; manual restarts for now)
- Improve isolation in the scb cluster. Use cgroups / containers?
- Replace the node-kafka client with a binding to librdkafka (Status: Todo)
- Create general guidelines & processes to ensure thorough fault testing of services, both pre-deployment & later in production. (Status: Todo)
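For the two kafka-node items above (rate-limiting Zookeeper reconnect attempts and not opening a new Kafka connection on every Zookeeper retry), the following TypeScript sketch shows the general shape of such a fix. The connectZookeeper and connectKafka functions are hypothetical stand-ins for driver internals, not the actual kafka-node code.

```typescript
// Sketch of the two kafka-node items above: Zookeeper reconnect attempts are
// spaced out by a minimum interval, and one Kafka connection is created lazily
// and reused instead of being reopened on every Zookeeper retry.
// connectZookeeper/connectKafka are hypothetical stand-ins for driver internals.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stubs standing in for the real connection logic; the Zookeeper stub fails
// randomly to simulate the flapping seen during the incident.
async function connectZookeeper(): Promise<{ host: string }> {
  if (Math.random() < 0.5) throw new Error('zookeeper not reachable');
  return { host: 'zk1:2181' };
}

async function connectKafka(): Promise<{ broker: string }> {
  console.log('opening Kafka connection');        // should happen at most once
  return { broker: 'kafka1:9092' };
}

class Client {
  private kafka: { broker: string } | null = null;
  private lastAttempt = 0;

  constructor(private minReconnectIntervalMs = 1000) {}

  // Reuse the Kafka connection across Zookeeper retries instead of opening a
  // new one each time the Zookeeper session flaps.
  private async getKafka(): Promise<{ broker: string }> {
    if (!this.kafka) {
      this.kafka = await connectKafka();
    }
    return this.kafka;
  }

  // Rate-limit Zookeeper reconnects: never attempt more often than once per
  // minReconnectIntervalMs, so a flapping session cannot turn into a storm.
  async reconnectZookeeper(): Promise<boolean> {
    const wait = this.lastAttempt + this.minReconnectIntervalMs - Date.now();
    if (wait > 0) {
      await sleep(wait);
    }
    this.lastAttempt = Date.now();
    try {
      const zk = await connectZookeeper();
      const kafka = await this.getKafka();        // existing connection is reused
      console.log(`connected to ${zk.host}, Kafka broker ${kafka.broker}`);
      return true;
    } catch (err) {
      console.log(`reconnect failed (${String(err)}); next attempt is rate-limited`);
      return false;
    }
  }
}

// Usage: keep retrying until connected; attempts are spaced at least
// minReconnectIntervalMs apart and the Kafka connection is opened at most once.
(async () => {
  const client = new Client(1000);
  while (!(await client.reconnectZookeeper())) {
    // loop; rate limiting happens inside reconnectZookeeper()
  }
})();
```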
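The change-prop concurrency limit amounts to a gate on how many events a worker processes in parallel. Below is a minimal, self-contained sketch of such a gate with the limit set to 30; processEvent and the backlog are hypothetical placeholders, and the actual change-prop configuration keys are not shown. The companion change of running only 8 workers per node is a deployment setting and is not sketched here.

```typescript
// Sketch of limiting message concurrency: at most CONCURRENCY events are
// processed in parallel; the rest of the backlog waits for a free slot.
// processEvent and the backlog below are hypothetical placeholders for the
// real change-prop handler and Kafka consumer.

const CONCURRENCY = 30;                // down from 100

async function processEvent(event: string): Promise<void> {
  // Placeholder for the real per-event work (HTTP requests, re-renders, ...).
  await new Promise((resolve) => setTimeout(resolve, 50));
  console.log(`processed ${event}`);
}

// Small concurrency gate: acquire() resolves only while fewer than `limit`
// tasks hold a slot; release() hands the slot to the next waiter, if any.
class Gate {
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(private limit: number) {}

  async acquire(): Promise<void> {
    if (this.inUse < this.limit) {
      this.inUse++;
      return;
    }
    // Wait for a running task to hand its slot over in release().
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();                          // slot passes directly to the next waiter
    } else {
      this.inUse--;
    }
  }
}

// Usage: drain a backlog without letting more than CONCURRENCY events be in
// flight at once (during the incident a large backlog was processed with far
// too much parallelism).
(async () => {
  const gate = new Gate(CONCURRENCY);
  const backlog = Array.from({ length: 200 }, (_, i) => `event-${i}`);

  await Promise.all(
    backlog.map(async (event) => {
      await gate.acquire();
      try {
        await processEvent(event);
      } finally {
        gate.release();
      }
    }),
  );
})();
```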