Incidents/20150519-RESTBase

Summary

The deployment of https://gerrit.wikimedia.org/r/#/c/198433/ brought down the RESTBase service for around 30 minutes, due to the parallel execution of Puppet agents across the RESTBase eqiad cluster. During this period, all of RESTBase's users were affected - VE, OCG, CXServer.

Timeline

2015-05-19 13:08: mobrovac thinks he disabled Puppet on restbase100x
2015-05-19 13:09: godog merges https://gerrit.wikimedia.org/r/#/c/198433/
2015-05-19 13:11: mobrovac notices RESTBase cannot start up on restbase1001 due to Cassandra time-outs in select queries
2015-05-19 13:16: mobrovac retries with a modified config (with simplified consistency), but no change
2015-05-19 13:20: icinga reports RESTBase's unavailability
2015-05-19 13:25: _joe_ joins the investigation
2015-05-19 13:36: mobrovac realises he didn't really disable Puppet, so all nodes were fighting to create the new keyspaces
2015-05-19 13:39: godog reverts https://gerrit.wikimedia.org/r/#/c/198433/ (in https://gerrit.wikimedia.org/r/211982)
2015-05-19 13:42: _joe_ runs and then disables Puppet on the nodes, RESTBase is now up
2015-05-19 14:11: mobrovac removes all of the offending keyspaces manually from Cassandra
2015-05-19 14:15: mobrovac starts one RESTBase instance manually which creates the keyspaces correctly
2015-05-19 14:33: _joe_ re-merges the patch (as https://gerrit.wikimedia.org/r/211993)
2015-05-19 14:36: _joe_ disables Puppet and runs it sequentially on each node, _joe_ and mobrovac monitor the progress
2015-05-19 14:43: the new configuration is applied on all nodes and RESTBase and Cassandra are up and running this new version

Conclusions

This particular patch required the creation of new keyspaces in Cassandra, which does not handle parallel keyspace and table creation well. As all of the RESTBase services were restarted automatically by Puppet almost at the same time, they were fighting for this creation process, effectively rendering the cluster unavailable. Thus, when there is such a code/config change, gradual application of changes, rolling restarts and periodic checks of the previous steps are imperative.

Actionables

Short-term:
- Do not allow Puppet to automatically restart RESTBase (https://gerrit.wikimedia.org/r/#/c/212016/)
- On a config-change deploy, wait for Puppet to install the new version, and then run a second RESTBase instance which brings Cassandra to the desired state
Long-term:
- Streamline service deployment, specifically automating the rolling change application and restart (bug T93428)