Incidents/20150429-Parsoid

Summary

An attempted Parsoid deploy at around 1:08 PT took the Parsoid cluster offline. We reverted the deploy shortly afterwards and the cluster came back online. The outage lasted about 6 minutes.

Timeline

  • 1:07 PT As part of the regularly scheduled code deploy, Arlo syncs code to the cluster and issues restarts
  • 1:08 PT Subbu notices load spikes on the first couple of nodes (in Ganglia)
  • 1:11 PT Noticing the Ganglia spike, we decide to revert the deploy
  • 1:12 PT Subbu syncs reverted code
  • 1:13 PT Service restarted
  • ~1:14 PT Cluster back to normal

Analysis

One of the action items from Incident documentation/20150424-Parsoid was to eliminate all mutation of the ParsoidConfig object, since it had been the source of two corruptions. https://gerrit.wikimedia.org/r/#/c/206487/ was submitted and merged to freeze the Parsoid config. In Beta Cluster and production (unlike rt-testing or dev), statsd metrics reporting is also enabled, and the statsd config hangs off the Parsoid config, so our deep freeze of the Parsoid config also froze the statsd config. However, the node-statsd library has an async step that caches the DNS host. Because that step runs asynchronously, the entire config was already frozen before the DNS host caching was triggered. When the caching in the library finally executed, the immutability freeze was in effect, which raised an exception and prevented Parsoid from starting up on deploy.
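
The failure mode described above can be reproduced in a few lines of JavaScript. The sketch below is illustrative only: `deepFreeze`, `FakeMetricsClient`, and the config field names are hypothetical stand-ins, not Parsoid's or node-statsd's actual code. It assumes a client that writes a resolved host back onto the options object it was given, on a later tick. `startWithPrivateCopy` shows one possible way to avoid the conflict, by handing the client its own unfrozen copy; that is an assumption for illustration, not necessarily the fix that was applied.

```javascript
'use strict';

// Freeze an object and, recursively, every object reachable from it.
function deepFreeze(obj) {
  Object.freeze(obj);
  for (const value of Object.values(obj)) {
    if (value !== null && typeof value === 'object' && !Object.isFrozen(value)) {
      deepFreeze(value);
    }
  }
  return obj;
}

// Hypothetical stand-in for a metrics client that caches a resolved host
// on the options object it was handed, but only on a later tick.
class FakeMetricsClient {
  constructor(options) {
    this.options = options;
    // `ready` settles once the deferred host caching has been attempted.
    this.ready = new Promise((resolve, reject) => {
      setImmediate(() => {
        try {
          // Class bodies run in strict mode, so writing to a frozen
          // object throws a TypeError instead of failing silently.
          this.options.host = '127.0.0.1';
          resolve(this.options.host);
        } catch (err) {
          reject(err);
        }
      });
    });
  }
}

// The whole tree is frozen before the deferred write runs, so startup fails.
function startWithFrozenConfig() {
  const config = deepFreeze({ wiki: 'enwiki', metrics: { host: 'localhost', port: 8125 } });
  return new FakeMetricsClient(config.metrics).ready;
}

// The client gets a shallow copy it may mutate; the shared config stays frozen.
function startWithPrivateCopy() {
  const config = deepFreeze({ wiki: 'enwiki', metrics: { host: 'localhost', port: 8125 } });
  return new FakeMetricsClient({ ...config.metrics }).ready;
}

startWithFrozenConfig().catch((err) => console.log('startup failed:', err.name));
startWithPrivateCopy().then((host) => console.log('started, cached host:', host));
```

Note that the exception surfaces only after the synchronous setup code has returned, which is why a test that merely constructs the config would not catch it.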

Everything was green in development and round-trip testing. About an hour before the deploy (11:53 PT), we merged the deploy branch patch, which updated the Beta Cluster. We should have detected this problem there, since Parsoid was completely down in Beta Cluster as well. Roan Kattouw reports that they got bug reports from QA people complaining that VE didn't work on beta, but they blamed that on other structural problems with beta, and the other failures they encountered masked the fact that Parsoid was down.

Clearly, it is insufficient to rely solely on green rt-testing runs: round-trip testing does not use all the same configs (logging, statsd metrics reporting, private wikis) as the production cluster.

Action items

  • It is insufficient to rely on indirect Parsoid testing on beta cluster (via VE, Flow, or other clients). Testing Parsoid there directly is hard right now because beta cluster Parsoid is not publicly accessible. Until that changes, instead of expecting to be notified of outages via VE (or Flow) QA testing, we need to verify Parsoid functionality via VE edits as part of the pre-deploy workflow. Status: Updated workflow to require pre-deploy VE edits. Longer term, we should figure out a way of automating the beta cluster tests so that the pre-deployment workflow does not become onerous.
  • We should figure out a way of automatically detecting a Parsoid outage in beta cluster.
  • We should find a solution for T92871, which would also have caught this problem early. This requires Parsoid and Release Engineering to figure out what makes sense in this scenario. Status: Work in progress as part of T92871
  • We should improve our deployment system to automatically perform sanity checks as part of the rolling deploy, and abort if a given % of nodes fail.
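
The last item can be sketched roughly as follows. This is a minimal illustration, not the actual deployment tooling: `rollingDeploy`, the `deployAndCheck` callback, and the `maxFailureFraction` parameter are all hypothetical names. It assumes the callback restarts one node and reports whether that node passed a sanity check.

```javascript
'use strict';

// Deploy to nodes one at a time and abort as soon as the share of failed
// nodes exceeds maxFailureFraction (a fraction of all nodes, e.g. 0.25).
// `deployAndCheck` is a hypothetical callback: it restarts one node and
// resolves to true if the node passes its sanity check.
async function rollingDeploy(nodes, deployAndCheck, maxFailureFraction) {
  const attempted = [];
  const failed = [];
  for (const node of nodes) {
    let healthy = false;
    try {
      healthy = await deployAndCheck(node);
    } catch (err) {
      healthy = false; // a thrown error counts as a failed sanity check
    }
    attempted.push(node);
    if (!healthy) {
      failed.push(node);
    }
    if (failed.length / nodes.length > maxFailureFraction) {
      // Stop here; the remaining nodes keep running the old code.
      return { aborted: true, attempted, failed };
    }
  }
  return { aborted: false, attempted, failed };
}
```

With a check of this kind, a build that fails to start on every node would stop the rollout after the first few restarts rather than taking down the whole cluster.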