Incidents/2019-06-06 wikibase

From Wikitech

Summary

A broken config change briefly took down wikis with Wikibase Repository enabled (chiefly Wikidata and Commons) on the canary hosts (10% of hosts).

Impact

  • Wikis: Wikidata, Wikimedia Commons, Test Wikidata, Test Wikimedia Commons
  • Functionality: all of MediaWiki, at least on Commons (assertion failure during initialization); unclear for Wikidata
  • Users: whoever got routed to the canary hosts (10% of hosts)
  • Duration: ca. 10 minutes

See also T225212 for an overview of the errors that occurred; since it’s not yet known where the “undefined variable” errors came from nor what effect they had, the impact on Wikidata is unclear.

Detection

scap automatically aborted the sync after detecting the high error rate on the canaries. A revert was manually created and synced afterwards.
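The check that aborted the sync compares per-canary error rates from Logstash before and after the deploy. The following is a minimal, hypothetical Python sketch of that kind of comparison; the function names, the 10x factor, and the majority rule are assumptions modeled on the "8/11 canaries increased by 10x" failure message, not Scap's actual implementation:

```python
def canaries_over_threshold(before, after, increase_factor=10.0):
    """Return the canary hosts whose error rate grew by more than
    `increase_factor` after the sync.

    `before` and `after` map hostname -> errors/minute, as might be
    obtained from a Logstash query over a fixed time window.
    (Hypothetical sketch; not Scap's real API.)
    """
    failed = []
    for host, baseline in before.items():
        current = after.get(host, 0.0)
        # Clamp the baseline so hosts with a near-zero error rate
        # don't trip the check on a single stray error.
        if current > max(baseline, 0.1) * increase_factor:
            failed.append(host)
    return failed

before = {"mw1": 1.0, "mw2": 1.2, "mw3": 0.9}
after = {"mw1": 15.0, "mw2": 14.0, "mw3": 1.0}
bad = canaries_over_threshold(before, after)
# Abort the sync if a majority of canaries regressed.
abort = len(bad) / len(before) > 0.5
```

In this incident the real check reported 97% of sampled requests over threshold on 8 of 11 canaries, so the equivalent of `abort` was true and the full sync never started.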

Timeline

All times in UTC.

  • 11:54:38 scap finished sync-pull-masters and started sync-check-canaries
  • 11:54:39 scap finished Canaries Synced OUTAGE BEGINS
  • 11:54:59 scap check 'Logstash error rate for mw….eqiad.wmnet' failed, 97% OVER_THRESHOLD
  • 11:54:59 scap failed: average error rate on 8/11 canaries increased by 10x
  • 11:58:48 revert committed on deploy1001
  • 12:01 revert uploaded to Gerrit
  • 12:04:22 scap finished Canaries Synced OUTAGE ENDS
  • 12:04:50 scap finished

Conclusions

What went well?

  • The canaries detected the error just as they’re supposed to, and full deployment was stopped.

What went poorly?

  • The error was not discovered during testing on mwdebug1002.
  • The deployer (Lucas Werkmeister (WMDE)) was not aware that the canary hosts would not be fixed until the second sync, and therefore delayed the fix by moving the commit through Gerrit first.

Where did we get lucky?

  • The errors did not cause the monitoring requests (from Swagger and from PyBal) to fail, because those only target enwiki. Instead, the problem was caught by the part of the Scap canary checker that queries Logstash for a change in overall error levels.
  • The rate of errors from these two wikis was high enough to stand out from the normal background fluctuation in error rates, which allowed Scap's Logstash query to detect the difference.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

  • Clarify on SWAT deploys/Deployers that "push code live BEFORE pushing patches to Gerrit" still applies even if the scap sync was automatically aborted (done)
  • Enable scap to automatically roll back changes in MediaWiki (according to Reedy it already supports this for “services and stuff”) (T225207)
  • Investigate the errors and try to deploy the change again once they’re fixed (T225212)