Incidents/2019-06-06 wikibase

From Wikitech

Summary

A broken config change briefly took down wikis with Wikibase Repository enabled (chiefly Wikidata and Commons) on the canary hosts (10% of hosts).

Impact

  • Wikis: Wikidata, Wikimedia Commons, Test Wikidata, Test Wikimedia Commons
  • Functionality: all of MediaWiki, at least on Commons (assertion failure during initialization); unclear for Wikidata
  • Users: whoever got routed to the canary hosts (10% of hosts)
  • Duration: ca. 10 minutes

See also T225212 for an overview of the errors that occurred; since it’s not yet known where the “undefined variable” errors came from nor what effect they had, the impact on Wikidata is unclear.

Detection

scap automatically aborted the sync after detecting the high error rate on the canaries. A revert was manually created and synced afterwards.
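The check that aborted the sync compares per-canary error rates from Logstash before and after the deploy. The following is a minimal, hypothetical Python sketch of that kind of comparison; the function names, the 10x factor, and the majority rule are assumptions modeled on the "8/11 canaries increased by 10x" failure message, not Scap's actual implementation:

```python
def canaries_over_threshold(before, after, increase_factor=10.0):
    """Return the canary hosts whose error rate grew by more than
    `increase_factor` after the sync.

    `before` and `after` map hostname -> errors/minute, as might be
    obtained from a Logstash query over a fixed time window.
    (Hypothetical sketch; not Scap's real API.)
    """
    failed = []
    for host, baseline in before.items():
        current = after.get(host, 0.0)
        # Clamp the baseline so hosts with a near-zero error rate
        # don't trip the check on a single stray error.
        if current > max(baseline, 0.1) * increase_factor:
            failed.append(host)
    return failed

before = {"mw1": 1.0, "mw2": 1.2, "mw3": 0.9}
after = {"mw1": 15.0, "mw2": 14.0, "mw3": 1.0}
bad = canaries_over_threshold(before, after)
# Abort the sync if a majority of canaries regressed.
abort = len(bad) / len(before) > 0.5
```

In this incident the real check reported 97% of sampled requests over threshold on 8 of 11 canaries, so the equivalent of `abort` was true and the full sync never started.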

Timeline

All times in UTC.

  • 11:54:38 scap finished sync-pull-masters and started sync-check-canaries
  • 11:54:39 scap finished Canaries Synced OUTAGE BEGINS
  • 11:54:59 scap check 'Logstash error rate for mw….eqiad.wmnet' failed, 97% OVER_THRESHOLD
  • 11:54:59 scap failed: average error rate on 8/11 canaries increased by 10x
  • 11:58:48 revert committed on deploy1001
  • 12:01 revert uploaded to Gerrit
  • 12:04:22 scap finished Canaries Synced OUTAGE ENDS
  • 12:04:50 scap finished

Conclusions

What went well?

  • The canaries detected the error just as they’re supposed to, and full deployment was stopped.

What went poorly?

  • The error was not discovered during testing on mwdebug1002.
  • The deployer (Lucas Werkmeister (WMDE)) was not aware that the canary hosts would not be fixed until the second sync, and therefore delayed the fix by moving the commit through Gerrit first.

Where did we get lucky?

  • The errors did not cause the monitoring requests (from Swagger and from PyBal) to fail, because those only target enwiki. Instead, the problem was caught by the part of the Scap canary checker that queries Logstash for a change in overall error levels.
  • The rate of errors from these two wikis was high enough to stand out from the normal background fluctuation in error rates, which allowed Scap's Logstash query to detect the difference.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

  • Clarify on SWAT deploys/Deployers that "push code live BEFORE pushing patches to Gerrit" still applies even if the scap sync was automatically aborted (done)
  • Enable scap to automatically roll back changes in MediaWiki (according to Reedy it already supports this for “services and stuff”) (T225207)
  • Investigate the errors and try to deploy the change again once they’re fixed (T225212)