Incidents/2020-02-06 mediawiki

document status: in-review


Summary

Widespread service timeouts began immediately after moving group2 wikis to 1.35.0-wmf.18. Reverting to 1.35.0-wmf.16 immediately brought most services back to life.

A single root cause was responsible for two incidents: this one on Thursday the 6th and another on Friday the 7th. See Incident_documentation/20200207-wikidata for more details on the related Friday incident.

Impact

All sites were unresponsive or at least partially offline for ~8 minutes, except for cache hits (i.e. 'popular' pages served to non-logged-in users).

Detection

~100% of Icinga alerts fired simultaneously.

Timeline

All times in UTC.

  • ~20:22: start of the deploy of wmf.18 to group2 wikis
  • 20:24 Icinga starts alerting for High average GET latency for mw requests on appservers in eqiad and various other things
  • 20:25 scap completes: <twentyafterfour@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866
  • 20:25 Icinga starts alerting for restbase endpoint health on all restbase servers
  • 20:28 Icinga starts alerting for Varnish HTTP text-frontend health on various cp servers and Apache HTTP on mw servers
  • 20:29 Grafana is not available to European users (and SREs)
  • 20:30 (twentyafterfour@deploy1001) Scap failed!: 9/11 canaries failed their endpoint checks (http://en.wikipedia.org)
  • 20:30 --force is used to revert the MediaWiki deploy: <twentyafterfour> sync-wikiversions --force
  • 20:31 Icinga recoveries start coming in for Varnish HTTP text-frontend, PHP7 rendering on appservers, etc, but Restbase alerts remain
  • 20:45 Restbase is now only reporting errors from wikifeeds (in k8s); restbase is restarted on restbase1016 and restbase1027, but this does not help recovery
  • 20:47 akosiaris kills all pods for wikifeeds in eqiad; they were stuck in a throttled-CPU downward spiral. https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds&fullscreen&panelId=28&from=1581020337234&to=1581022780963
  • 20:52 Restbase alerts recover

Conclusions

The root cause was a typo in a config setting for Wikibase.

On Jan 16 [1], config change 565074 was deployed. It would have caused all Wikibase client wikis to read items from the new wb terms store, but it did not take effect because of a typo in the setting name elsewhere in the config files. As a result, the default setting in the Wikibase extension remained in effect, with all clients reading from the old store.

On Jan 22 the default setting in the Wikibase extension was changed so that all clients read from the new store. Because of All-Hands, this did not make it into a branch until wmf.18. It went live to group0 on Feb 4th and group1 on Feb 5th; this would have impacted Commons and Wikidata, but they perform very few Wikibase client reads compared to the bulk of the wikis. The change went live to group2 on Feb 6th, when we saw the outage.
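
Below is a minimal sketch (in Python, purely for illustration) of the failure mode described above: a misspelled config key is silently ignored, the default shipped with the extension takes effect instead, and any change to that default then rolls out to every wiki on the branch at once. The setting names and values here are hypothetical, not the actual Wikibase or MediaWiki configuration.

    # Hypothetical names and values; not the real Wikibase settings.
    EXTENSION_DEFAULTS = {
        # Default shipped with the extension; in this scenario it was
        # flipped in the wmf.18 branch.
        "readFromNewTermStore": True,
    }

    SITE_CONFIG = {
        # The operators intended to control the rollout here, but the key
        # is misspelled, so it never overrides the extension default.
        "readFromNewTermsStore": False,  # note the stray "s"
    }

    def effective_setting(name):
        """Site config wins only when the key matches exactly; otherwise the extension default applies."""
        return SITE_CONFIG.get(name, EXTENSION_DEFAULTS[name])

    # The typoed override is silently ignored, so the extension default is
    # what actually takes effect on every wiki when the branch rolls out.
    print(effective_setting("readFromNewTermStore"))  # -> True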

A similar but smaller-scale incident occurred the next day, while attempting to fix that config variable typo. Friday's incident report: Incident documentation/20200207-wikidata.

The similarity of the two incidents (onset, type of impact, graphs [2], [3]) led us to conclude that Thursday's incident shares the root cause of Friday's. Friday's incident could be narrowed down to the specific config change because that single change was the only thing deployed at the time, unlike Thursday, when a larger batch of changes was being rolled out as part of the routine MediaWiki train.

The Monday rollout of wmf.18 to group2 without incident confirms this understanding.

What went well?

  • deployer immediately noticed the deploy was bad and started reverting
  • monitoring detected the incident; many additional people were paged and came online quickly

What went poorly?

  • canary checks did not block the bad deploy
  • scap revert could have been quicker; canary checks initially blocked the revert, which had to be forced (see the sketch after this list)
  • wikifeeds pods had to be killed to fix restbase
  • after this incident, the underlying cause was not at all clear because (as is usual) many changes were rolled out together in the branch deploy
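
As a rough illustration of the canary behaviour noted in the list above, here is a minimal Python sketch of a sync gate that aborts when too many canaries fail their endpoint checks. Because the same gate applies to every sync, a revert attempted while the canaries are still unhealthy is blocked as well and has to be forced. This is not scap's actual implementation; the threshold and function names are assumptions.

    # Illustrative only; thresholds and names are assumptions, not scap internals.
    FAILURE_THRESHOLD = 0.5  # hypothetical: abort if more than half the canaries fail

    def canaries_healthy(results):
        """results is a list of booleans, one per canary endpoint check."""
        failed = results.count(False)
        return failed / len(results) <= FAILURE_THRESHOLD

    def sync_wikiversions(results, force=False):
        # The gate runs for every sync, so it also blocks a revert while the
        # canaries are still failing, unless the sync is forced.
        if not canaries_healthy(results) and not force:
            return "aborted: canary endpoint checks failed"
        return "synchronized"

    # During this incident, 9 of 11 canaries failed their endpoint checks:
    results = [False] * 9 + [True] * 2
    print(sync_wikiversions(results))              # aborted: canary endpoint checks failed
    print(sync_wikiversions(results, force=True))  # synchronized (the forced revert)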

Where did we get lucky?

  • reverting fixed most of the issues; apart from the wikifeeds pods, no additional service restarts were needed

How many people were involved in the remediation?

  • about 8 SREs, 2 DBAs, 1 RelEng, and 2 Performance engineers

Links to relevant documentation

Actionables