Incidents/2019-12-04 MediaWiki

document status: final

Summary

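The rollout of wmf.8 to group1 caused a production outage: group1 and group2 wikis became sluggish and eventually stopped working entirely until the train was rolled back to group0.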

Initial indicators of the issue were picked up in logstash and via logspam-watch on mwlog1001. A large number of Icinga alerts followed.

It seems likely that the primary issue was obscured during the initial deploy by a focus on Parsoid errors.

All times in UTC.

20:12 brennen: Train wmf.8 roll forwards from group0 to group1 as well (try 1) [1]
20:12 Large amounts of logspam noticed, especially from Parsoid/PHP; Icinga issues many alerts.
20:28 brennen: Train wmf.8 rolled back to just group0 [2]

[Fixes to exclude Parsoid/PHP]

23:30 brennen: Train wmf.8 roll forwards from group0 to group1 as well (try 2) [3]
23:30 OUTAGE BEGINS
23:30 Large spike in database errors in logstash (T239877); shortly thereafter a large number of Icinga alerts go off.
23:30+ Production group1 and group2 wikis become noticeably sluggish and eventually stop working entirely.
23:35 brennen: Attempted train wmf.8 roll back thwarted by canary failures [4]
23:38 brennen: Train wmf.8 rolled back to just group0, again [5]
23:38 OUTAGE ENDS