Incidents/2019-12-04 MediaWiki

From Wikitech

document status: final

Summary

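The rollout of wmf.8 to group1 caused an outage: database errors spiked (T239877), and production group1 and group2 wikis became sluggish and eventually stopped working entirely. The outage lasted from 23:30 to 23:38 UTC and ended when the train was rolled back to group0.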

Detection

Initial indicators of the issue were picked up in logstash and via logspam-watch on mwlog1001. A large number of Icinga alerts followed.

It seems likely that the primary issue was obscured during the initial deploy by a focus on Parsoid errors.

Timeline

All times in UTC.

  • 20:12 brennen: Train wmf.8 roll forwards from group0 to group1 as well (try 1) [1]
  • 20:12 Large amounts of logspam noticed, especially from Parsoid/PHP, and Icinga issues many alerts.
  • 20:28 brennen: Train wmf.8 rolled back to just group0 [2]

[Fixes to exclude Parsoid/PHP]

  • 23:30 brennen: Train wmf.8 roll forwards from group0 to group1 as well (try 2) [3]
  • 23:30 OUTAGE BEGINS
  • 23:30 Large spike in database errors in logstash (T239877); shortly thereafter, a large number of Icinga alerts go off.
  • 23:30+ Production group1 and group2 wikis become noticeably sluggish and eventually stop working entirely.
  • 23:35 brennen: Attempted train wmf.8 roll back thwarted by canary failures [4]
  • 23:38 brennen: Train wmf.8 rolled back to just group0, again [5]
  • 23:38 OUTAGE ENDS

Links to relevant documentation