Incident documentation/20160212-AllWikisOutage

From Wikitech

Summary

While syncing files to backport a logging enhancement to MediaWiki 1.27.0-wmf.13, the changes were propagated in the wrong order. This resulted in HHVM fatal errors of the form

Call to undefined method MediaWiki\Session\SessionManager::checkIpLimits() in /srv/mediawiki/php-1.27.0-wmf.13/includes/Setup.php on line 812

for all requests to all wikis until the updated version of php-1.27.0-wmf.13/includes/session/SessionManager.php was synced to the cluster. The outage lasted approximately 2.5 minutes, from 2016-02-12T19:13 to 2016-02-12T19:16.

Timeline

[18:30:05] <jouncebot>	 bd808 tgr anomie: Dear anthropoid, the time has come. Please deploy Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830).
...
[18:37:20] <bd808>	 Krenair: all clear on mira?
[18:37:22] <Krenair>	 bd808, yep
...
[19:12:34] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/DefaultSettings.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 16s)
[19:12:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:09] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/Setup.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 18s)
[19:14:34] <paladox>	 wikipedia has gone down for me https://en.wikipedia.org/
[19:14:36] <bd808>	 shit. synced in wrong order
[19:14:41] <paladox>	 Request from 10.20.0.104 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 1730353932
[19:14:41] <paladox>	 Forwarded for: 81.140.246.2, 10.20.0.104, 10.20.0.104, 10.20.0.104
[19:14:41] <paladox>	 Error: 503, Service Unavailable at Fri, 12 Feb 2016 19:14:22 GMT 
[19:14:44] <sjoerddebruin>	 503's yep
[19:14:47] <apergos>	 wikitech empty main page. er?
[19:14:48] <bd808>	 will be fixed in 2 minutes
[19:14:49] <apergos>	 anyways
[19:15:04] <gwicke>	 uh oh, api is throwing lots of 503s
[19:15:12] <bd808>	 !log Synced files for T125455 in wrong order; broke all wikis
[19:15:26] <bd808>	 the fix is syncing now :/
[19:15:44] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/session/SessionManager.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (T125455) (duration: 01m 17s)
[19:15:47] <bd808>	 better?
[19:15:58] <gwicke>	 bd808: back for me
[19:16:17] <paladox>	 its back up now.
[19:16:26] <paladox>	 Thanks for fixing the problem.
[19:16:28] <bd808>	 sorry everyone. brain fart from me
[19:16:35] <Krenair>	 woah
[19:16:39] <gwicke>	 we really ought to stop breaking everything at once
[19:16:55] <bd808>	 !log Wikis back up thankfully
[19:16:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master

Conclusions

  • Entirely operator error. The updated Setup.php calls SessionManager::checkIpLimits(), which exists only in the updated SessionManager.php. The deployer should have understood how the changes were interrelated and synced SessionManager.php before Setup.php.
  • Having the sync-file statements prepared ahead of time in a text document allowed quick action to sync the missing file.
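
The ordering constraint behind this outage can be expressed as a small dependency graph: a file that calls new code must be synced after the file that defines it. The following is a minimal sketch in Python, not part of the actual deployment tooling. The function name sync_order and the dependency map are hypothetical. Only the Setup.php to SessionManager.php dependency comes from this incident; treating DefaultSettings.php as independent is an assumption made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sync_order(depends_on):
    """Return files ordered so each file follows everything it depends on.

    depends_on maps a file to the set of files that must be synced first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map for the backport in this incident.
deps = {
    # Setup.php calls SessionManager::checkIpLimits(), so it goes last.
    "includes/Setup.php": {"includes/session/SessionManager.php"},
    "includes/session/SessionManager.php": set(),
    # Assumption: no ordering constraint on DefaultSettings.php.
    "includes/DefaultSettings.php": set(),
}

order = sync_order(deps)
for path in order:
    print(path)
```

Files with no constraint between them may come out in any relative order. A circular dependency raises graphlib.CycleError, which would indicate that the files cannot safely be synced one at a time.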

Actionables

  • Use a less risky deployment process. Except in emergencies, always deploy to a canary first, then proceed with a rolling deploy. Ideally, have a mechanism that automatically detects errors and aborts an ongoing deploy. phab:T121597
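
The canary-then-rolling process described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the mechanism tracked in phab:T121597. The names staged_deploy, DeployAborted, push, error_rate and the 1% threshold are all hypothetical; a real deployment would supply its own way to push to a host and to measure the error rate.

```python
class DeployAborted(Exception):
    """Raised when a stage exceeds the allowed error rate."""


def staged_deploy(canaries, batches, push, error_rate, max_error_rate=0.01):
    """Deploy to canaries first, then batch by batch, aborting on errors.

    push(host) deploys to one host; error_rate(hosts) returns the observed
    error fraction (0.0 to 1.0) for those hosts. Both are supplied by the
    caller and are hypothetical stand-ins for real tooling.
    """
    deployed = []
    stages = [list(canaries)] + [list(batch) for batch in batches]
    for stage in stages:
        for host in stage:
            push(host)
            deployed.append(host)
        # Check health after each stage, before touching more hosts.
        rate = error_rate(stage)
        if rate > max_error_rate:
            raise DeployAborted(
                f"error rate {rate:.0%} after deploying to {stage}; "
                f"stopped with {len(deployed)} host(s) updated"
            )
    return deployed
```

With a check like this, a change that fails on every request would stop at the canary stage instead of reaching all wikis at once.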