Incidents/20160212-AllWikisOutage
Summary
While syncing files to backport a logging enhancement to MediaWiki 1.27.0-wmf.13, the changed files were synced in the wrong order: the updated Setup.php reached the cluster before the updated SessionManager.php that defines the method it calls. Every request to every wiki then failed with the HHVM fatal error
Call to undefined method MediaWiki\Session\SessionManager::checkIpLimits() in /srv/mediawiki/php-1.27.0-wmf.13/includes/Setup.php on line 812
until the updated version of php-1.27.0-wmf.13/includes/session/SessionManager.php was synced to the cluster. The outage lasted approximately 2.5 minutes, from 2016-02-12T19:13 to 2016-02-12T19:16.
Timeline
[18:30:05] <jouncebot> bd808 tgr anomie: Dear anthropoid, the time has come. Please deploy Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830).
...
[18:37:20] <bd808> Krenair: all clear on mira?
[18:37:22] <Krenair> bd808, yep
...
[19:12:34] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/DefaultSettings.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 16s)
[19:12:38] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:09] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/Setup.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 18s)
[19:14:34] <paladox> wikipedia has gone down for me https://en.wikipedia.org/
[19:14:36] <bd808> shit. synced in wrong order
[19:14:41] <paladox> Request from 10.20.0.104 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 1730353932
[19:14:41] <paladox> Forwarded for: 81.140.246.2, 10.20.0.104, 10.20.0.104, 10.20.0.104
[19:14:41] <paladox> Error: 503, Service Unavailable at Fri, 12 Feb 2016 19:14:22 GMT
[19:14:44] <sjoerddebruin> 503's yep
[19:14:47] <apergos> wikitech empty main page. er?
[19:14:48] <bd808> will be fixed in 2 minutes
[19:14:49] <apergos> anyways
[19:15:04] <gwicke> uh oh, api is throwing lots of 503s
[19:15:12] <bd808> !log Synced files for T125455 in wrong order; broke all wikis
[19:15:26] <bd808> the fix is syncing now :/
[19:15:44] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/session/SessionManager.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (T125455) (duration: 01m 17s)
[19:15:47] <bd808> better?
[19:15:58] <gwicke> bd808: back for me
[19:16:17] <paladox> its back up now.
[19:16:26] <paladox> Thanks for fixing the problem.
[19:16:28] <bd808> sorry everyone. brain fart from me
[19:16:35] <Krenair> woah
[19:16:39] <gwicke> we really ought to stop breaking everything at once
[19:16:55] <bd808> !log Wikis back up thankfully
[19:16:58] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
Conclusions
- Entirely operator error. The deployer should have understood how the changes were interrelated and performed the sync of SessionManager.php before Setup.php.
- Having the sync-file statements prepared ahead of time in a text document allowed quick action to sync the missing file.
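The ordering requirement in the first conclusion can be made explicit before any file is synced. The following is a minimal sketch, not part of the deployment tooling: the `DEPENDS_ON` map and the `sync_order` helper are hypothetical, and the dependency of Setup.php on SessionManager.php is taken from the fatal error above. It uses Python's standard `graphlib` module (Python 3.9+) to order the files so that each one is synced only after the files it depends on.

```python
from graphlib import TopologicalSorter


def sync_order(depends_on):
    """Return the files ordered so each file follows the files it depends on."""
    # TopologicalSorter takes a mapping of node -> predecessors and
    # yields predecessors before the nodes that need them.
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map for this backport (an assumption for
# illustration): Setup.php calls a method defined in SessionManager.php,
# so SessionManager.php has to land first.
DEPENDS_ON = {
    "includes/Setup.php": {"includes/session/SessionManager.php"},
    "includes/session/SessionManager.php": set(),
    "includes/DefaultSettings.php": set(),
}

order = sync_order(DEPENDS_ON)
for path in order:
    print(f"sync next: php-1.27.0-wmf.13/{path}")
```

Writing the dependencies down next to the prepared sync-file statements would turn the ordering into something that can be reviewed, rather than something the deployer has to remember under time pressure.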
Actionables
- Use a less risky deployment process. Except for emergencies, always deploy to a canary first, followed by a rolling deploy. Ideally, have a mechanism to automatically detect errors & abort an ongoing deploy. phab:T121597
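The process described in this actionable might look roughly like the sketch below. It is an illustration under stated assumptions, not the mechanism tracked in phab:T121597: `staged_deploy`, `DeployAborted`, the `sync_host` and `error_rate` callbacks, the host names, and the 1% threshold are all hypothetical.

```python
from typing import Callable, List, Sequence


class DeployAborted(Exception):
    """Raised when the observed error rate exceeds the allowed threshold."""


def staged_deploy(
    canaries: Sequence[str],
    hosts: Sequence[str],
    sync_host: Callable[[str], None],
    error_rate: Callable[[Sequence[str]], float],
    max_error_rate: float = 0.01,
    batch_size: int = 2,
) -> List[str]:
    """Sync canaries first, then the remaining hosts in batches.

    After each stage the error rate of the hosts just synced is checked,
    and the deploy stops before touching any further hosts if it is too high.
    """
    synced: List[str] = []
    # The canaries form the first stage; the rest follow in fixed-size batches.
    stages = [list(canaries)] + [
        list(hosts[i:i + batch_size]) for i in range(0, len(hosts), batch_size)
    ]
    for stage in stages:
        for host in stage:
            sync_host(host)
            synced.append(host)
        rate = error_rate(stage)
        if rate > max_error_rate:
            raise DeployAborted(
                f"error rate {rate:.0%} on {stage}; stopped after {len(synced)} host(s)"
            )
    return synced


# Simulate a change that breaks every host it reaches (hypothetical hosts).
touched: List[str] = []
try:
    staged_deploy(
        canaries=["canary1"],
        hosts=["app1", "app2", "app3"],
        sync_host=touched.append,
        error_rate=lambda stage: 1.0,
    )
except DeployAborted as exc:
    print(f"aborted: {exc}")
```

In this simulation only the canary is ever synced, so a mistake like the one in this incident would affect one host instead of all wikis. A real implementation would also need to decide how to roll back the hosts already synced, which this sketch leaves out.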