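While syncing files to backport a logging enhancement to MediaWiki 1.27.0-wmf.13, changes were propagated in the wrong order: Setup.php, which calls a new SessionManager method, was synced before the SessionManager.php that defines it. This resulted in the HHVM fatal error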
Call to undefined method MediaWiki\Session\SessionManager::checkIpLimits() in /srv/mediawiki/php-1.27.0-wmf.13/includes/Setup.php on line 812
for all requests to all wikis until the updated version of php-1.27.0-wmf.13/includes/session/SessionManager.php was synced to the cluster. The outage lasted approximately 2.5 minutes, from 2016-02-12T19:13 to 2016-02-12T19:16.
[18:30:05] <jouncebot> bd808 tgr anomie: Dear anthropoid, the time has come. Please deploy Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830).
...
[18:37:20] <bd808> Krenair: all clear on mira?
[18:37:22] <Krenair> bd808, yep
...
[19:12:34] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/DefaultSettings.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 16s)
[19:12:38] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:09] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/Setup.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 18s)
[19:14:34] <paladox> wikipedia has gone down for me https://en.wikipedia.org/
[19:14:36] <bd808> shit. synced in wrong order
[19:14:41] <paladox> Request from 10.20.0.104 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 1730353932
[19:14:41] <paladox> Forwarded for: 18.104.22.168, 10.20.0.104, 10.20.0.104, 10.20.0.104
[19:14:41] <paladox> Error: 503, Service Unavailable at Fri, 12 Feb 2016 19:14:22 GMT
[19:14:44] <sjoerddebruin> 503's yep
[19:14:47] <apergos> wikitech empty main page. er?
[19:14:48] <bd808> will be fixed in 2 minutes
[19:14:49] <apergos> anyways
[19:15:04] <gwicke> uh oh, api is throwing lots of 503s
[19:15:12] <bd808> !log Synced files for T125455 in wrong order; broke all wikis
[19:15:26] <bd808> the fix is syncing now :/
[19:15:44] <logmsgbot> !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/session/SessionManager.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (T125455) (duration: 01m 17s)
[19:15:47] <bd808> better?
[19:15:58] <gwicke> bd808: back for me
[19:16:17] <paladox> its back up now.
[19:16:26] <paladox> Thanks for fixing the problem.
[19:16:28] <bd808> sorry everyone. brain fart from me
[19:16:35] <Krenair> woah
[19:16:39] <gwicke> we really ought to stop breaking everything at once
[19:16:55] <bd808> !log Wikis back up thankfully
[19:16:58] <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
- Entirely operator error. The deployer should have understood how the changes were interrelated and performed the sync of SessionManager.php before Setup.php.
- Having the sync-file statements prepared ahead of time in a text document allowed quick action to sync the missing file.
- Use a less risky deployment process. Except for emergencies, always deploy to a canary first, followed by a rolling deploy. Ideally, have a mechanism to automatically detect errors & abort an ongoing deploy. phab:T121597
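The canary-then-rolling process with automatic abort suggested in the last item could look roughly like the following. This is a minimal sketch, not the actual deployment tooling: `staged_deploy`, `sync_host`, `error_rate`, the batch sizes, and the 1% threshold are all hypothetical names and values chosen for illustration, and a real implementation would need a real source of error metrics.

```python
from typing import Callable, Iterable, List, Sequence


class DeployAborted(Exception):
    """Raised when a stage pushes the observed error rate over the threshold."""


def batches(hosts: Sequence[str], size: int) -> Iterable[Sequence[str]]:
    """Yield consecutive slices of `hosts`, each at most `size` long."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def staged_deploy(
    hosts: Sequence[str],
    sync_host: Callable[[str], None],              # hypothetical: pushes the change to one host
    error_rate: Callable[[Sequence[str]], float],  # hypothetical: error fraction seen on these hosts
    canary_count: int = 1,
    batch_size: int = 10,
    max_error_rate: float = 0.01,                  # assumed threshold, not a real setting
) -> List[str]:
    """Sync canaries first, then the remaining hosts in batches.

    After every stage the error rate of the hosts just synced is checked,
    and the deploy stops before touching any further host if it is too high.
    Returns the list of hosts that were synced.
    """
    if canary_count < 1 or batch_size < 1:
        raise ValueError("canary_count and batch_size must be at least 1")

    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    stages = [canaries] + list(batches(rest, batch_size))

    done: List[str] = []
    for stage in stages:
        if not stage:
            continue
        for host in stage:
            sync_host(host)
            done.append(host)
        rate = error_rate(stage)
        if rate > max_error_rate:
            raise DeployAborted(
                f"error rate {rate:.1%} after syncing {len(done)} of {len(hosts)} hosts"
            )
    return done
```

Under these assumptions, a change like the one in this incident would have produced fatal errors on the canary alone, and the abort would have kept it off the rest of the cluster.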