Talk:Incidents/20160123-SessionManagerRolloutFailure

Thanks for writing that up, Bryan, it's a very thorough summary! Some ideas for actionables (apart from the ones already mentioned):

make debugging on mw1017 easier - apart from Ori's suggestion, there is T117020
make that debug log available from logstash (with a short TTL since the volume is huge) and add some field to it which makes it easy to identify requests (e.g. user agent, which can be set to a custom value)
start some test documentation for CentralAuth - there are some branches that are hard to guess (e.g. how it can behave significantly differently depending on the browser's handling of third-party cookies)
figure out how mobile apps can be tested on time (the Android app can visit testwiki, but can it log in on it? and of course beta would be more ideal, and what about the iPhone app?)

--tgr (talk) 06:03, 27 January 2016 (UTC)Reply

* make that debug log available from logstash (with a short TTL since the volume is huge) -- having a different TTL for a given log group would take a special setup in logstash and the kibana front end. Not impossible but not trivial either. The technical reason being that Elasticsearch index fragments are write-only and thus deletes involve flagging documents as deleted and waiting until a significant fraction of the documents in a given index fragment are marked for deletion before actually rewriting the fragment to a new one that excludes the deleted docs and discarding the old one. We put all log events for each day in a single index regardless of log channel and expire logs by dropping entire indices. To cleanly have a different expiration we would need to populate and manage a separate collection of indices for the shorter duration events. --BryanDavis (talk) 18:50, 30 January 2016 (UTC)Reply

Structuring the document

Latest comment: 8 years ago1 comment1 person in discussion

As Gergo said, great writeup Bryan! There's a lot we can learn from this.

A low-priority fixme for someone (possibly me) would be to structure this document to make it more suited to collaborative editing and for organized reading. The current doc is in storytelling form, where a couple of bulleted lists would be useful: "what went well" and "what we should look into". A low-priority phuture Phab ticket phor me: write up retrospective/incident-document/postmortem best practices doc. -- RobLa (talk) 02:28, 5 February 2016 (UTC)Reply

Credit where credit is due

Latest comment: 8 years ago1 comment1 person in discussion

The very detailed list of events was in large part due to help from @Aklapper: who was a great collaborator on the Etherpad. --BryanDavis (talk) 03:28, 5 February 2016 (UTC)Reply

Etherpad archive

From https://etherpad.wikimedia.org/p/SessionManagerRolloutFailure

== What ==

Lots and lots of problems related to cookie handling and the SessionManager code changes.

The change https://gerrit.wikimedia.org/r/#/c/243223/ introduce a manager for sessions and got released with wmf/1.27.0-wmf.11

* {{done}} OAuth granting access to local wiki account with uid matching CentralAuth user due to configuration for central id not being loaded early enough
* Logout not working always; some users reported automatic login firing on the logout screen
** If Gergő's experience was typical, it could probably be fixed by removing the "if ( isset( $metadata['CentralAuthSource'] ) ) {" bit from https://gerrit.wikimedia.org/r/#/c/265749/2/includes/session/CentralAuthSessionProvider.php actually, probably not fixable that way.
*** I'm not sure anyone else reported this; the other reports were before your patch. Probably can't happen anymore (old sessions have been purged), but a sanity check wouldn't hurt nevertheless.
* need_token errors
** Technically, it's clients that have always had broken cookie handling.
** (At least) two groups of clients - the first do not check expiration and send back 'deleted' (we have a workaround now); the other don't send back any cookie, I don't think we have a hypothesis so far for how they get it wrong
*** There's a mention on [[en:Wikipedia:Bot owners' noticeboard]] that someone had endless NeedToken responses because they were using http rather than https, and switching to https fixed it.
**** Did we start setting the secure flag on cookies that previously were not using it? (YES!)
**** Brad is testing with http and his bot against mw.o and seems to not be getting an https redirect of api.php for a POST request
* wikisource.org login failures
** Hard to debug without wmf.11 on wikisource...
*** Tracked to cookies getting set for mutiple domains on login and only only one was getting cleared on logout (Gergo knows more)
* 2016-01-22T13:43 Johan reports the error "CAS update failed on user_touched" trying to use MassMessage in https://phabricator.wikimedia.org/T124414
** 2016-01-22T19:00 Anomie merges "User: Fix loading of user_touched" into Core in https://gerrit.wikimedia.org/r/#/c/265776/ to fix T124414
* TODO:RELATED? https://phabricator.wikimedia.org/T124382 related to https://phabricator.wikimedia.org/T124384 (Mobile App)?
* TODO:RELATED? https://phabricator.wikimedia.org/T124414 probably also triggered https://phabricator.wikimedia.org/T124441 in MassMessage?
** Yes, it's the same error.
** Looks like it, based on T124441#1956525. Which means it's fixed by the same patch that fixed T124414.
* TODO:RELATED? 2016-01-22T14:06 Samtar forwards English Wikipedia and OTRS reports in https://phabricator.wikimedia.org/T124417 that connections to Wikimedia projects slow/time out for some users
** Unlikely, since https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Problem_with_loading_Wikimedia_sites was begun on Jan 16.
* 2016-01-23T00:48 bd808 reports in https://phabricator.wikimedia.org/T124510 that the page preview response is not being cached for history navigation
** Found to be caused by removal of session_cache_limiter() calls
* TODO: Deps of https://phabricator.wikimedia.org/T123451
** Most of the ones still open are listed above.
* Sunday report "New accounts not autocreated on loginwiki" in https://phabricator.wikimedia.org/T124601 duplicate of https://phabricator.wikimedia.org/T124468
* wikisource.org login issue split by Tgr into seperate https://phabricator.wikimedia.org/T124620
* The resetGlobalUserTokens.php script is really slow.
* Users were being autocreated unattached to their CentralAuth accounts: https://phabricator.wikimedia.org/T74791#1953065


Other things going on at the same time on 2016-01-22:
* HTCP purge flood
** Investigated on Jan 21. Seems to be due to Wikidata cron jobs on terbium that triggers a wide range of cdnPurges and htmlCacheUpdate jobs.
** https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
* rolling restarts for kernel upgrades
* pc1001 maintenance (only a depool and repool, maintenance was not done)

== Summary ==
MediaWiki 1.27.0-wmf.11 included the SessionManager component and related work from the AuthManager project. This set of work changed most of the PHP session handling code for MediaWiki. The code went through thorough code and security review and as much testing on the beta cluster as could be done (merged right after the prior branch cut). A few small issues were spotted early after deployment to the group0 testing wikis but the big scary issues were not found or reported until the code started getting wider usage on the group1 and group2 wikis.

The long tail of issues seem to be mostly related to cookie handling by various user agents. There were several changes to cache and cookie headers which seemed trivially better in code review that turned out to be problematic when intereacting with the very large and diverse set of user agents which visit the Wikimedia projects. Some examples:
* PHP sending a "Pragma: no-cache" header during session creation <https://phabricator.wikimedia.org/T124510>
* "secure" flag being added to cookies returned for non-TLS requests (at least part of <https://phabricator.wikimedia.org/T124252>)

Not having TLS on the beta cluster makes it unlikely that we would have found several of these before group0.
Many of the issues didn't show up in production until .11 hit group1 and login wiki.
Setting up mw1017 to run .11 after the .10 rollback has helped in debugging immensely. Being able to run mw1017 from master would be even more awesome.

== Bugs ==
* {{phabT|124126}}: Can neither load the session nor create an empty session 
* {{phabT|124143}}: Infinite recursion between user and session loading
* {{phabT|124224}}: I'm editing in Widar under some else his/her account using oauth
* {{phabT|124252}}: NEED_TOKEN error spike when 1.27-wmf.11 SessionManager was deployed to group1
* {{phabT|124384}}: [Panic] Login broken in app?
* {{phabT|124428}}: SessionHandler causing login problems for Huggle 2 and older
* {{phabT|124406}}: s7 master is executing 54K transactions/s
* {{phabT|124409}}: Logging out immediately logs you back in
* {{phabT|124440}}: Invalidate all users sessions
* {{phabT|124451}}: Don't add claims if a logged in user gets logged out
* {{phabT|124453}}: Meta login fails
* {{phabT|124620}}: SessionManager breaks cookie support check in wikisource.org login
* {{phabT|124468}}: CentralAuth is not logging the user into loginwiki anymore
* {{phabT|124510}}: Page preview response not being cached for history navigation in 1.27.0-wmf11
* {{phabT|124414}}: MassMessage Fatal error: CAS update failed on user_touched
* {{phabT|124335}}: bot_passwords table expected to exist despite wgEnableBotPasswords = false
* {{phabT|74791}}: Newly created accounts are not global (at least, according to OAuth)
* {{phabT|124821}}: SessionManager: unexpectedly logged out after visiting another wiki

* {{phabT|124367}}: User::loadFromSession called before the end of Setup.php
* {{phabT|124371}}: Clean up usage of $_SESSION in WMF-deployed extensions

== Timeline ==

* 2016-01-19T23:18 group0 to 1.27.0-wmf.11
* 2016-01-20T01:20 bd808 files https://phabricator.wikimedia.org/T124126 about trending "Can neither load the session nor create an empty session" errors in fatalmonitor
* 2016-01-20T07:26 Tgr files https://phabricator.wikimedia.org/T124143 "Infinite recursion between user and session loading"
* 2016-01-20T19:32 group1 wikis to 1.27.0-wmf.11
* 2016-01-20T19:41 <icinga-wm>     PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=1453316575736&to=1453328979345
* 2016-01-20T20:49 jynus communicates the issue to the last deployer after investigating timing and suggests reverting the change
* 2016-01-20T21:30 reverted group1 wikis to 1.27.0-wmf.10 due to session errors
* 2016-01-20T22:06 anomie syncs https://gerrit.wikimedia.org/r/#/c/265400/ to fix https://phabricator.wikimedia.org/T124224 (OAuth & CentralAuth issue)
* 2016-01-20T22:38 Tgr syncs fixes for T124143 (infiinite recursion) and T124126 (can neither load nor create session)
* 2016-01-20T23:53 Tgr reports in https://phabricator.wikimedia.org/T124252 a steep increase of NEED_TOKEN errors after deploying .11 to group1 
* 2016-01-21T18:57 group2 wikis to 1.27.0-wmf.11
* 2016-01-22T01:48 Krenair files https://phabricator.wikimedia.org/T124380 about mw2020 prompting for password during ssh (only relevant to .10 rollback issues later)
* 2016-01-22T02:56 DBrant reports in https://phabricator.wikimedia.org/T124384 that login via the Wikipedia Android App is not possible
* 2016-01-22T05:16 TimStarling reports mw2020 not responding to ssh during sync-file (only relevant to .10 rollback issues later)
* 2016-01-22T07:09 _joe_ points to kernel restart as likely trigger of mw2020 problems (only relevant to .10 rollback issues later)
* 2016-01-22T07:34 _joe_ reboots mw2020 to finish reinstall triggered by PXE boot when machine was restarted  (only relevant to .10 rollback issues later)
* 2016-01-22T10:02 paravoid asks for someone to look into disk usage on fluorine
* 2016-01-22T10:05 _joe_ finds 109G of session.log on fluorine along with other large logs from CirrusSearch
* 2016-01-22T11:13 jynus files unbreak now https://phabricator.wikimedia.org/T124406 about greatly increased traffic to S7 master
* 2016-01-22T11:25 _joe_ disables "session" log channel with https://gerrit.wikimedia.org/r/#/c/265706/
* 2016-01-22T12:18 Kusma forwards reports from English Wikipedia by creating https://phabricator.wikimedia.org/T124409 ("Logging out immediately logs you back in")
* 2016-01-22T12:39 Reedy alerts people to https://phabricator.wikimedia.org/T124409 (Logging out immediately logs you back in) and suggests .10 revert
* 2016-01-22T12:46 bblack leaves anomie a voicemail about logout issue
* 2016-01-22T12:48 anomie jumps on irc to find out what's going on and is pointed to T124409
* 2016-01-22T12:53 anomie notices that session logs are not going to logstash
* 2016-01-22T12:58 Reedy again suggests reverting to .10 if fixes can't be found for auth related issues
* 2016-01-22T13:02 anomie workes with dcausse to get traces of http activity reproducing logout failure
* 2016-01-22T13:05 _joe_ makes config change and ori syncs to put session logs back into logstash for anomie
* 2016-01-22T13:09 anomie suspects that dcausse is describing login caused by the CentralAuth "check if I'm logged in centrally" JS
* 2016-01-22T13:45 anomie reproduces logout bug himself
* 2016-01-22T14:13 ori asks if there is any reason not to rollback to .10 and maybe just leave group0 on .11
* 2016-01-22T16:06 Anomie live hacks mw1017 (testwiki) to test fix for T124409
* 2016-01-22T16:39 Anomie merges the "SessionManager: Add SessionBackend::setProviderMetadata()" Core fix for T124409 in https://gerrit.wikimedia.org/r/#/c/265750/
* 2016-01-22T16:43 Anomie merges the "SessionManager: Track whether the session is supposed to be CA or Local" CentralAuth fix for T124409 in https://gerrit.wikimedia.org/r/#/c/265751/
* 2016-01-22T17:10 Reedy creates https://phabricator.wikimedia.org/T124440 to request invalidating all users sessions
* 2016-01-22T17:29 anomie syncs fix from bd808 for T124406 (s7 master traffic)
* 2016-01-22T17:51 Niedzelski merges the "Remove expired cookies" Wikipedia Android app fix in https://gerrit.wikimedia.org/r/#/c/265685/ to fix T124384
* 2016-01-22T18:15 Legoktm starts running CentralAuth's resetGlobalUserTokens.php to force session resets for all users for https://phabricator.wikimedia.org/T124440
* 2016-01-22T18:38 Multichill reports in https://phabricator.wikimedia.org/T124451 that his edits were done logged out though being displayed as logged in
* 2016-01-22T18:42 IKhitron reports in https://phabricator.wikimedia.org/T124453 that login on meta fails; Tgr later states that he can only reproduce on wikisource.org
* 2016-01-22T19:54 Anomie reports that CentralAuth is not logging the user into loginwiki anymore in https://phabricator.wikimedia.org/T124468 as a regression created by the hotfixes for https://phabricator.wikimedia.org/T124409 ("Logging out immediately logs you back in") 
* 2016-01-22T21:46 Anomie merges the "SessionManager: Kill getPersistedSessionId()" Core patch in https://gerrit.wikimedia.org/r/#/c/265840/ to fix the loginwiki regression in T124468
SessionManager: Kill getPersistedSessionId()")
* 2016-01-22T22:16 Twentyafterfour runs full scap for https://gerrit.wikimedia.org/r/#/c/263415/ (MWVersions cleanup)
* 2016-01-22T23:47 bd808 merges "Ignore auth cookies with value 'deleted'" CentralAuth and Core patches for NEED_TOKEN errors in T124252: https://gerrit.wikimedia.org/r/#/c/265870/ https://gerrit.wikimedia.org/r/#/c/265869/
* 2016-01-23T00:00 bd808 merges "Only delete cookies which are actually set" CentralAuth and Core patches for NEED_TOKEN errors in T124252: https://gerrit.wikimedia.org/r/#/c/265872/ https://gerrit.wikimedia.org/r/#/c/265871/
* 2016-01-23T00:48 bd808 files https://phabricator.wikimedia.org/T124510 based on reports of local browser cache behavior change on enwiki village pump
* 2016-01-23T00:53 Tgr reports that he has reproduced the CentralAuth automatic login on the logout page issue
* 2016-01-23T01:02 Consensus reached to rollback rather than worry about logout not working over the weekend
* 2016-01-23T01:16 bd808 Reverted all wikis back to 1.27.0-wmf.10
* 2016-01-23T01:25 bd808 Noticed l10n errors in logstash fatal monitor
* 2016-01-23T01:28 bd808 Re-reverted to 1.27.0-wmf.11
* 2016-01-23T01:29 bd808 Determined that all l10n errors were from mw2020.codfw
* 2016-01-23T01:30 bd808 Ran sync-common on mw2020.codfw
* 2016-01-23T01:33 bd808 Re-re-reverted to 1.27.0-wmf.10