Incidents/2022-08-26 Phabricator login issues

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID	2022-08-26 Phabricator login issues	Start	2022-08-26 09:56:00
Task	T316337	End	2022-08-26 10:13:00
People paged	0	Responder count	5
Coordinators	Jcrespo	Affected metrics/SLOs	no published SLOs
Impact	For approximately 17 minutes, some users accessing Wikimedia services had their cookies/session headers being removed, causing them to be logged out and unable to log in, effectively making any logged action impossible. It didn't affect MediaWiki, but it affected other services, the one that was most noticeable was Phabricator.

After testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016. ATS layer prevented Phabricator's (and theoretically, other services too, but it was less impacting) session cookies reaching the service's origin server. 474fb2d didn't work as expected because it hit an ATS bug/missdocumented feature. Cookie data was stored in ts.ctx during do_global_post_remap() and restored in do_global_cache_lookup_complete() but for some reason ts.ctx gets wiped in the middle of those two hooks. 474fb2d also missed the step now performed in hide_cookie_store_response().

After the change was reverted, login issues stopped and users were able to perform regular Phabricator logged-in actions.

Timeline

All times in UTC.

[09:56] <vgutierrez> !log testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016 OUTAGE STARTS HERE
[10:07] <dcaro_away> anyone is doing anything with Phabricator? I'm getting logged out after any action. Several other users agree they are logged out and on log-in they get logged out again.
[10:16] Incident opened. Jcrespo becomes IC.
[10:13] <vgutierrez> !log stop testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016 OUTAGE STOPS HERE (but may be affected for longer, as they realize later they are being logged out)
[10:18] <vgutierrez> if RhinosF1 is affected, my experiment is unrelated
[10:23] Jynus reaches out to Hashar (release engineering)
[10:26] People seem to be able to log in again, and stay logged
[10:29] Antoine texts Andre
[10:33] <andre> I don't see any pointers in https://phabricator.wikimedia.org/people/logs/query/all/ , but indeed a lot of folks seem to get stuck at "Login: Partial Login"
[10:35] The issue is identified as traffic-related: <vgutierrez> cp6016 was stripping the phab session cookie and returning a hit for an anonymous user
[10:35] The ticket is created https://phabricator.wikimedia.org/T316337
[10:47] Issue considered resolved. Followups to continue on ticket.

Detection

The issue was quickly pointed out by several people on IRC, in the SRE channel as it impacted highly visible ongoing work of several engineers and volunteers (dcaro, dhinus, jynus, claime).

No alerts where sent because it only affected logged-in users, while the site acted normally for anonymous users.

Conclusions

This probably should be written by the service owner, but in my understanding, a new test was setup on production with an undetected bug, causing traffic disruption (cookie filtering). I don't know if there is much else to do as the same bug is unlikely to hit again; except maybe enabling some kind of automatic monitoring detection of a similar issue.

What went well?

MediaWiki ("the wikis") was not affected
It caused no long-term consequences (no user data lost, no permanent session lost)
Many people communicated quickly and effectively the issue

What went poorly?

There was no monitoring setup to detect the issue
There was a long tail of checking if the issue was corrected after testing finished / it took a bit to understand the source

Links to relevant documentation

Actionables

Improve monitoring to detect ability to log in/perform logged in actions on production Phabricator (?)

Scorecard

Incident Engagement ScoreCard
	Question	Answer (yes/no)	Notes
People	Were the people responding to this incident sufficiently different than the previous five incidents?	no
	Were the people who responded prepared enough to respond effectively	yes
	Were fewer than five people paged?	no	no pages
	Were pages routed to the correct sub-team(s)?	no	no pages
	Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.	no	no pages
Process	Was the incident status section actively updated during the incident?	yes
	Was the public status page updated?	no
	Is there a phabricator task for the incident?	yes
	Are the documented action items assigned?	yes
	Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?	yes
Tooling	To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.	yes
	Were the people responding able to communicate effectively during the incident with the existing tooling?	yes
	Did existing monitoring notify the initial responders?	no
	Were the engineering tools that were to be used during the incident, available and in service?	yes
	Were the steps taken to mitigate guided by an existing runbook?	no
Total score (count of all “yes” answers above)		8