Incidents/20160712-EchoCentralAuth/Retrospective

ACTIONS!

Greg: first step is to locate the owners list that we believe already exists (and update it)
Give a robust definition of "ownership"
- Matt: Every component should have an owner, even if being an owner doesn't mean being committed to making everything nice. Being an owner could mean that one is simply on the hook if / when severe issues arise. If we define ownership that way, it may be possible to have adequate coverage of all components.
- If we adopt this def'n, then we may want to only consider the owners list "done" when all critical components have at least N (N>=1) owners
- Is an owner a person or a team?
  - Persons move on (for example, there are things that list Chris S. as owner, but he is no longer at the WMF; "Security" may have been a better owner)
  - OTOH, sometimes the "owner" is the person who has the deep misfortune of being most expert.
    - We should recognize that ownership is sometimes a property of experience / knowledge rather than scope. But we should probably also confer special recognition / appreciation when we ask someone to do work on something that falls outside the parameters of their role.
  - Sometimes the owner will be a volunteer, is that sufficient wrt turn-around time?
- bd808: "Who looks at bugs and decides their importance" is more important than "who writes code to fix things", this incident shows that RelEng should own it (own ownership).
- There is mw:Developers/Maintainers . Someone said there are redundant ones, though.
  - https://docs.google.com/spreadsheets/d/1e25O69JxLPYBrwunDDLFGM80eFQuTzghY0EsvvgQR0s/edit#gid=0 , mw:Reading/Component_responsibility
- RelEng needs help "componentizing" the list (enumerating all critical subsystems)
  - Roan recommends James F to help make a first pass
  - Maybe we need two levels of ownership, one level that involves fixing the most severe issues (or taking leadership to ensure they are fixed), and another level (with fewer components) that you keep up to a higher quality standard.

bd808: No one owns CentralAuth. Some people may be under the impression that Reading Infrastructure owns it, but this is not true, and not consistent with Reading Infrastructure's self-understanding.
- Ori: Don't rewrite large swaths of a subsystem unless you have plans to maintain it?
  - Project started before "re-org of doom", Reading Infra faced unappetizing choice of either dropping the project altogether (and forfeiting all of the work that had already been put into it) or forging ahead even without a clear ownership picture.

Greg: will start doing a weekly (on Monday) review of all UBN! tasks
- see also: https://phabricator.wikimedia.org/T140207

Ori to create business metrics task https://phabricator.wikimedia.org/T140942
Adam/Gergo/Brad should discuss if even more instrumentation is needed (maybe not, maybe so) for this particular sort of issue, and what it might look like
- https://grafana.wikimedia.org/dashboard/db/authentication-metrics has login counts but this incident didn't make a dent
  - but as bd808 points out, this is hugely skewed by bots
    - not that badly if you filter to the web IF
- something is seriously wrong if thousands of failed logins per day don't make a dent
  - 1 failed login != 1 missing successful login; we have about 5-10k successful logins per day. We log failures but do not catch and log exceptions in the login process (so this incident did not show up in failure counts). That could be improved, I guess.
    - Thanks
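
One way the gap described above (exceptions in the login process never reaching the failure counts) could be closed is sketched below. This is a minimal illustration, not the actual MediaWiki or CentralAuth code: `attempt_login`, `BadCredentials`, `login_metrics` and the `authenticate` callable are all hypothetical names, and the in-memory counter stands in for whatever backend feeds the dashboard.

```python
from collections import Counter

# Hypothetical in-process counters; a real deployment would send these to
# whatever metrics backend feeds the dashboard.
login_metrics = Counter()


class BadCredentials(Exception):
    """Hypothetical: the expected, user-caused kind of login failure."""


def attempt_login(authenticate, username, password):
    """Run one login attempt and record exactly one outcome for it.

    `authenticate` is a stand-in for the real login code. It returns
    normally on success, raises BadCredentials on an ordinary failure, and
    may raise anything else if the login process itself is broken.
    """
    try:
        authenticate(username, password)
    except BadCredentials:
        login_metrics["failure"] += 1
        return False
    except Exception:
        # The case the notes say was missed: count it, then re-raise so
        # the error is still logged and handled upstream.
        login_metrics["exception"] += 1
        raise
    login_metrics["success"] += 1
    return True
```

Keeping "exception" separate from "failure" means a broken login path shows up as its own series instead of being lost or mixed in with mistyped passwords.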

What happened

This should be mostly a duplicate of https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth#Timeline with some more context, yes?
- 2016-07-07: bd808 randomly noticed an IRC call to look at an error message posted on the enwiki village pump. He then fell down the rabbit hole of discovering the bug report and trying to figure out why CentralAuth was blowing up, leading to a patch that would plaster over the failure by ignoring it and then trying to clean up.
- Thousands of users got an error when trying to log in.
- There were a variety of causes, including CentralAuth, Echo and mw-config, that made the issue worse and better over time ( https://phabricator.wikimedia.org/P3370 )
- The issue was reported but the reports were not being looked at by key people until very late (after 2 weeks of login failures)
- [....]
  - Matt: I'm still not sure if/why there was a spike starting 2 weeks before. The Echo spike should only have started July 5th from https://gerrit.wikimedia.org/r/#/c/297434/ (then soon the other wikis as well) (and there was a bigger spike then). There are some signs of a possible spike 2016-06-30-ish though (https://phabricator.wikimedia.org/P3370).
- At one point Bryan investigated and found a backtrace that implicated Echo, posted that to the task
- Ori realized the impact of the bug and rolled back the cluster. Then pinged Roan because of the Echo connection
  - For completeness: rolling the cluster back to the previous wmf.XX version did not resolve the bug.
- Roan saw the IRC ping in #mediawiki-security (on his phone on 2G) and escalated to other Collaboration team members who were not on vacation
- Roan wrote one patch, Matt and Brad (and others?) wrote more

What went poorly

The big issue was not the fact that the bug was introduced nor even that it rolled into production. We have a very complicated software stack and we accept some amount of risk in exchange for the ability to prototype / iterate rapidly. Bugs happen; that's OK. MediaWiki is not (yet) embedded in pacemakers. The big issue is a collective failure to notice this and treat it with the appropriate urgency. That being said, some reflection on what is missing in our ContInt stack would be fruitful.

UBN! for months, but business as usual.
- Was it clear who should be working on this? "UBN!" theoretically means "everybody responsible drops everything now and works on this". Was it clear who "everybody responsible" was in this case? Were they aware?
- no, it did not have an owner. (see also below)
  - RelEng should just block all deployments when this happens (="this" being a high-severity issue with no clear ownership / mitigation plan)
    - So did RelEng know about this UBN bug?
    - As for why Greg didn't, he/I basically processed this task the same as Andre did: https://phabricator.wikimedia.org/T140207#2457777 "There has always been some activity on the task hence it never really felt 'stuck', it just took too long. So while it was on my watchlist I failed to identify a moment when to escalate."
  - The task was marked UBN in 2015 Dec when it really wasn't.
- Thousands of users got an error when trying to log in.
- The Collaboration team was not contacted via a persistent communication mechanism, but by pinging one member (who was on vacation) in a private IRC channel
  - Not entirely fair. It's not as though Ori thought this was adequate / complete communication. I had to go AFK and you noticed in the interim.
    - Fair enough, you did intend to do this properly but ran out of time to do it. Then again nobody else did it either.
  - Recommended other ways: email me (and get a vacation responder telling you who to email instead), email the team list, tag the bug with #echo
    - And mention it in IRC in #wikimedia-collaboration, and maybe ping someone on Hangouts.
- Nobody "owns" CentralAuth. At the time the bug was initially reported, Legoktm was the closest person to an "owner", since he had been the last to work on big changes when completing the global rename project. Why does no Product Vertical own authentication?
  - Probably because the "verticals" don't focus on horizontal things.
    - I don't think that's fair, Reading Infra has been doing lots of work on CentralAuth and auth-related things, made some of the changes that broke this, and made some of the changes that fixed this. Perhaps they should be CA's owners since they have the people with practical knowledge about CA?
    - For the record, in my (Brad's) opinion Reading Infrastructure is "Reading" in name only.
      - Sure. But that doesn't mean "verticals cause CA to be ownerless" is true IMO
        - I differentiate between "Team X owns it" and "These people more-or-less own it, regardless of what team they happen to be on". RI as a team doesn't own much of anything I can identify, it has never been clearly defined what the team is supposed to be [omitting rant].
          - Fair enough, and I agree that horizontal things are not covered well. It's just that in this case, I believed there was an owner.

What went well

Generally, once people were on it, they were really on it.
This meeting :)

What still confuses us

CentralAuth
How is it that the problem festered for six days?
Some people thought Reading Infra owned CentralAuth / auth things in general, they say they don't. So who was responsible for fixing this issue?
bd808: Our implicit notion of "ownership" is often "person whom we expect to write out the patch to fix issue", whereas "who notices and flags UBNs" is actually more important in some ways
Are we "safe" right now on this subclass of bug in these components?

Proposal: when a UBN! bug is filed (or when an existing bug is escalated to UBN!), the first order of business process-wise should be assigning a "process" owner. The owner's job is not to implement the fix (though s/he may end up doing that) but to represent / communicate the issue to the wider community, reach out to developers, indicate to RelEng the production impact, make sure it doesn't fall off the radar, etc. (Can we ratify this?) Talk to people and make noise until it is on the map.

If you see a UBN!, you have to assign it to someone and make it clear that you are assigning responsibility. The assigned person is on the hook for reaching out to relevant people, making sure that the required work is assigned and that there is follow-through, etc.

So what happens if the person you assign it to says "no"?
- I (Greg) communicate to their manager. Seriously, this shouldn't be an issue.
  - And if their manager says "no" too? Sometimes it's just not the right person.
    - Find someone who knows better, you know better than me who knows what to do.

Pre-meeting notes

Need for monitoring (and alerts) of "business metrics" (edits, logins, sign-ups)
- Just monitoring the volume of errors emitted by MediaWiki is not enough. There are frequently bugs in production that generate a lot of log noise, but are not high-impact. People are conditioned to ignore it.
  - There should be no situations where log noise is OK. Precisely to avoid situations like this one.
    - Agreed, but longstanding habits / patterns are difficult to change.
      - RelEng needs to ~~beat~~ nudge people then ;)
- It should be easier to find out which deployment coincided with a change in log volume
  - easier than logstash?
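
A rough sketch of what an alert on a "business metric" could look like, as opposed to an alert on error-log volume: compare the current count with a baseline built from comparable past periods and flag a large drop. The function name, the threshold and the numbers are assumptions made for illustration only; none of them come from the existing dashboards.

```python
from statistics import median


def metric_dropped(history, current, max_drop=0.5):
    """Return True if `current` is far below the recent baseline.

    history:  counts for comparable past periods (e.g. the same hour on
              previous days), so daily and weekly cycles cancel out.
    current:  count for the period being checked.
    max_drop: hypothetical threshold; 0.5 means "alert if we lost more
              than half of the usual volume".
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return False
    return current < baseline * (1 - max_drop)


# Illustrative numbers only, not real traffic data.
web_logins_same_hour = [410, 395, 402, 388, 420]
print(metric_dropped(web_logins_same_hour, 390))  # within normal range
print(metric_dropped(web_logins_same_hour, 150))  # large drop
```

Using the median rather than the mean keeps one unusual day from shifting the baseline. Because bot traffic skews the totals, the same check would probably need to run on a filtered series (for example web-interface logins only) to be sensitive enough.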
People should feel empowered to call attention to an issue loudly if they recognize that it is severe and not getting enough attention.
- MatmaRex and Krenair seem to have recognized that things were becoming catastrophic (https://phabricator.wikimedia.org/T119736#2451066). Said so on the task. That's great, but if no one responds, ESCALATE.
Incident response and postmortem practices should be oriented around PROCESS and PRACTICES and TOOLING, not assigning culpability ("Matt is working on an incident report for this" (https://lists.wikimedia.org/pipermail/wikitech-l/2016-July/086044.html ) says to Ori: "It was Matt's fault and he'll answer for it", which is the wrong way to frame this.)
- Who is "me"? That's not the implication I (Roan) got from that, but then I'm used to how incident reports work
- Matt: That wasn't what I meant to convey. I meant more along the lines of, "I was one of the people who helped fix this, and also one of the people maintaining the code that caused the bug. This is what I think happened. Please update if I got stuff wrong, and let's use this to help learn lessons going forward." (I see that quote is actually from Greg's email. I said something similar, "I am already writing an incident report, and I welcome a discussion.", in mine.)
https://lists.wikimedia.org/pipermail/wikitech-l/2016-July/086041.html should have included link to postmortem, a concise summary of what went wrong, and an impact statement.
- Matt: What is the difference between a postmortem, an impact statement, and an incident report?
Grumbling on private channels isn't super useful.
- Grumbling at people who are on vacation also isn't useful
  - But occasionally cathartic