Incident documentation meeting/20160804

2016-08-04

  • Who: Kevin, Faidon, Greg

Agenda

  • Intro to problem (stating the obvious) - 2-5 minutes (Kevin?)
  • Go over what we tried before - 2-5 minutes (Greg)
  • Discuss: What are we trying to address? - 10-16 minutes (all)
    • Addressing lingering incident response actionables that are not completed?
    • Blameless identification of 'hot spots' of instability?
    • something else?
    • all of the above?
  • Next steps - 5 minutes (Kevin?)

Notes

Intro (G): We have outages. Incident reports get filed. In 2014, Greg did quarterly reviews of that quarter's incidents, brought in the people involved, and checked up on progress. Have not done them since, mostly due to new responsibilities (mgmt). Now it's clear that we need to start doing something like that again. This meeting is to start figuring that out. What was good about those in the past? What should we do differently? How will we manage the workload?

F: The 2014 meetings were Mark+Faidon+Greg...anyone else? Greg thinks they would have included the people involved, but not certain. Last time, they were of limited usefulness. Nobody complained when they stopped. The meetings didn't seem to inspire changes among the people who could do so. The meetings didn't have much effect. This was the biggest problem. Wouldn't hurt to have them, but might not add value.

G: Should we have them? If so, how?

F: They're not bad, for sure. Not too helpful for me, but maybe they could be helpful. We could brainstorm.

K: What would the benefit be, in a perfect world?

G: Each report has follow-up items. Some get done, some don't.

F: A reminder wouldn't be a bad thing. Maybe a component isn't deployed any more, so the bug is moot.

K: What triggered this renewed interest?

G: Retro from echo auth issue. ( https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth/Retrospective )

F: Sounded like usual lack of ownership around here. We would often end up at the same place.

K: How did this end up in RelEng?

G: I did them before. Semantically it makes sense, but as F said, ops are the main...customer? responder? to outages. They affect them the most, so they have a lot to gain from improvements.

F: I'm happy for us to own it, but would like any help we can get. I don't want to play a turf game.

G: Ownership is less important than a) will we do them and how, and b) should this be a rotating duty to lighten the load. Wouldn't be possible for me to manage the process and do this every quarter.

F: I'm more interested in making them useful and productive. If they are, we shouldn't have a shortage of people to help.

G: You (F) seemed lukewarm to the idea of reviewing past actionables.

F: My vague memory was that we did that. Between RE and ops, we already knew what happened since we reviewed in weekly meetings. So there were no actions forward. We would ping people, but nothing would really happen. I felt a little depressed.

G: One motivation was that I made this #wikimedia-incident project ( https://phabricator.wikimedia.org/tag/wikimedia-incident/ ) and started adding past incidents to it. I have only gotten back to May so far. 51 tasks since May. Can we make this less depressing by identifying some as no longer important? I would see that as being one of the questions in a review: Is this still a legit task to follow up on?

F: Many of these are months' work to fix, e.g. "Phase out gallium". We're looking for whether people still have these in mind and are working on them. Not top priority for them though.

G: That's the reality. Other priorities get in the way.

F: Google gives each team an SLA and a downtime budget. Prevents teams from cutting too many corners. If a team goes over budget, tech ops can force that team to stop working on new features and address availability until it is back in budget. Interesting idea, but hard to imagine implementing it here. But maybe we could escalate if a team has too many incidents.
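
A rough sketch of the downtime-budget arithmetic F describes, assuming a hypothetical team with a 99.9% availability target over a 30-day period. The class name, field names, and numbers are illustrative assumptions; they are not Google's actual figures or anything agreed in this meeting.

```python
from dataclasses import dataclass


@dataclass
class DowntimeBudget:
    """Hypothetical per-team downtime budget (illustrative only)."""

    team: str
    sla: float              # availability target as a fraction, e.g. 0.999
    period_minutes: float   # length of the budget period in minutes

    @property
    def allowed_minutes(self) -> float:
        # Downtime the team may accumulate in the period without breaching the SLA.
        return (1.0 - self.sla) * self.period_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Negative when the team has overspent its budget.
        return self.allowed_minutes - downtime_minutes

    def over_budget(self, downtime_minutes: float) -> bool:
        return downtime_minutes > self.allowed_minutes


# Assumed example values: 99.9% target over 30 days.
budget = DowntimeBudget(team="example-team", sla=0.999, period_minutes=30 * 24 * 60)
print(round(budget.allowed_minutes, 1))  # about 43.2 minutes of allowed downtime
print(budget.over_budget(60))            # a 60-minute outage exceeds that budget
```

Under these assumed numbers, a single hour-long outage would put the team over budget for the period, which is the point at which the escalation F mentions would kick in.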

G: Just figuring out which teams/projects/components have a high number of open incident follow-ups and communicating to the team, asking for their plan, might help.
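
A minimal sketch of the tally G suggests, assuming the open follow-up tasks have already been exported as simple records. The `team` and `status` fields and the sample data are hypothetical; this does not call the Phabricator API or reflect its data model.

```python
from collections import Counter


def open_followups_by_team(tasks):
    """Count open incident follow-ups per team, highest count first.

    `tasks` is an iterable of dicts with hypothetical keys
    "team" and "status" ("open" or "resolved").
    """
    counts = Counter(t["team"] for t in tasks if t["status"] == "open")
    return counts.most_common()


# Made-up sample data for illustration only.
sample_tasks = [
    {"team": "team-a", "status": "open"},
    {"team": "team-a", "status": "open"},
    {"team": "team-a", "status": "resolved"},
    {"team": "team-b", "status": "open"},
    {"team": "team-c", "status": "resolved"},
]

for team, count in open_followups_by_team(sample_tasks):
    print(team, count)
```

The sorted output gives a short list of teams to contact first, asking each for its plan.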

K: Next steps?

G: Reuse the phab task ( https://phabricator.wikimedia.org/T141287 ) to throw out random ideas. Just adding the incident tag reminded people. Not much impact, but raised awareness. Let's give ourselves 2 weeks to add thoughts to that task. At that point, I'll email the 3 of us if other next steps aren't clear.

F: Any thoughts, Kevin, since this topic is newer to you?

K: Just doing reminders has value. Raises awareness. Doing some small experiment is appealing to me.


DECISION: We will continue the conversation in T141287.

ACTION: Greg will follow up via email in 2 weeks, unless other actions would make that unnecessary.