Talk:Incident response/Process improvement

Discussion

What's an 'incident'?

My suggested criteria for opening an incident, adapted from the SRE Book:

Is the issue unresolved even after an hour's work?
Is the coordinated involvement of many people (more than 3) needed to solve the issue?
Is the issue broadly visible to users?
- Defining exactly what this means is tricky, of course; there are many services and bits of infrastructure that e.g. editors care about but readers don't. We should have some guidelines on defining audiences and incident severity here. (Faidon mentioned that we used to have a framework along these lines: affects readers, affects contributors, affects the movement, ...)
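
To make the three criteria above concrete, here is a minimal sketch of how they could be written down as a check. Everything in it is hypothetical: the `IssueState` type and its field names are made up for illustration, and it assumes that meeting any one criterion is enough to open an incident, which the list above does not actually state. "Broadly visible" is left as a plain flag because, as noted, its definition is still open.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IssueState:
    """Hypothetical snapshot of an ongoing issue (illustrative names only)."""
    time_spent: timedelta        # how long people have been working on it
    responders_needed: int       # people who must coordinate to fix it
    broadly_user_visible: bool   # depends on an audience/severity definition we don't have yet


def should_open_incident(issue: IssueState) -> bool:
    """Return True if the issue meets at least one suggested criterion.

    Assumption: the criteria are combined with OR; the discussion does not say.
    """
    unresolved_after_an_hour = issue.time_spent > timedelta(hours=1)
    needs_many_people = issue.responders_needed > 3
    return (
        unresolved_after_an_hour
        or needs_many_people
        or issue.broadly_user_visible
    )
```

For example, `should_open_incident(IssueState(timedelta(minutes=90), 1, False))` evaluates to `True` under these assumptions, because the issue has been open for more than an hour.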

I also suggest that active incidents have at least two things: an active commander in charge of coordinating the response, and a living document kept up-to-date with the current state of the incident. (Said document might live on wikitech? and/or turn into an incident report/postmortem once the incident is resolved.)

There's also a bunch of this that might tie into a publicly-accessible status dashboard. However I don't think that they can always be one and the same; I can imagine security-related events that we'd like to be Incidents but wouldn't like to be public until after mitigations have happened. (Maybe this is an argument for incidents being somewhere on Phabricator, which already has this capability?) —CDanis 02:08, 21 February 2019 (UTC)

This would work if we had people constantly triaging alerts/pages during the day, but we don't. So we need to first define how we respond to a paging alert, then get to how to escalate it to a full incident involving more people. Giuseppe Lavagetto (talk) 07:44, 21 February 2019 (UTC)

I honestly thought we were close enough to this happening – whenever something pages, I do at least take a look to see if someone else is handling it already – but sadly I guess you are right. —CDanis 20:34, 21 February 2019 (UTC)

Should we include IRC alerts in this discussion? Marostegui 09:00, 6 March 2019 (UTC)

Definition of our work being done

IMO the most important thing about a definition of being 'finished' with our work is that it includes consensus from the whole SRE team (if not on all the details, at least on the broad strokes). —CDanis 02:08, 21 February 2019 (UTC)

While I agree that consensus must be reached, is that the mission of this working group? I mean should we come up with a proposal or also work on modifying it based on everyone's feedback? I would expect the work to become a shared responsibility of the whole SRE team, but I'm ok with the idea of this group leading the discussion. Giuseppe Lavagetto (talk) 07:19, 21 February 2019 (UTC)

At the January offsite (I think) the target mentioned was to come back with a proposal by the June offsite, to then discuss, finalize, and adopt in person. So getting rough consensus would be the working group's last task. JAufrecht (talk) 23:42, 1 March 2019 (UTC)

Agree, but we should not forget that other teams are involved too, like WMCS, Fundraising tech and SRE-capacity within Search Platform and Analytics. Volans (talk) 13:05, 13 March 2019 (UTC)

As far as I know they will also be at the June offsite (citation needed). Marostegui (talk) 13:46, 13 March 2019 (UTC)

Yes, indeed they should AFAIK, but they are not represented in the group, so we should take into account that they might have different needs too, just that. Volans (talk) 14:07, 13 March 2019 (UTC)

Scope of work

I think the most important thing we need to create is a shared vision of the following things:

Culture

Is keeping the systems working at nominal conditions our prime directive?
- If not, what is the prime directive?
What are our responsibilities regarding paging alerts?
- How can we best manage to meet our responsibilities while keeping a life/work balance for everyone?
- How can we prevent specific people/subteams from being overwhelmed by pages/incidents?
How should we react to non-paging alerts?
How can we reduce the number of people who get interrupted by each single alert?
Do we need drills/training in incident response?
Do we want subteams to be responsible only for their "own" alerts?
Can we delegate paging / alerting on services on k8s to the dev teams and act as a backup only?

Technical / Procedural

Create a shared set of practices on how to respond to pages
- Define what a page is and what an incident is
- Define an acceptable SLA for our page response
  Is this meant to be an SLA for the team/person responsible for that page? Marostegui (talk) 09:13, 6 March 2019 (UTC)
  Are we even at the point of having a well-defined notion of what teams are responsible for which pages? My impression is "sometimes" at best. —CDanis 22:01, 7 March 2019 (UTC)
  Not at all. Some things are a bit clearer than others, though, e.g. databases (although e.g. labsdb is kind of an undefined area) and router/switch alerts (there are people with router access, but not all of them respond to alerts or feel confident enough to do so). But other things are very blurry, e.g. MediaWiki alerts. Marostegui (talk) 06:38, 8 March 2019 (UTC)
- Define a policy regarding phone availability and escalation
- Define a typical workflow in responding to a page (from acknowledging the page on IRC to finishing an incident post-mortem), including coordination, helping
- Identify what we could do to ease the process of responding to a page
Alerting improvements
- Define a process for new services / features, including the definition of an SLO
- What is lacking in our alerting system? Focus on finding the largest pain points, especially in relation to the needs outlined above

I realize this list is incomplete, but I'd be incredibly happy if this working group could tackle even half of these points.

I'd propose we focus on a few of these questions (or others we consider more relevant!) in the coming weeks, everyone gives their feedback on-wiki, then we can meet in person, but I'm up for any other solution. Giuseppe Lavagetto (talk) 07:37, 21 February 2019 (UTC)

Agreed. Maybe we should first focus on the meta questions (i.e. what is a page? what is an incident?) to make sure we all work under the same premises, before getting into more technical questions like "define an acceptable SLA" or "define a process for new services". Marostegui 09:06, 6 March 2019 (UTC)

I think these are all great questions. I would love to get into the question of pages vs incidents as well, but maybe we won't have much time left over after this. —CDanis 22:09, 7 March 2019 (UTC)

Draft Charter

Incident response refers to the actions that WMF Service Reliability Engineers take in response to indications of possible failures in the Wikimedia services they are responsible for.

Problem Statement

Alarms (pages et al) are inaccurate and imprecise.
Systems and practices for coordinating response to alarms are inadequate.

Goals

Standardize definitions for related terms
Redesign and document the page response process
Identify and resolve associated culture and practice problems
Get rough consensus from SREs on plan

Non-goals

Implement changes. The initial plan ends with the working group returning a proposal to the full SRE team at the next offsite (Dublin, June 2019) for discussion.
- The working group has decided that it may try experiments, tests, and other small implementation attempts prior to Dublin.

Outputs

Documents
- Proposed Vocabulary
- Proposed solutions
  - Problem
  - Solution Options
    - pros and cons
  - Recommendation
  - Rollout and evaluation notes
- Outlines or SOPs for proposed new processes.
Changes to tools, process, and culture
- limited; see non-goals above.

Relevant reading

Relevant blog post about HashiCorp's incident response process:

https://blog.danslimmon.com/2019/06/24/an-incident-command-training-handbook/