Incident review ritual

From Wikitech

Intro

This document examines the status quo of incident review techniques, identifies weaknesses, and suggests a new process (hereafter called “the ritual”) to address some of the shortcomings.

Problem statement

WMF has a wealth of information in the form of Incident Reports that are fully public. These span back years and are relatively well structured, as there are 2 templates, lightweight and full, that are consistently used to populate them. However, the templates only go so far as to help create the documents and share them more widely. To make matters worse, the Incident Documents are meant to be widely public. That ended up creating the need for a coordination process during the incident, which currently takes the form of another document template.

There are a number of shortcomings regarding the process. A non-exhaustive list can be found below:

  • The quality of Incident Docs varies widely. Some are very good, with a complete timeline, a well written summary, graphs and actionables. Others are terse, with few or no actionables and a summary that helps to understand neither the causes nor the impact of the incident. The vast majority lies somewhere between the 2 ends of the spectrum described above.
  • Extracting information about specific things like number of responders, people paged, number of users impacted, etc. requires a human to read the entire document, understand it and reason about it.
  • Statistical analysis, whether simple or more involved, cannot happen easily. While crude statistics like “How many incidents did we have this month” are possible, it’s not possible to answer questions like “How many people have been paged outside their working hours this month”, “What was the total amount of time incidents lasted for this month” or “How many requests have failed due to incidents this year”. Thus, patterns that are harmful to team morale and damaging to end users are not easy to surface.
  • While incident documents are public, meaning anyone can read them, there is no process to review them systematically and derive enduring value from them. Thus, knowledge about our infrastructure that could be shared easily between Individual Contributors in the context of an incident remains unshared and needs to be re-learned and re-discovered via experience.
  • Newcomers are onboarded into a culture that, while it strives to be blameless, fails to share feelings regarding outages or the fact that people make mistakes. As a result, Imposter Syndrome is a possibility among newcomers as well as people with longer tenure.
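
To illustrate the kind of question that structured data would make answerable, here is a minimal sketch. It assumes each incident has been captured as a record with UTC start and end timestamps and a list of pages, where each page is already marked as inside or outside the person's working hours. The field names, the sample incidents and the people in them are hypothetical and are not taken from any existing WMF tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records; in practice these would come from structured incident data.
incidents = [
    {
        "id": "2021-01-05 example-a",
        "start": datetime(2021, 1, 5, 10, 0, tzinfo=timezone.utc),
        "end": datetime(2021, 1, 5, 10, 45, tzinfo=timezone.utc),
        # (person, True if the page arrived outside that person's working hours)
        "pages": [("alice", False), ("bob", True)],
    },
    {
        "id": "2021-01-20 example-b",
        "start": datetime(2021, 1, 20, 23, 30, tzinfo=timezone.utc),
        "end": datetime(2021, 1, 21, 1, 0, tzinfo=timezone.utc),
        "pages": [("bob", True), ("carol", True)],
    },
]


def in_month(incident, year, month):
    """An incident is attributed to the month in which it started."""
    return incident["start"].year == year and incident["start"].month == month


def total_duration(records, year, month):
    """Total time incidents lasted for the given month."""
    durations = (r["end"] - r["start"] for r in records if in_month(r, year, month))
    return sum(durations, timedelta())


def paged_out_of_hours(records, year, month):
    """Distinct people paged outside their working hours in the given month."""
    return {
        person
        for r in records
        if in_month(r, year, month)
        for person, out_of_hours in r["pages"]
        if out_of_hours
    }


january_total = total_duration(incidents, 2021, 1)
january_out_of_hours = paged_out_of_hours(incidents, 2021, 1)
```

Deciding whether a page fell outside someone's working hours needs per-person schedule data, which this sketch simply assumes is already recorded.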

The aim of this document is to provide a proposal for solving some of the above shortcomings, but not all.

Proposal

Establish a biweekly “Ritual” (that is, a meeting held every 2 weeks) where incident docs will be reviewed. We suggest the following:

  • Timeslot: Every other week, in the complementary timeslot of the large SRE Monday Update meeting, that is 09:00-10:00 PST. We don’t suggest UTC here in order to keep the same expectations as the SRE Monday meeting.
  • SRE attendance is optional but highly recommended, especially for those who have responded to an incident in the meeting’s agenda. The agenda for the next meeting will be shared in advance by the ritual runner. Anyone in WMF is welcome to join too and encouraged to do so if they are eager to learn or have participated in an incident recently.
  • The ritual should prioritize reviewing recent incidents. The group (via the RitualRunner) can diverge from that for incidents that they deem important/impactful. We expect that this way we will use the fresh memories of participants in the most fruitful way. We will work backwards to review older ones as capacity allows. We hope the incident rate will decrease or continue its current trend, thus allowing us to catch up on previous incidents. Be conservative in the number of incidents that are to be reviewed. 1 or 2 will probably cover the entire slot.
  • The incident reviewing ritual should serve at least 2 different purposes. First, create more structured data, allowing us to do more rigorous statistical analysis. Second, share knowledge about the incident, the infrastructure and the processes that are being used during an incident. The former is for our overall benefit, but parts of it can also happen asynchronously, and we encourage doing so.
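
The prioritization rule in the list above (incidents the group deems important first, then the most recent ones, capped at 1 or 2 per slot) can be sketched as a small sorting function. This is only an illustration: the `important` flag, the record layout and the sample backlog are assumptions made for the example.

```python
from datetime import date


def pick_agenda(unreviewed, max_items=2):
    """Order incidents for review: flagged as important first, then newest first."""
    ordered = sorted(
        unreviewed,
        key=lambda inc: (not inc["important"], -inc["day"].toordinal()),
    )
    return [inc["id"] for inc in ordered[:max_items]]


# Hypothetical backlog of incidents that have not been reviewed yet.
backlog = [
    {"id": "2021-01-05 example-a", "day": date(2021, 1, 5), "important": False},
    {"id": "2021-02-10 example-b", "day": date(2021, 2, 10), "important": False},
    {"id": "2020-11-20 example-c", "day": date(2020, 11, 20), "important": True},
]

agenda = pick_agenda(backlog)
```

In practice the choice remains a judgment call for the group and the RitualRunner; the function only shows the default ordering.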

Ritual walkthrough

Assign a person running the ritual

A different person, the “RitualRunner”, should run the ritual every time. It is suggested that initially we track and pass the role between the people participating in the ONFIRE group; however, it is envisioned that later (assuming the process is successful) the responsibility is passed on to a dedicated group of people who alternate the role between them. The role’s duties generally include:

  • Tracking time
  • Making sure all participants have had a chance to contribute
  • Preparing the ritual by:
    • Getting acquainted with the Incident(s) that are going to be reviewed
    • Generating the required doc(s)
    • Prefilling as much information as possible in the scorecard from the available sources
    • Reaching out to responders to inform them they should participate in the ritual (both asynchronously and synchronously)
  • Updating the Runner rotation tracking sheet and informing the next ritual runner.
  • Leading the effort to identify and assign Action Items resulting from the Ritual. Those may be related to the incident itself, the process for documenting the incident, or the Ritual process itself.

Advice to the Ritual Runner

  • Prepare ahead of time.
    • Reach out to people (e.g. responders) who you feel should participate in the Ritual and ask them to attend.
    • Get the doc with the scorecard and the walkthrough ready before the Ritual and share it with at the very least the responders.
    • Ask people (e.g. responders) to pre-fill the scorecard asynchronously.
  • During the meeting, make sure to screen share the doc where the scorecard will be. That will minimize time spent coordinating verbally with other participants.
  • Make sure to track time. Employ tools/technologies that would help you with this (e.g. an alarm). Be flexible when you see interesting discussions but also make sure you don’t let a specific section monopolize the Ritual.
  • Similarly, don’t let individuals monopolize the Ritual either. An incident responder’s recap might be important and useful but extended discussions about a potential solution are best left for a different venue.

Filling in the scorecard

The previously agreed upon Metadata Scorecard should be filled in.

Incident metadata

  • Incident ID: datestamp + service and event
  • Start: YYYY-MM-DD hh:mm:ss
  • Task
  • End: YYYY-MM-DD hh:mm:ss
  • People paged
  • Responder count
  • Coordinators
  • Affected metrics/SLOs
  • Impact: Who was affected and how?
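
As a sketch of how the scorecard fields above could be held as structured data, the following assumes one record per incident. The class name, attribute names, default values and sample incident are hypothetical; only the field list and the `YYYY-MM-DD hh:mm:ss` timestamp format come from the scorecard itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches YYYY-MM-DD hh:mm:ss


def parse_utc(text):
    """Parse a scorecard timestamp and mark it as UTC."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


@dataclass
class Scorecard:
    incident_id: str          # datestamp + service and event
    start: datetime           # UTC
    end: datetime             # UTC
    task: str = ""
    people_paged: int = 0
    responder_count: int = 0
    coordinators: List[str] = field(default_factory=list)
    affected_metrics: str = "No relevant SLOs exist"
    impact: str = ""          # who was affected and how

    def __post_init__(self):
        # Catch swapped or mistyped timestamps early.
        if self.end < self.start:
            raise ValueError("End must not be earlier than Start")

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


# Hypothetical example values.
card = Scorecard(
    incident_id="2021-01-05 example-service outage",
    start=parse_utc("2021-01-05 10:00:00"),
    end=parse_utc("2021-01-05 10:45:00"),
    people_paged=3,
    responder_count=5,
    coordinators=["alice"],
    impact="Hypothetical: some requests failed for 45 minutes",
)
```

Keeping the timestamps as timezone-aware values means durations can be computed directly instead of being re-derived from the prose timeline.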

Metadata Dictionary

Some of the above fields are easy to fill in retrospectively, others not so much. Below are some proposals on how to fill these in during the ritual.

  • Incident ID: Use whatever is already the name of the corresponding page in Wikitech. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in the format described above.
  • UTC Start Timestamp: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
  • UTC End Timestamp: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
  • Incident Task: If one does not exist, ask that one is created. It could just be an umbrella task for the actionables. The title of the task could be “Incident: <Incident ID>”.
  • People Paged: Go through the VictorOps timeline. Make sure to deduplicate SMS/PUSH notifications so that each person is counted once.
  • Responder count: This will currently require grepping IRC logs. This is going to be pretty onerous to do. We suggest updating the Incident Status doc maintained by the IC to record people responding.
  • Coordinator(s): Straight out of the Incident Status doc. Remember to parse the timeline for IC role handovers
  • Relevant Metric(s)/SLO(s) Affected: If an SLO exists in the Published SLOs page, use that. Eventually there will be Grafana dashboards containing SLOs and SLIs as well as a calculation of the error budget. While SLOs are being rolled out, this is expected to not be true for most services, in which case enter “No relevant SLOs exist” and add any relevant metrics that will help quantify the Incident’s impact. It’s imperative to add them when dealing with incidents with direct end-user impact (e.g. edge traffic requests dropped by X%).
  • Summary: Straight out of the public Incident doc or the Incident Status doc if the former does not exist yet.
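
Two of the fields above involve small mechanical steps: deduplicating pages per person and collecting coordinators across IC handovers. The sketch below assumes the paging timeline and the handover entries have already been copied out by hand into simple tuples. It does not call any VictorOps API, and the input shapes and names are hypothetical.

```python
def count_people_paged(page_events):
    """Count distinct people, regardless of how many channels notified each one."""
    return len({person for person, _channel in page_events})


def coordinators_in_order(handovers):
    """Return each IC once, in the order in which they first held the role."""
    seen = []
    for _time, person in handovers:
        if person not in seen:
            seen.append(person)
    return seen


# Hypothetical input: alice was notified by both SMS and PUSH for the same page.
page_events = [("alice", "SMS"), ("alice", "PUSH"), ("bob", "PUSH")]
# Hypothetical handovers: alice hands over to bob, then takes the role back.
handovers = [("10:00", "alice"), ("10:30", "bob"), ("11:15", "alice")]

people_paged = count_people_paged(page_events)
coordinators = coordinators_in_order(handovers)
```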

Advice

Filling in the scorecard should not be costly. Try to get the easy fields pre-populated before the meeting. Chasing down numbers and timelines is not a good use of high bandwidth collective engineering time. Mark the fields you want to talk more about during the meeting. Allow others to bring up fields for extra discussion too, but be conservative.

Make sure the scorecard filling part of the process does not take too long. It’s important that people feel comfortable in these meetings and are not getting the impression we are delving too much into minutiae. A good rule of thumb is to time box the scorecard discussion to something like 15-20 minutes.

Sharing knowledge

Sharing knowledge is a big part of the incident review process. The aim is that more people get acquainted with the infrastructure, leading to them feeling more comfortable with it, owning it more and hopefully leading to shorter incident durations. We suggest this part of the ritual is split into 2 sections, a quick walkthrough and a section with questions.

Walk through the incident

If we only cared about filling in the scorecard, we could do this asynchronously and even have a single person go through it. However, there are benefits in increasing participation and communication bandwidth by doing it in a synchronous way. Increasing inter-team support, communication, learning and cultivating a healthy culture are some obvious ones. Furthermore, it can increase the accuracy as well as the speed with which we fill in the scorecard.

Template

Below is a template of the various items that will probably need to be discussed. The list is not exhaustive, but it is a good starting point.

  • Summary
    • Already in the scorecard, but feel free to expand on this.
  • Leadup (if any)
    • E.g. A configuration change was pushed at <HH:MM>
  • Trigger
    • A celebrity death, a configuration change, a hardware fault
  • Impact
    • Already in the scorecard, but feel free to expand on this. On a second look, people might come up with impacts that have been missed (in which case the scorecard will require an update).
  • Detection
    • Did our monitoring detect the issue? Or not? Why not?
  • Response
    • Covered by the scorecard but you may want to expand on things like:
      • How many were paged? How many responded? Was escalation to managers needed? Was escalation to other ICs needed?
  • Recovery
    • Did it happen on its own? Was action taken? Automated or manual?
  • Other
    • Have there been any other incidents of this nature? If yes, why?
    • Is there some plan that is in the works that would make the incident improbable to happen again? Should we re-prioritize it?

Advice

  • At the start of the Walkthrough, make sure to:
    • Vocally repeat and point out that this IS NOT a process for administering punishment.
    • Vocally differentiate between “anonymous” and “blameless”. We want people to talk about their mistakes so that we can identify issues with our processes and infrastructure so we can all learn. Encourage standing up.
  • Keep the review blameless. No finger pointing, no bullying, constructive criticism only. Be ready to react and protect the ones involved in an incident.
  • Use the template but don’t be constrained by it! Walking through an incident can be done in many ways; one way is to treat it like a story worth telling. Building good storytelling skills takes time and effort, though, and the template will help but can only get you so far. If you have been a responder and want to communicate feelings like fear or anxiety, that’s fine. Others had those feelings too.

Leave space for questions

This is a pretty important part of the ritual. People should ask questions to help them clarify their understanding of the situation and the infrastructure. Better understanding of the infrastructure and the situation will avoid hearsay and Cargo Culting as well as allow us to attack assumptions, made in times past, that no longer apply. This will become especially important if people that are not part of SRE participate in the ritual.

To accomplish the above we suggest that as the ritual runner:

  • Leave plenty of room for questions at the end. 15-20 minutes is our suggestion.
  • Frequently ask whether anyone has questions during the walkthrough.
  • Have a simple starter question ready to break the ice in case there are no questions from the other participants. Ask it yourself and either answer it yourself or have someone else who is knowledgeable enough answer it, preferably the latter.
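
Putting the suggested time boxes together gives a rough budget for the walkthrough itself. The arithmetic below assumes a single incident reviewed in the one-hour slot (09:00-10:00), with 15-20 minutes for the scorecard and 15-20 minutes for questions; the constant and function names are only illustrative.

```python
SLOT_MINUTES = 60     # 09:00-10:00
SCORECARD = (15, 20)  # suggested time box for the scorecard, in minutes
QUESTIONS = (15, 20)  # suggested room for questions, in minutes


def walkthrough_budget(slot=SLOT_MINUTES, scorecard=SCORECARD, questions=QUESTIONS):
    """Return the (shortest, longest) time left for the walkthrough, in minutes."""
    shortest = slot - scorecard[1] - questions[1]
    longest = slot - scorecard[0] - questions[0]
    return shortest, longest


walkthrough_range = walkthrough_budget()
```

Under these assumptions the walkthrough gets roughly 20 to 30 minutes. Reviewing 2 incidents in one slot would shrink each part accordingly, which is one reason to be conservative with the agenda.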