To measure progress outside of standard incident counts and severities, the SRE team has designed an incident scorecard, to be applied in every major incident, measuring the team’s incident response and engagement score. The scorecard is meant to be filled out as part of the Incident review ritual.
Incident Assessment Overview
Defining a scorecard for the incident management effort facilitates tuning the process to achieve our defined objectives. The scorecard is structured in 2 layers:
- Per Incident: this should be a list of items to assess the management of the incident.
- Per Month/Quarter: an aggregate of the per-incident scorecards, used to review trends over time and year over year (YoY).
The organization can extrapolate and report on yearly efforts based on the collected data with these two views. Assessment of the SRE team’s collective progress would be visible at different levels of metrics resolution according to the level of reporting needed.
- Incident: An incident is an outage, security issue, or other operational issue whose severity demands an immediate human response.
- SLI: An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
- SLO: A Service Level Objective (SLO) is an understanding between teams about expectations for reliability and performance, expressed as a target value or range of values for a service level measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound (e.g., more than 99% of all requests are successful), as defined in SLO - Wikitech.
- Incident Engagement: Incident engagement is a set of metrics defined within the WMF that help track how we (SRE) as an organization are responding to incidents based on a PPT model: People, Process, and Tooling.
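To make the SLI and SLO definitions above concrete, here is a minimal sketch of an availability SLI computed as a rate over a measurement window and checked against an SLO of the form lower bound ≤ SLI ≤ upper bound. The function names and the request counts are illustrative assumptions, not part of any WMF tooling.

```python
def success_rate(successes: int, total: int) -> float:
    """SLI: fraction of successful requests within one measurement window."""
    if total == 0:
        raise ValueError("no requests in the measurement window")
    return successes / total


def meets_slo(sli: float, lower: float = 0.0, upper: float = 1.0) -> bool:
    """SLO check of the form: lower bound <= SLI <= upper bound."""
    return lower <= sli <= upper


# Hypothetical window: 100,000 requests, 99,532 of them successful.
sli = success_rate(successes=99_532, total=100_000)
print(meets_slo(sli, lower=0.99))  # True
```

A proxy SLI (for example, an error rate measured at the edge instead of at the service) would plug into the same check; only the way the rate is collected changes.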
The incident document (Full Report) is the authoritative compendium of information for any given incident. In the case of an incident involving private data/information, a skeleton incident document will be posted to Wikitech as a pointer. This document is the completed artifact produced at the end of any incident. It should consist of all the contextual information needed to understand the incident. Currently, Incident status on Wikitech is that format; the proposal is to augment it by expanding its metadata and adding a “scorecard” section that assesses whether the incident was managed effectively, based on the criteria defined to better track incident engagement.
Following is a brief metadata table, proposed to be included at the top of each incident document (and eventually stored in a database, such as Phabricator). The metadata is meant to provide a quick snapshot of the context around what happened during the incident.
|Incident ID||datestamp + service and event||Start||YYYY-MM-DD hh:mm:ss|
|People paged||Responder count|
|Impact||Who was affected and how?|
Some of the above fields are easy to fill in retrospectively, others not so much. Below are some proposals on how to fill these in during the ritual:
- Incident ID: Use whatever is already the name of the corresponding page in Wikitech. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in the format described above.
- Start: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
- End: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
- Task: If one does not exist, ask for one to be created. It could just be an umbrella task for the actionables. The title of the task could be “Incident: <Incident ID>”.
- People paged: Go through the VictorOps timeline. Make sure to deduplicate SMS/PUSH notifications, so that a person paged over both channels is counted once.
- Responder count: This will currently require grepping IRC logs. This is going to be pretty onerous to do. We suggest updating the Incident Status doc maintained by the IC to record people responding.
- Coordinators: Straight out of the Incident Status doc. Remember to parse the timeline for IC role handovers
- Affected metrics/SLOs: If an SLO exists in the Published SLOs page, use that. Eventually there will be Grafana dashboards containing SLOs and SLIs as well as a calculation of the error budget. While SLOs are being rolled out, most services are not expected to have one yet; in that case, enter “No relevant SLOs exist” and add any relevant metrics that will help quantify the incident’s impact. It’s imperative to add them when dealing with incidents that have direct end-user impact (e.g. edge traffic requests dropped by X%).
- Impact and Summary: Impact in one or two sentences. Summary in one or two paragraphs. Copied from the public Incident doc, or the Incident Status doc if the former does not exist yet.
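Two of the steps above are mechanical enough to sketch in code: normalizing Start/End timestamps to UTC in the `YYYY-MM-DD hh:mm:ss` format, and counting people paged after deduplicating SMS/PUSH notifications. This is only an illustration; the `(person, channel)` tuple shape is an assumed stand-in for whatever the VictorOps timeline actually exports, and the names and times are made up.

```python
from datetime import datetime, timedelta, timezone


def format_utc(ts: datetime) -> str:
    """Render a timezone-aware timestamp as 'YYYY-MM-DD hh:mm:ss' in UTC."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def count_people_paged(pages) -> int:
    """Count distinct people, so SMS and PUSH to the same person count once."""
    return len({person for person, _channel in pages})


# Hypothetical page log: alice was notified on two channels.
pages = [("alice", "SMS"), ("alice", "PUSH"), ("bob", "PUSH")]
print(count_people_paged(pages))  # 2

# Hypothetical start time recorded in a UTC+2 local timezone.
start = datetime(2022, 3, 1, 14, 5, 0, tzinfo=timezone(timedelta(hours=2)))
print(format_utc(start))  # 2022-03-01 12:05:00
```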
The Incident status page contains the ongoing status updates and notes/timeline during an incident. In addition, this notepad will feed into the overall incident (post-mortem review) document. Using the Create a new incident report box will allow you to quickly create an incident report.
Following is a proposal based on the three assessment rubrics for this incident’s response efforts, each with its point scale and assessment bracket per item. Low scores indicate poor performance; high scores indicate positive performance. The intent is not to blame or raise concern but to introspect effectively on how an incident played out, without fear of retribution. If anything, low scores should help indicate where to direct attention and priority at an organizational level.
|People||Were the people responding to this incident sufficiently different than the previous five incidents?|
|Were the people who responded prepared enough to respond effectively?|
|Were fewer than five people paged?|
|Were pages routed to the correct sub-team(s)?|
|Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.|
|Process||Was the incident status section actively updated during the incident?|
|Was the public status page updated?|
|Is there a phabricator task for the incident?|
|Are the documented action items assigned?|
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?|
|Tooling||To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.|
|Were the people responding able to communicate effectively during the incident with the existing tooling?|
|Did existing monitoring notify the initial responders?|
|Were the engineering tools that were to be used during the incident, available and in service?|
|Were the steps taken to mitigate guided by an existing runbook?|
|Total score (count of all “yes” answers above)|
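Since the total score is simply the count of “yes” answers across the three rubrics, the tally can be sketched as below. The short question keys are hypothetical labels for the questions in the table above, and the answers are invented sample data.

```python
from typing import Dict

Answers = Dict[str, Dict[str, bool]]  # rubric -> question key -> "yes" (True) / "no" (False)


def rubric_scores(answers: Answers) -> Dict[str, int]:
    """Count the 'yes' answers within each rubric."""
    return {rubric: sum(items.values()) for rubric, items in answers.items()}


def total_score(answers: Answers) -> int:
    """Total score: count of all 'yes' answers across rubrics."""
    return sum(rubric_scores(answers).values())


# Invented answers for one incident; keys are shorthand, not official names.
answers = {
    "People": {
        "different_responders": True,
        "prepared": True,
        "fewer_than_five_paged": False,
        "routed_to_correct_team": True,
        "paged_in_business_hours": False,
    },
    "Process": {
        "status_section_updated": True,
        "public_status_updated": False,
        "phabricator_task": True,
        "action_items_assigned": True,
        "not_a_repeat": True,
    },
    "Tooling": {
        "no_preventing_open_tasks": False,
        "communication_worked": True,
        "monitoring_notified": True,
        "tools_available": True,
        "runbook_used": False,
    },
}

print(rubric_scores(answers))  # {'People': 3, 'Process': 4, 'Tooling': 3}
print(total_score(answers))    # 10
```

Keeping the per-rubric subtotals alongside the total makes it easier to see whether a low score comes from People, Process, or Tooling.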
This scorecard is meant to be filled out as a part of an incident review effort after the incident is complete and the document is written. Part of the metadata and questions can be filled out before the incident review as needed. The goal of the scorecard is to use it as a reflection piece and a conversation starter to help identify gaps in our current IR efforts. Bad scores are not meant to reflect poorly on responders, but to increase the visibility of those gaps and help drive action on them.
The aggregate scorecard is the average of the scores of all the incidents within a specific time period (a quarter in this case). We mainly use the monthly scorecard to tabulate results for the end-of-quarter results.
|FY2021/2022 Q2||FY2021/2022 Q3||FY2021/2022 Q4||FY2022/2023 Q1||FY2022/2023 Q2|
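The aggregation described above can be sketched as a plain average of per-incident total scores grouped by quarter. The quarter labels reuse the column headers; the scores themselves are invented for illustration and do not reflect real incident data.

```python
from collections import defaultdict


def average_by_quarter(scores):
    """Average per-incident total scores for each quarter label.

    `scores` is an iterable of (quarter_label, total_score) pairs.
    """
    by_quarter = defaultdict(list)
    for quarter, score in scores:
        by_quarter[quarter].append(score)
    return {quarter: sum(vals) / len(vals) for quarter, vals in by_quarter.items()}


# Invented sample data: two incidents in Q2, one in Q3.
scores = [
    ("FY2021/2022 Q2", 10),
    ("FY2021/2022 Q2", 12),
    ("FY2021/2022 Q3", 9),
]
print(average_by_quarter(scores))
# {'FY2021/2022 Q2': 11.0, 'FY2021/2022 Q3': 9.0}
```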