Wikimedia Cloud Services team/EnhancementProposals/Decision record T348887 Incident Response Process

Origin task: phab:T348887

Date of the decision: 2024-05-10

People in the decision meeting (alphabetical order):

No decision meeting was held, because we reached an agreement in the Phabricator task.

The following people participated in the discussion on Phabricator:

Decision taken

Option 2.1 was chosen.

A wiki page was created describing the new process: Wikimedia_Cloud_Services_team/Incident_Response_Process.

Problem

When an incident occurs and the WMCS team responds to it, there is no defined process to follow. This can lead to uncertainty and delays in the response to the incident.

Constraints and risks

Not having a process could on some occasions result in a slow or ineffective response to an incident
At the same time, a process would involve additional work, and that could make the response slower instead of faster
We don't have a clear definition of what counts as an incident, or of when the WMCS team is responsible for one
We don't have many people on the team, and the work required to follow a process (e.g. writing detailed incident reports after an incident) might reduce the number of other things we can deliver

Options considered

Option 1

Adopt the [Incident Response Process](https://wikitech.wikimedia.org/wiki/Incident_response/Runbook) used by the Production SRE team, either without any change or with small changes that apply only to the WMCS team.

Pros:

battle-tested process
easier to work together when an incident involves both the WMCS team and other teams

Cons:

designed for a bigger team
designed for production services where incidents can have a much bigger impact compared to WMCS services

Option 2

Write a custom Incident Response Process for the WMCS team, taking inspiration from the Production SRE team but keeping our process separate.

Pros:

we can tailor it to our team
we can evolve the process independently

Cons:

more work to write the process and maintain it
potential source of confusion when an incident involves both the WMCS team and other teams

Option 2.1

Write a custom Incident Response Process that is a subset of the one used by the SRE team, with some minor adaptations to our case. This includes having shared incident reviews, which means we also attend non-WMCS incident reviews and other SREs attend ours (essentially, sharing the same space). We can tweak the shared incident score card template to be reusable for WMCS (adding notes for fields that don't make sense or should be interpreted differently).

Essentially:

Our own "how to handle a page", as that is quite different from the SRE one (no wikimediastatus.net, no incident coordinator, only a few people on call, ...). This might include a section saying "if this is wider than WMCS -> follow SRE process"
Shared "how to document an outage", with minor tweaks (hopefully embedded in the shared doc)
Shared "how to follow up an incident", with minor tweaks (hopefully embedded in the shared doc)

Pros:

Reuses as much of the battle-tested process as we can (incident documentation and follow-up)
Adapts the most critical and custom parts to our unique use case
We get insights from SREs outside the team, and we share our point of view with them

Cons:

Some extra work to maintain our own, non-shared part of the process
Some extra time spent attending SRE incident reviews

Option 3 (status quo)

Don't define any Incident Response Process and self-organize on a case-by-case basis.

Pros:

No additional work/bureaucracy

Cons:

Makes it easier to forget some important steps (e.g. acking the page, updating the status in IRC, writing an incident report)
Time can be lost discussing how to collaborate and how to divide responsibilities
Less transparency, as information is less likely to be shared during and after an incident
Harder to learn from past incidents, if incidents are resolved without writing reports/documentation