Incident response/Lightweight report
A lightweight incident report may be used to document incidents that are so straightforward, well understood, and low-impact that the effort of writing a full-length incident report (IR) isn't justified, either by the expected readership or by the investigative value of the writing process itself.
The lightweight format consists only of the sections most important to improving the reliability of production over time: a narrative summary of the incident and a list of action items to prevent recurrence. The other sections of the full report, which are valuable for analyzing a particularly complex incident, are omitted. This saves substantial time for authors and reviewers in cases where that analysis isn't necessary.
Don't write a lightweight report when the full-length report would actually be valuable. You should write a full-length report if the answer to any of these questions is "yes":
- Did user impact from the incident attract attention from the community or from WMF management? (This IR is valuable because it would have an interested audience.)
- Did the incident highlight—either positively or negatively—something unique or surprising about our stack, tools, or processes in a way that can help SREs to understand it better? (This IR is valuable for SRE education: we can use it to learn about our systems. This is different from "we discovered a problem and we have action items to fix it," which should be true for every incident.)
- After the incident is resolved, is there still uncertainty or disagreement about the chain of events? (In this case, the collaborative process of compiling the timeline and writing the IR can help answer those questions.)
- Can you imagine telling stories about the incident in six months? (The IR is valuable for historical perspective.)
- Has anyone said they’d like to read the full IR? (That’s probably a good enough reason.)