Incident documentation/Report Template

From Wikitech

This is a template for an Incident Report. Replace notes with your own description. Save the incident report as a subpage of the Incident documentation, named in this style: YYYYMMDD-$NameOfService.

Summary

This is a short summary (<= 1 paragraph) of what happened. While keeping it short, try to avoid assuming deep knowledge of the systems involved, and try to differentiate between proximate causes and root causes. Please make sure to remove any private information.

Impact

Who was affected and how? For large-scale outages, estimate: How many queries were lost? How many users affected? etc.
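
For a rough estimate of lost queries, one option is to multiply the normal request rate by the outage duration and by the fraction of requests that failed. The sketch below assumes a constant request rate during the outage. The function name and all numbers in the usage line are hypothetical placeholders, not measurements from a real incident.

```python
def estimate_lost_queries(queries_per_second, outage_seconds, error_fraction):
    """Rough estimate of failed queries during an outage.

    Assumes (hypothetically) a constant request rate for the whole outage.
    error_fraction is the share of requests that failed, between 0 and 1.
    """
    if not 0 <= error_fraction <= 1:
        raise ValueError("error_fraction must be between 0 and 1")
    if queries_per_second < 0 or outage_seconds < 0:
        raise ValueError("rate and duration must be non-negative")
    return round(queries_per_second * outage_seconds * error_fraction)


# Hypothetical inputs: 1000 queries/s, a 3-minute outage, half of requests failing.
print(estimate_lost_queries(1000, 180, 0.5))  # 90000
```

Replace the inputs with figures taken from your own monitoring, and state in the report that the result is an estimate.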

Detection

Was automated monitoring the first to detect the issue? Or was it a human reporting an error?

Timeline

This is a step-by-step outline of what happened to cause the incident and how it was remedied. Include the lead-up to the incident, as well as any epilogue, and clearly indicate when the user-visible outage began and ended.

All times in UTC.

  • 18:13 erroneous change mistakenly deployed to all servers at once OUTAGE BEGINS
  • 18:14 icinga pages for high error rate from foobarwiki
  • 18:16 change rolled back OUTAGE ENDS
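
The duration of the user-visible outage can be derived from the OUTAGE BEGINS and OUTAGE ENDS markers. The following is a minimal sketch using the example entries above. It assumes all times are UTC, in HH:MM form, and on the same day. The helper name `outage_minutes` is a hypothetical choice for illustration.

```python
from datetime import datetime

# Timeline entries copied from the example above: (UTC time, description).
TIMELINE = [
    ("18:13", "erroneous change mistakenly deployed to all servers at once OUTAGE BEGINS"),
    ("18:14", "icinga pages for high error rate from foobarwiki"),
    ("18:16", "change rolled back OUTAGE ENDS"),
]


def outage_minutes(timeline):
    """Return whole minutes between the OUTAGE BEGINS and OUTAGE ENDS entries.

    Assumes both markers fall on the same day (no midnight rollover).
    """
    start = end = None
    for time_str, description in timeline:
        stamp = datetime.strptime(time_str, "%H:%M")
        if "OUTAGE BEGINS" in description:
            start = stamp
        elif "OUTAGE ENDS" in description:
            end = stamp
    if start is None or end is None:
        raise ValueError("timeline must mark both OUTAGE BEGINS and OUTAGE ENDS")
    return int((end - start).total_seconds() // 60)


print(outage_minutes(TIMELINE))  # 3
```

An incident that spans midnight would need full dates in each entry rather than times alone.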

Conclusions

What weaknesses did we learn about and how can we address them?

The following sub-sections should have a couple of brief bullet points each.

What went well?

  • for example: automated monitoring detected the incident, outage was root-caused quickly, etc.

What went poorly?

  • for example: documentation on the affected service was unhelpful, communication difficulties, etc.

Where did we get lucky?

  • for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again, as far as possible, with a Phabricator task linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)