
User:LSobanski (WMF)/ONFIRE/Incident response/Runbook

From Wikitech

If you’ve been paged

  • Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
  • Speak up in #wikimedia-operations to say you got the page and you’re looking at it. Read up in that channel for context.
  • If the alert is a clear false alarm, you can stop here.
  • If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there’s too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
  • Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
  • If you are oncall and the other oncall person is available, agree on who will take IC and who will do the troubleshooting.
  • If you are oncall and the other oncall person is unavailable, alert others by mentioning #page in IRC and do the troubleshooting until other engineers are available. At that point, take IC.
  • Acknowledge either the Icinga or Alertmanager alert(s).
  • Acknowledge the incident on VictorOps.
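
The channel choice in the steps above can be sketched as a small decision function. This is an illustrative sketch only, not an existing Wikimedia tool: the `PageContext` class, its field names, and `choose_channel` are hypothetical. Checking for an attack before checking for alert noise is an assumption based on the order of the bullet; the runbook does not say which takes precedence when both apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageContext:
    """What the responder knows so far (hypothetical field names)."""
    clear_false_alarm: bool = False
    possible_attack: bool = False       # (D)DoS, other attack, or security issue
    too_much_alert_noise: bool = False


def choose_channel(ctx: PageContext) -> Optional[str]:
    """Return the IRC channel to work in, or None if the page is a clear false alarm."""
    if ctx.clear_false_alarm:
        return None  # nothing further to do
    if ctx.possible_attack:
        # Assumption: security concerns are checked first.
        return "#mediawiki_security"
    if ctx.too_much_alert_noise:
        return "#wikimedia-sre"
    return "#wikimedia-operations"
```

For example, `choose_channel(PageContext(too_much_alert_noise=True))` returns `"#wikimedia-sre"`.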

If there was no page, but...

  • If the issue affects users, and three or more people are working on it, there should be an IC.
  • If the issue needs continuous attention and will have to be handed off between people until it’s resolved, there should be an IC.
  • If you’re not sure whether there should be an IC, it’s better to have one. If it turns out to be unnecessary, you can stop later.
  • If you're an SRE and there's no IC yet when one is needed, you should become the IC. Alert others by mentioning #page in IRC, and proceed below.
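
The rules above for deciding whether an unpaged issue needs an IC can be written as a single predicate. This is a minimal sketch for illustration: the function name and parameters are hypothetical, and the inputs are judgment calls made by the people involved, not values read from any monitoring system.

```python
def needs_incident_coordinator(
    affects_users: bool,
    people_working: int,
    needs_handoff: bool,
    unsure: bool = False,
) -> bool:
    """Return True if the issue should have an Incident Coordinator (IC)."""
    # User-facing issue with three or more people working on it.
    if affects_users and people_working >= 3:
        return True
    # Needs continuous attention and will be handed off until resolved.
    if needs_handoff:
        return True
    # When in doubt, have an IC; the role can be dropped later.
    if unsure:
        return True
    return False
```

For example, a user-facing issue with two people working on it and no handoff expected gives `needs_incident_coordinator(True, 2, False)`, which is `False`, unless someone is unsure, in which case the answer is `True`.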