User:LSobanski (WMF)/ONFIRE/Incident response/Runbook
If you’ve been paged
- Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
- Speak up in #wikimedia-operations to say you got the page and you’re looking at it. Read up in that channel for context.
- If the alert is a clear false alarm, you can stop here.
- If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there’s too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
- Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
- If you are oncall and the other oncall person is available, agree on who will take IC and who will do the troubleshooting.
- If you are oncall and the other oncall person is unavailable, alert others by mentioning #page in IRC and do the troubleshooting until other engineers are available; at that point, take IC.
- Acknowledge either the Icinga or Alertmanager alert(s).
- Acknowledge the incident on VictorOps.
In the event of insufficient help/backup, call Mark or a Director. For phone numbers, see Office Wiki's Contact list (restricted).
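The channel choice in the steps above can be read as a small decision rule. The following is only an illustrative sketch: the function name `choose_channel` and its two boolean parameters are hypothetical and not part of any existing tooling; the channel names come from the list above.

```python
def choose_channel(possible_attack: bool, too_noisy: bool) -> str:
    """Return the IRC channel to coordinate in after a genuine page.

    Hypothetical helper: the parameters are assumptions made for
    illustration; the priority order mirrors the runbook text.
    """
    # A (D)DoS, attack, or other security issue takes priority.
    if possible_attack:
        return "#mediawiki_security"
    # Too much alert noise in the default channel: move to the SRE channel.
    if too_noisy:
        return "#wikimedia-sre"
    # Otherwise stay where the page was first acknowledged.
    return "#wikimedia-operations"
```

The security check comes first because the runbook lists it first; whether a noisy security incident should ever go elsewhere is a judgment call the sketch does not try to capture.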
If there was no page, but...
- If the issue affects users, and three or more people are working on it, there should be an IC.
- If the issue needs continuous attention, so you’ll be handing it off until it’s resolved, there should be an IC.
- If you’re not sure whether there should be an IC, it’s better to have one. If it turns out to be unnecessary, you can stop later.
- If you're an SRE and there's no IC yet when one is needed, you should become the IC. Alert others by mentioning #page in IRC, and proceed below.
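As a quick summary, the criteria in this list for whether an unpaged issue needs an IC can be sketched as a single predicate. This is a hedged illustration only: `ic_needed` and its parameters are hypothetical names, not an existing tool, and the thresholds are taken directly from the bullets above.

```python
def ic_needed(affects_users: bool, responders: int,
              needs_handoff: bool, unsure: bool = False) -> bool:
    """Decide whether an issue without a page should have an IC.

    Hypothetical helper; all parameter names are assumptions.
    """
    # User-facing issue with three or more people working on it.
    if affects_users and responders >= 3:
        return True
    # Needs continuous attention and will be handed off until resolved.
    if needs_handoff:
        return True
    # When in doubt, have an IC; the role can be dropped later.
    return unsure
```

For example, `ic_needed(affects_users=True, responders=2, needs_handoff=False)` returns `False`, while the same call with `unsure=True` returns `True`, matching the "better to have one" guidance.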