Incident response/Runbook
Ongoing incident? Create an incident response document (File -> Make a copy) and post an update on the Wikimedia status page.
This is a brief, at-a-glance description of the steps to take when responding to an ongoing incident.
Don't panic. Even when the wikis are down, you have time to communicate.
If you've been paged
- Stop everything else you're doing. If you can, respond even if you're not at your desk.
- Speak up in #wikimedia-operations to say you got the page and you're looking at it. Read up in that channel for context.
- If the alert is a clear false alarm, you can stop here.
- If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there's too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
- Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
- Acknowledge either the Icinga or Alertmanager alert(s).
In the event of insufficient help/backup, call Mark or a Director. For phone numbers, see Office Wiki's Contact list. (restricted)
- Acknowledge the incident on VictorOps.
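The channel choice in the checklist above can be read as a small decision rule. The Python sketch below is illustrative only: the function name `pick_channel` and its two boolean parameters are hypothetical, and letting a possible attack take priority over alert noise is an assumption based on the order in which the checklist states the conditions.

```python
def pick_channel(possible_attack: bool, too_noisy: bool) -> str:
    """Return the IRC channel to coordinate in, following the paged checklist.

    Hypothetical helper: the parameter names are illustrative, not part of
    any existing tool.
    """
    if possible_attack:
        # (D)DoS, other attack, or security issue (assumed to take priority).
        return "#mediawiki_security"
    if too_noisy:
        # Too much alert noise in the default channel.
        return "#wikimedia-sre"
    # Otherwise, stay where the page was announced.
    return "#wikimedia-operations"
```

For example, `pick_channel(possible_attack=False, too_noisy=True)` returns `"#wikimedia-sre"`.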
If there was no page, but...
- If the issue affects users, and three or more people are working on it, there should be an IC.
- If the issue needs continuous attention, so you'll be handing it off until it's resolved, there should be an IC.
- If you're not sure whether there should be an IC, it's better to have one. If it turns out to be unnecessary, you can stop later.
- If you're an SRE and there's no IC yet when one is needed, you should become the IC. Alert others by mentioning #page in IRC, and proceed below.
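The criteria above for when an unpaged issue should have an IC can be summarized as a single predicate. This is a minimal sketch, not an existing tool: the function name `needs_ic` and all of its parameters are hypothetical names chosen for illustration.

```python
def needs_ic(
    affects_users: bool = False,
    people_working: int = 0,
    needs_handoff: bool = False,
    unsure: bool = False,
) -> bool:
    """Return True if the issue should have an Incident Coordinator.

    Hypothetical helper that mirrors the "no page, but..." criteria.
    """
    if affects_users and people_working >= 3:
        # User-facing issue with three or more people working on it.
        return True
    if needs_handoff:
        # Needs continuous attention and will be handed off until resolved.
        return True
    # When in doubt, it is better to have an IC; the role can be dropped later.
    return unsure
```

For example, `needs_ic(affects_users=True, people_working=2)` returns `False`, while adding `unsure=True` to the same call returns `True`.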
To become the Incident Coordinator (IC)
- If there is an offgoing IC, ensure that you are both in agreement about the handoff.
- Announce in IRC, "I am the IC." You are now the IC.
- If there's not yet a status doc, start one by making a copy of the template (File -> Make a copy).
- Update the status doc to say "IC: <your name and IRC nick>" and add the IC handoff to the timeline.
- If it's not already, put a link to the status doc in the topic of #mediawiki_security, along with a few words identifying the incident ("foobaroid OOMs") or at least the date.
When you are the IC
- Communicate, don't deep dive. Resist the temptation to troubleshoot the issue; let others do that. Your job is to keep the big picture. If you're uniquely suited to solve the problem yourself, hand off the IC role to someone else.
- Keep track of what needs to be done, and what everyone is working on. Assign tasks as needed to make sure everything is covered and no one is doing conflicting work.
- Set a timer (or delegate this to someone else): every half hour, make sure to:
- Keep the status doc up to date. When new information comes in, or engineers take action to work on the problem, update the doc.
- Keep wikimediastatus.net up to date. Update our public communication routinely for transparency. (Not sure whether to post on the status page? See Wikimediastatus.net#What_merits_posting_on_the_status_page?)
- Ask questions. It's important for you to be fully informed, and it's also likely that if you don't know the answer, others don't either.
- If you're not sure what someone is doing, ask them.
- If someone was investigating a question and you never saw an answer, follow up.
- If the team agrees "we should do X," ask who is going to do it -- or assign it to someone.
- Using the guidelines on officewiki, evaluate whether you need to notify SRE Directors, Legal, Comms, or WMF leadership. If so, either contact Directors yourself or assign someone to do so.
- Continue to actively work as the IC until you hand off the role to a specific person or until the incident is over.
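The half-hourly check-in mentioned above can be planned ahead of time. The following Python sketch only computes when the next reminders fall due; the function name `checkin_times` and its parameters are hypothetical, and it assumes the 30-minute interval from the list above.

```python
from datetime import datetime, timedelta


def checkin_times(start: datetime, count: int, interval_minutes: int = 30) -> list:
    """Return the next `count` check-in times after `start`.

    Hypothetical helper: at each returned time the IC (or a delegate)
    reviews the status doc and wikimediastatus.net.
    """
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


if __name__ == "__main__":
    # Example: an incident whose IC took over at 12:00.
    for t in checkin_times(datetime(2024, 1, 1, 12, 0), 3):
        print(t.strftime("%H:%M"), "update status doc and status page")
```

In practice a phone timer does the same job; the sketch only makes the cadence explicit.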
When you are not the IC
- Watch IRC while you work. If others are talking to you, make sure you'll know.
- Talk in IRC while you work. Don't take any action without announcing it first. Keep the channel free of unnecessary chatter during the incident.
- Log your actions to the SAL. It's better to log too much than too little.
- In #wikimedia-operations, say !log Restarted foobaroid on xyz1234.
- If the incident is security-sensitive, instead use !log-private in #mediawiki_security for visibility, even though it doesn't actually log anywhere.
- If you need more people to help you, tell the IC.
- If you have a question no one has asked, or you know something no one is talking about, speak up -- even if you think someone must have thought of it already.
- After one person has been the IC for several hours, or if it's near the end of their workday, consider asking them whether they would like a replacement IC.
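The logging convention in the list above (plain !log in #wikimedia-operations, !log-private in #mediawiki_security for security-sensitive incidents) can be sketched as a small formatting helper. The function name `log_command` and its parameters are hypothetical; the sketch only builds the text to type and does not talk to IRC or the SAL.

```python
from typing import Tuple


def log_command(message: str, security_sensitive: bool = False) -> Tuple[str, str]:
    """Return (channel, line) for announcing an action during an incident.

    Hypothetical helper: it formats the IRC line only and sends nothing.
    """
    if security_sensitive:
        # Per the runbook, !log-private is for visibility only and
        # does not actually log anywhere.
        return "#mediawiki_security", f"!log-private {message}"
    return "#wikimedia-operations", f"!log {message}"
```

For example, `log_command("Restarted foobaroid on xyz1234.")` returns `("#wikimedia-operations", "!log Restarted foobaroid on xyz1234.")`.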
To hand over the IC role to another person
- If the incident is in progress, you are the IC until someone takes over from you.
- Make sure the status doc is up-to-date with everything you know.
- Make sure the new IC has a full understanding of the situation so far: what's known, what's unknown, and who's working on what.
- Make sure they know they are the IC.
- Make sure they update IRC and the status doc to show they're the IC.
- You are no longer the IC. Good job!
To resolve the incident and stop being IC
- Even if there's still work to do, you may not need an IC if that work is no longer urgent. When remaining tasks can wait until normal working hours, the IC can end the incident.
- Update the status doc with everything you know. Remind others to do the same. This is much easier now than it will be later. Update the incident status to "resolved."
- Make sure unfinished work is tracked in Phabricator, and tasks are linked from the doc.
- Announce in IRC, "I am resolving the incident." Mention the status of any continuing issues. Make sure to update each channel where the incident was discussed.
- You are no longer the IC. Good job!
- Make sure there are no pending incident reports about outages that happened during your shift.
Writing an incident report
As an IC, you own making sure that all incidents that happened during your shift are correctly filed and scored:
- If the topic is of a sensitive nature (PII or security-related), keep the incident status in Google Docs.
- As a person with first-hand experience of the incident, you should be in a good position to write the initial version of the report, as you will have a good overview of the incident and its evolution over time, even if you are not an expert on the service.
- For areas where you lack expertise, the suggested course of action is to find a subject matter expert to help fill in the details. This can mean co-writing or delegating, depending on the situation.
- If a subject matter expert is not available, or it's not obvious who they might be, contact a manager of the team most likely to own the service.
- Unless further research is needed, writing the report early, while details are fresh in your mind, is highly encouraged. Refining can be done later, during the review stage.
- If for some reason you cannot file the report (e.g. you go on vacation), make sure to find someone to do it for you (e.g. the other person on call with you).
- For the next SRE meeting, add a bullet to the SRE meeting notes for awareness.