Incident response/Runbook

From Wikitech

This is a brief, at-a-glance description of the steps to take when responding to an ongoing incident.

Don’t panic. Even when the wikis are down, you have time to communicate.

If you’ve been paged

  • Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
  • Speak up in #wikimedia-operations to say you got the page and you’re looking at it. Read up in that channel for context.
  • If the alert is a clear false alarm, you can stop here.
  • Use corto in #mediawiki_security to log handover or view previous handover notes, which may be relevant.
  • If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there’s too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
  • Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
  • If you are oncall and the other oncall person is available, agree on who will take IC and who will do the troubleshooting.
  • If you are oncall and the other oncall person is unavailable, alert others by mentioning #page in the #wikimedia-operations IRC channel and do the troubleshooting until other engineers are available. At that point, take IC.
  • Acknowledge either the Icinga or Alertmanager alert(s).
  • Acknowledge the incident on SplunkOnCall.
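
The channel choice in the list above can be read as a small decision rule. The following is only an illustrative sketch, not an existing tool: the function name `choose_channel` and its parameters are hypothetical, and the order of the checks simply follows the order in which the bullet states them.

```python
def choose_channel(possible_attack: bool, too_much_alert_noise: bool) -> str:
    """Return the IRC channel to use while responding to a page.

    Hypothetical helper: the parameter names are assumptions made for
    this sketch. The attack/security check comes first because the
    runbook lists it first.
    """
    if possible_attack:
        # (D)DoS, other attack, or security issue
        return "#mediawiki_security"
    if too_much_alert_noise:
        # Too much alert noise in the default channel
        return "#wikimedia-sre"
    # Otherwise, stay where the page was acknowledged
    return "#wikimedia-operations"
```

For example, `choose_channel(possible_attack=False, too_much_alert_noise=True)` yields `#wikimedia-sre`.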

If you need help

If there was no page, but...

  • If the issue affects users, and three or more people are working on it, there should be an IC.
  • If the issue needs continuous attention, so you’ll be handing it off until it’s resolved, there should be an IC.
  • If you’re not sure whether there should be an IC, it’s better to have one. If it turns out to be unnecessary, you can stop later.
  • If you're an SRE and there's no IC yet when one is needed, you should become the IC.
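
The criteria above can be summarised as a single yes/no check. This is a minimal sketch for illustration only; the function `needs_ic` and its parameter names are hypothetical and are not part of any Wikimedia tooling.

```python
def needs_ic(
    affects_users: bool,
    people_working: int,
    needs_continuous_attention: bool,
    unsure: bool = False,
) -> bool:
    """Decide whether an unpaged issue should have an Incident Coordinator.

    Hypothetical helper mirroring the bullets above:
    - user-facing issue with three or more people working on it, or
    - issue that needs continuous attention (and so will be handed off), or
    - any doubt at all (it is better to have an IC than not).
    """
    if unsure:
        return True
    if affects_users and people_working >= 3:
        return True
    return needs_continuous_attention
```

For example, a user-facing issue with two responders that does not need continuous attention gives `False`, while the same issue with three responders gives `True`.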

To become the Incident Coordinator (IC)

  • If there is an offgoing IC, ensure that you are both in agreement about the handoff.
  • Announce in IRC, “I am the IC.” You are now the IC.
  • Use cortobot in IRC to check whether an incident is already filed. If not, use cortobot to create the incident (see `cortobot help` for syntax). This will create a task and a Status document.
  • Update the status doc to say “IC: <your name and IRC nick>” and add the IC handoff to the timeline.
  • If possible, assign an incident severity to the incident so as to guide response and clarify impact.
  • If it's not already there, put a link to the status doc in the topic of #mediawiki_security, along with a few words identifying the incident (“foobaroid OOMs”) or at least the date.

When you are the IC

  • Communicate, don’t deep dive. Resist the temptation to troubleshoot the issue; let others do that. Your job is to keep the big picture. If you’re uniquely suited to solve the problem yourself, hand off the IC role to someone else.
  • Keep track of what needs to be done, and what everyone is working on. Assign tasks as needed to make sure everything is covered and no one is doing conflicting work.
  • Set a timer (or delegate another user): every half hour, make sure to:
  • Ask questions. It’s important for you to be fully informed, and it’s also likely that if you don’t know the answer, others don’t either.
    • If you’re not sure what someone is doing, ask them.
    • If someone was investigating a question and you never saw an answer, follow up.
    • If the team agrees “we should do X,” ask who is going to do it -- or assign it to someone.
  • Using the guidelines on officewiki, evaluate whether you need to notify SRE Directors, Legal, Comms, or WMF leadership. If so, either contact Directors yourself or assign someone to do so.
  • Continue to actively work as the IC until you hand off the role to a specific person or until the incident is over.

When you are not the IC

  • Watch IRC while you work. If others are talking to you, make sure you’ll know.
  • Talk in IRC while you work. Don’t take any action without announcing it first. Keep the channel free of unnecessary chatter during the incident.
  • Log your actions to the SAL. It’s better to log too much than too little.
    • In #wikimedia-operations, say !log Restarted foobaroid on xyz1234.
    • If the incident is security-sensitive, instead use !log-private in #mediawiki_security for visibility, even though it doesn’t actually log anywhere.
  • If you need more people to help you, tell the IC.
  • If you have a question no one has asked, or you know something no one is talking about, speak up -- even if you think someone must have thought of it already.
  • If one person has been the IC for several hours, or if it’s near the end of their workday, consider asking them if they would like a replacement IC.
  • Check the Temporary incident response steps for ideas
  • At the end of your shift, use cortobot to mention the incident in the handover notes
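
The logging convention above (`!log` in #wikimedia-operations, `!log-private` in #mediawiki_security for security-sensitive incidents) can be sketched as a small formatter. This is an illustration only: `sal_log_line` is a hypothetical helper that builds the text to type into IRC and does not send anything.

```python
from typing import Tuple


def sal_log_line(message: str, security_sensitive: bool = False) -> Tuple[str, str]:
    """Return (channel, line) for logging an action during an incident.

    Hypothetical helper. Per the runbook, !log-private is used for
    visibility only and does not actually log anywhere.
    """
    text = message.strip()
    if not text:
        raise ValueError("log message must not be empty")
    if security_sensitive:
        return "#mediawiki_security", f"!log-private {text}"
    return "#wikimedia-operations", f"!log {text}"
```

For example, `sal_log_line("Restarted foobaroid on xyz1234")` returns the channel `#wikimedia-operations` and the line `!log Restarted foobaroid on xyz1234`.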

To hand over the IC role to another person

  • If the incident is in progress, you are the IC until someone takes over from you.
  • Make sure the status doc is up-to-date with everything you know.
  • Make sure the new IC has a full understanding of the situation so far: what’s known, what’s unknown, and who’s working on what.
  • Make sure they know they are the IC.
  • Make sure they update IRC and the status doc to show they’re the IC.
  • You are no longer the IC. Good job!

To resolve the incident and stop being IC

  • Even if there’s still work to do, you may not need an IC if that work is no longer urgent. When remaining tasks can wait until normal working hours, the IC can end the incident.
  • Announce in IRC, “I am resolving the incident,” and use cortobot to resolve the incident. Reference the status document for any follow-up discussion. Make sure to update each channel where the incident was discussed.
  • Update the status doc with everything you know. Remind others to do the same. This is much easier now than it will be later. Update the incident status to “resolved.”
  • Make sure the Follow-up Action Items (already Done, Accepted or Draft) are captured in the corresponding section of the Status document. Accepted action items must be filed in Phabricator, and the tasks should be linked in the Status document.
  • Once all Action Items are filed, close the Incident Phabricator Task created by cortobot in 'Active Investigations'.
  • You are no longer the IC. Good job!
  • Make sure there are no pending incident reports about outages that happened during your shift

Writing an incident report

As an IC, you own making sure that all incidents that happened during your shift are correctly filed:

  • As a person with first-hand experience of the incident, you should be in a good position to write the initial version of the report, as you will have a good overview of the incident and its evolution over time, even if you are not an expert on the service.
  • For areas you don't have the expertise on, the suggested course of action is finding a subject matter expert to help fill in the details. This can mean co-writing or delegating, depending on the situation.
    • If a subject matter expert is not available, or it's not obvious who they might be, contact a manager of the team most likely to own the service.
  • Unless further research is needed, writing the report early, while details are still fresh, is highly encouraged. Refining can be done later during the review stage.
  • If for some reason you cannot file the report (e.g. you go on vacation), make sure to find someone to do it for you (e.g. the other person on call with you).
  • For the next SRE meeting, add a bullet to the SRE meeting notes for awareness
  • If the incident is worth detailed discussion, or involved teams outside SRE, ask the management group to put it on the agenda of the next Incident review ritual.