Wikimedia Cloud Services team/Incident Response Process

From Wikitech

WMCS Incident Response Process

This is a description of what steps to take when responding to a WMCS incident.

This document is based on the one maintained by the SRE team at Incident_response/Runbook, adapted for the needs of the WMCS team.

More context on why we decided to create this document is at T348887 Decision Request - Incident Response Process.

The oncall process is described at Clinic duties#Oncall_duty.

What is an incident?

One possible definition of an incident is “any unplanned disruption or degradation of a service that is actively affecting users”. But when is a disruption serious enough to require following the process defined on this page?

In WMCS we have not defined (so far) any specific Service Level Objective (SLO). To decide whether an issue should be handled as an incident, here are some criteria (some are a bit vague but we will refine them over time):

  • Service disruption - any event that leads to a fully supported service being unavailable or experiencing severely degraded performance should be considered an incident. This could include server outages, network issues, software bugs causing crashes etc.

  • Impact on users - if the issue affects the end-user experience, such as by preventing users from accessing functionality or causing errors in it, it should be treated as an incident.
  • Thresholds - we should establish thresholds/criteria for what should be considered an incident based on factors like severity, duration, and number of users affected. For example, an incident might be declared if a service is down for more than a certain amount of time or if a certain percentage of users are affected.
  • Business impact - we should consider the business impact of the issue. If it results in breaches of SLOs (which we should consider setting up sooner rather than later) or damage to our reputation, it should be treated as an incident.

  • Cross-service dependencies - any issue that may have cascading effects on other services through dependencies, causing those services to meet any of the criteria above, should be treated as an incident.
  • If you’re not sure whether an issue is an incident, it’s better to declare one. If it turns out to be unnecessary, you can stop following the Incident Response Process later.

Examples of incidents

  • The WMCS Ceph cluster is down
  • ToolsDB is down or read-only
  • More than 50% of Toolforge tools are not responding

Examples of issues that are NOT incidents

  • A single Toolforge tool is not working as expected (we could make an exception for critical tools, but we don’t have a list of critical tools at the moment)
  • A single Cloud VPS VM is not reachable
  • It is not possible to start a new VM in Cloud VPS, but existing VMs are working normally
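The Toolforge example above is the only criterion on this page that comes with a number, so it can be written down as a simple check. The following is a minimal sketch, not an agreed SLO: the function name and its parameters are hypothetical, and the 50% threshold is taken directly from the example list.

```python
def toolforge_outage_is_incident(
    unresponsive_tools: int,
    total_tools: int,
    threshold: float = 0.5,  # "more than 50%" from the examples above
) -> bool:
    """Return True when the share of unresponsive tools exceeds the threshold.

    Hypothetical helper for illustration; the tool counts would have to
    come from whatever monitoring data is available.
    """
    if total_tools <= 0:
        raise ValueError("total_tools must be positive")
    return unresponsive_tools / total_tools > threshold


# A single broken tool is not an incident; a majority outage is.
single_tool = toolforge_outage_is_incident(1, 3000)
majority_down = toolforge_outage_is_incident(1600, 3000)
```

The other criteria remain judgment calls, and the rule "if you're not sure, declare one" still applies.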

WMCS incidents vs Production incidents

WMCS incidents are incidents that affect only services owned by the WMCS team.

Sometimes an incident involves both WMCS and Production services. In this case it should be handled by the SRE team following their runbook. WMCS engineers are welcome to help, but there should be only one Incident Coordinator.

If you’ve been paged

  • Don’t panic. Even if a service is down, you have time to communicate.
  • Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
  • Acknowledge the page on VictorOps. If the page is not acknowledged after 10 minutes, VictorOps will page all WMCS engineers in the same time zone. After 30 minutes from the original page, VictorOps will page all WMCS engineers in all time zones.
  • Speak up in #wikimedia-cloud-admin to say you got the page and you’re looking at it: /me paged: {brief summary of the page message}
  • Acknowledge the alert in Alertmanager. Remember that WMCS alerts are visible using the “team=wmcs” filter.
  • If the alert is a clear false alarm, you can stop here and follow up later to adjust monitoring.
  • If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security.
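The escalation timing described above (10 minutes, then 30 minutes from the original page) can be summarised as a small lookup. This is only an illustrative sketch of the policy as written on this page, not VictorOps configuration or its API. The function name, the returned labels, and the treatment of the exact 10 and 30 minute boundaries are assumptions.

```python
from datetime import timedelta

# Escalation delays, both measured from the original page.
SAME_TIME_ZONE_AFTER = timedelta(minutes=10)
ALL_TIME_ZONES_AFTER = timedelta(minutes=30)


def paged_audience(unacknowledged_for: timedelta) -> str:
    """Describe who has been paged for a page that is still unacknowledged.

    Assumption: escalation happens as soon as the delay is reached.
    """
    if unacknowledged_for >= ALL_TIME_ZONES_AFTER:
        return "all WMCS engineers in all time zones"
    if unacknowledged_for >= SAME_TIME_ZONE_AFTER:
        return "all WMCS engineers in the same time zone"
    return "original recipients only"


audience_after_12_minutes = paged_audience(timedelta(minutes=12))
```

Acknowledging the page promptly keeps the page from escalating to the rest of the team.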

Declaring an incident

Note: most incidents start with a page, but sometimes our alerting system does not detect the issue. You can still declare an incident even if there was no page.

  • If the incident is user-facing, update the topic of #wikimedia-cloud, using !status ongoing incident {a few words describing the issue}
  • If there’s not yet a status doc, start one by making a copy of the template (File -> Make a copy).
    • The document will be automatically saved to the shared Status Documents folder
    • Rename it to include [WMCS] after the date, e.g. 2024-04-15 [WMCS] Something went wrong
  • If the issue also affects services that are not owned by WMCS, you should escalate to the SRE team using Klaxon.
  • Consider setting up a Google Meet incident room where people in the team can discuss the incident, share their screen, etc.
    • Avoid using the Google Meet chat to share links, logs and other information, as that will be lost when the meeting is over. Use IRC instead.
  • If you need help, you can also send a page to somebody else in the team: from the VictorOps web interface, open the incident page and click "Add responders".
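The naming conventions in this section (the status doc title and the !status topic command) are easy to get slightly wrong under pressure. The sketch below builds both strings from the formats shown above. The helper names are hypothetical, and nothing here talks to IRC or Google Docs; it only formats text.

```python
from datetime import date


def status_doc_title(day: date, summary: str) -> str:
    """Build a title like '2024-04-15 [WMCS] Something went wrong'."""
    return f"{day.isoformat()} [WMCS] {summary.strip()}"


def status_topic_command(description: str) -> str:
    """Build the message that updates the #wikimedia-cloud topic."""
    return f"!status ongoing incident {description.strip()}"


title = status_doc_title(date(2024, 4, 15), "Something went wrong")
command = status_topic_command("ToolsDB is read-only")
```

With the example inputs, title is "2024-04-15 [WMCS] Something went wrong" and command is "!status ongoing incident ToolsDB is read-only".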

Managing an ongoing incident

  • Every WMCS incident should have an Incident Coordinator (IC).
  • The IC will be the person in charge of updating the incident status doc (more details below).
  • If you're a WMCS engineer and there's no IC yet, you should become the IC.

To become the IC

  • If there is an offgoing IC, ensure that you are both in agreement about the handoff.
  • Announce in #wikimedia-cloud-admin, “I am the IC.” You are now the IC.
  • Update the status doc to say “IC: <your name and IRC nick>” and add the IC handoff to the timeline.

When you are the IC

  • Communicate, don’t deep dive. Resist the temptation to troubleshoot the issue; let others do that. Your job is to keep the big picture. If you’re uniquely suited to solve the problem yourself, hand off the IC role to someone else.
    • The WMCS team is quite small, so it’s possible you are the only engineer who is online during an incident. As soon as someone else gets online, consider handing off the IC role to them.
  • Keep track of what needs to be done, and what everyone is working on. Assign tasks as needed to make sure everything is covered and no one is doing conflicting work.
  • Set a timer (or delegate this to someone else). Every half hour, make sure to:
    • Keep the status doc up to date. When new information comes in, or engineers take action to work on the problem, update the doc.
    • Send an update to the cloud-announce mailing list. Follow up with more information if the incident is not resolved after a few hours.
    • If the incident has a wider impact on the community, consider sending an update to the wikitech-l mailing list as well.
  • Ask questions. It’s important for you to be fully informed, and it’s also likely that if you don’t know the answer, others don’t either.
    • If you’re not sure what someone is doing, ask them.
    • If someone was investigating a question and you never saw an answer, follow up.
    • If the team agrees “we should do X,” ask who is going to do it -- or assign it to someone.
  • Continue to actively work as the IC until you hand off the role to a specific person or until the incident is over.
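The half-hour update cadence above can be sketched as a schedule of checkpoints with a fixed checklist. This is a minimal illustration, assuming the interval is counted from the moment the IC takes over. The names update_checkpoints and IC_CHECKLIST are hypothetical, and the checklist items paraphrase the bullets above.

```python
from datetime import datetime, timedelta
from itertools import islice
from typing import Iterator

UPDATE_INTERVAL = timedelta(minutes=30)

# What to do at every checkpoint, paraphrased from the list above.
IC_CHECKLIST = (
    "Bring the status doc up to date",
    "Send an update to the cloud-announce mailing list",
    "Consider an update to wikitech-l if the community impact is wider",
)


def update_checkpoints(
    start: datetime, interval: timedelta = UPDATE_INTERVAL
) -> Iterator[datetime]:
    """Yield the times at which the IC should run through IC_CHECKLIST."""
    next_due = start + interval
    while True:
        yield next_due
        next_due += interval


# First three checkpoints for an IC who took over at 09:00.
first_three = list(islice(update_checkpoints(datetime(2024, 4, 15, 9, 0)), 3))
```

In practice a phone timer does the same job; the point is that the cadence is fixed and does not depend on whether anything new has happened.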

When you are working on an incident but you are not the IC

  • Watch IRC while you work.
    • If others are talking to you, make sure you’ll know.
  • Talk in IRC while you work.
    • Say what you are doing and where you're looking - share your dashboards, metrics etc.
    • Don’t take any action without announcing it first.
    • Keep the channel free of unnecessary chatter during the incident.
  • Log your actions to the SAL. It’s better to log too much than too little.
    • In #wikimedia-cloud, say !log admin Restarted foobaroid on xyz1234.
  • If you need more people to help you, tell the IC.
  • If you have a question no one has asked, or you know something no one is talking about, speak up -- even if you think someone must have thought of it already.
  • After one person has been the IC for several hours, or if it’s near the end of their workday, consider asking them if they would like a replacement IC.
  • If you are logging off, make sure that you write a note in IRC about the incident status, and any needed follow-up actions.
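The SAL entry shown above follows a fixed shape: the !log keyword, a project name ("admin" in the example), and a free-text description. The sketch below formats such a line. The helper name is hypothetical, and reading the second word as the project name is an assumption based on the example. It only builds the string to paste into #wikimedia-cloud.

```python
def sal_log_line(project: str, action: str) -> str:
    """Format a one-line SAL entry such as '!log admin Restarted foobaroid on xyz1234'.

    Whitespace (including newlines) in the action is collapsed, because an
    IRC message has to fit on a single line.
    """
    project = project.strip()
    action = " ".join(action.split())
    if not project or not action:
        raise ValueError("both project and action are required")
    return f"!log {project} {action}"


entry = sal_log_line("admin", "Restarted foobaroid on xyz1234")
```

Here entry is "!log admin Restarted foobaroid on xyz1234", matching the example above.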

To hand over the IC role to another person

  • If the incident is in progress, you are the IC until someone takes over from you.
  • Make sure the status doc is up-to-date with everything you know.
  • Make sure the new IC has a full understanding of the situation so far: what’s known, what’s unknown, and who’s working on what.
  • Make sure they know they are the IC.
  • Make sure they update IRC and the status doc to show they’re the IC.
  • You are no longer the IC. Good job!

To resolve the incident and stop being IC

  • Even if there’s still work to do, you may not need an IC if that work is no longer urgent. When remaining tasks can wait until normal working hours, the IC can end the incident.
  • Update the status doc with everything you know. Remind others to do the same. This is much easier now than it will be later. Update the incident status to “resolved.”
  • Make sure unfinished work is tracked in Phabricator, and tasks are linked from the doc.
  • Make sure alerts in VictorOps are resolved. Alerts that are "acknowledged" but not "resolved" will trigger a new page every 24 hours. Most VictorOps alerts resolve automatically when the Prometheus alert is no longer firing.
  • Announce in IRC, “I am resolving the incident.” Mention the status of any continuing issues. Make sure to update each channel where the incident was discussed.
  • You are no longer the IC. Good job!
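Because alerts that are acknowledged but not resolved keep paging every 24 hours, it helps to review the alert list before closing the incident. The sketch below shows that check on a hand-written list. The Alert class, the state labels, and the alert names are hypothetical, and nothing is fetched from VictorOps.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alert:
    name: str
    state: str  # assumed labels: "triggered", "acknowledged", "resolved"


def unresolved_alerts(alerts: Iterable[Alert]) -> List[str]:
    """Return the names of alerts that still need attention before closing."""
    return [alert.name for alert in alerts if alert.state != "resolved"]


# Hypothetical alert names, for illustration only.
remaining = unresolved_alerts(
    [
        Alert("ExampleCephAlert", "resolved"),
        Alert("ExampleToolsDBAlert", "acknowledged"),
    ]
)
```

An empty result means no further pages should be expected from this incident.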

Writing an incident report

If you are the IC who marked the incident as “resolved”, you should make sure that the incident is filed and scored.

  • WMCS incidents that did not involve production services and production SREs are still categorized together with production incidents at Incident status.
  • If the topic is sensitive (PII or security-related), keep the incident status in Google Docs.
  • As a person with first-hand experience of the incident, you should be in a good position to write the initial version of the report, as you will have a good overview of the incident and its evolution over time, even if you are not an expert on the service.
  • For areas you don't have the expertise on, the suggested course of action is finding a subject matter expert to help fill in the details. This can mean co-writing or delegating, depending on the situation.
    • If a subject matter expert is not available, or it's not obvious who they might be, contact a manager of the team most likely to own the service.
  • Unless further research is needed, writing the report early, while details are fresh in your mind, is highly encouraged. Refining can be done later, during the review stage.
  • Add a link to the incident report in the etherpad for the next WMCS weekly meeting.
  • Also add a bullet point to the SRE meeting notes for awareness.