Incident response/Training

This training aims to give the reader functional, direct information for effective incident response, and to instill thought patterns that make for a more prepared response. The reader is assumed to be willing (rather than forced) to train, since they will eventually be thrown into the fire.

Goals

  • Clarify expectations of roles and response.
  • Establish a baseline of best practices, tips and tricks during an incident.
  • Optimize response efforts by drilling standard operating procedures.
  • On-board new on-callers using practical experiences.

Prerequisite knowledge

Scope of responders

Everyone's priority is to respond to the crisis at hand, but each person's responsibilities will vary depending on the role they assume. Until others arrive, you are also responsible for basic communication; others typically arrive very soon afterwards, at which point these responsibilities can change.

Other organizations have more granular response roles, but WMF employs two: Incident Coordinator and Site Reliability Engineer.

Incident Coordinator
The Incident Coordinator (IC) maintains strong communication between all responders. They ensure all engineers are actively reporting their work; they ask pertinent questions about the incident from a high-level perspective (while other engineers dig deep); they assess the technical capabilities of the team, escalating to other engineers more familiar with the affected systems; and they keep the status document up to date, providing a memory of the incident for later review.
Site Reliability Engineer (SRE)
This role does not differ much from the usual SRE role at WMF: the only difference is that the IC becomes their coordinator, directing work toward an immediate goal rather than the general system improvements dictated by managers. The role encompasses technical activities such as forensics, disaster mitigation, and recovery. SREs must keep the Incident Coordinator abreast of their current activities for effective coordination between the incident response team members.

Available tools

There are countless Wikimedia platforms and services, but the following are the most pertinent:

alerts.wikimedia.org
The next generation alerts dashboard. Powered by Karma. On its way to replacing the Icinga web interface. Any day now.
Batphone
A term used to mean "Page all SREs". "Batphone" is on-call when no-one else is scheduled.
Cookbooks
A tool for automating tasks across many servers. Part of a larger ecosystem (Spicerack and Cumin have more in-depth explanations, but they aren't as relevant here). Similar in practice to automation frameworks like Ansible and its playbooks. An illustrative invocation appears at the end of this list.
Grafana
Visualizes metrics. Useful for overviews of what's happening in the WMF stack, be it number of requests sent to specific applications, the CPU usage of a particular server, etc. Compare with OpenSearch.
Icinga
Monitoring system for all Wikimedia services and servers. Automatically sends pages out when critical issues are detected. On its way to being replaced by a similar solution: Prometheus/AlertManager, with the alerts.wikimedia.org frontend for interactive dashboards.
Incident Coordinator
Wrangles team members during an incident. Not responsible for troubleshooting/fixing the problem but instead recording events for later analysis and helping with communication. An ongoing major incident MUST have an incident coordinator. Sometimes known as Incident Commanders in other organizations. Major incidents must only ever be declared by a human, and are considered ongoing until declared closed by the coordinator.
Incident report
A technical document detailing the events of the incident for later review. The Incident Coordinator is usually in charge of collecting the information necessary to fill out the report (but may enlist the help of responders). Any incident must have a follow-up incident report.
Klaxon
The webpage for paging on-call SREs: A way to call for help. If nobody's on call, the Batphone will be paged (see definition).
OpenSearch
Visualizes logs and server events. Useful for seeing more specific details of how the applications are running. Compare with Grafana.
Puppetboard
A web front-end to show all Puppet facts, catalogues, etc. Useful for determining the status of Puppet on various servers.
requestctl
A command-line tool to control access and routing of web requests, mostly throttling and blocking of certain request patterns (e.g. bad actors).
Runbook
A pre-written set of instructions applicable to a specific type of incident. Useful for accurately executing recovery without the overhead of having to think through each action. Also known as a Playbook in other organizations.
Server Admin Log
A page of logs recorded when performing actions worth noting. SREs can browse it for noteworthy events when troubleshooting.
Splunk On-Call
The cloud-service system we use for alerting on-call SREs. Sometimes known as VictorOps (the name of the company before it was acquired). Yes, it's confusing.
Slack
Generally used by non-technical roles at WMF, but serves as a backup for incident coordination if IRC is unavailable (in the #sre-incident-response channel).
Superset
Useful for analysis of webrequests/live traffic. Good for getting a view of how traffic is impacting WMF.
wikimediastatus.net
The public-facing website where we convey outages or significant user-impacting events.
Make sure you have access to each of these tools before you need to use them!
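
For example, one quick, low-stakes way to confirm that your shell access and the cluster-management tooling (Cumin, Cookbooks) work is to log in to a cumin host and run a read-only command or two. This is only a sketch: the host name, cumin alias, and flags below are illustrative assumptions, so check the Cumin and Cookbooks pages for the current entry points.

  # Illustrative only: host names, aliases, and flags may differ.
  ssh cumin1002.eqiad.wmnet                    # any cluster-management (cumin) host
  sudo cookbook -l                             # list the available cookbooks
  sudo cumin --dry-run 'A:cp-text' 'uptime'    # dry run: resolves hosts, changes nothing

If any of these steps fail (SSH keys, sudo rights, missing membership in the right admin groups), fix that now rather than during an outage.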

Incident action guidelines

Priority

During an incident, SRE members assume roles and begin work to restore services: Restoring services is more important than fixing the underlying issue. In general, the process is to:

  • Start operational communication immediately, deferring the rest;
  • Apply speculative or temporary fixes before a full diagnosis is made;
  • Defer analysis of root causes until after the site is back up.

To quote the Google SRE book:

Novice pilots are taught that their first responsibility in an emergency is to fly the airplane; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground. This approach is also applicable to computer systems: for example, if a bug is leading to possibly unrecoverable data corruption, freezing the system to prevent further failure may be better than letting this behavior continue.

Response

At the start of a possible incident, no definite assumptions can be made about who's available for what, as incidents can happen at any time, and can be triggered or responded to by anyone.

Responsibilities
First responder
  • Acknowledge the alert in Icinga/Alertmanager and explicitly mention arrival on IRC (#wikimedia-operations or #mediawiki_security).
  • Handle basic communication until others arrive.
  • Hand off communication and paging for help to someone else if present.
  • Once more responders arrive, explicitly decide who will take over this communication role before joining the investigation.

In the event of insufficient help/backup, call Mark, Faidon, or Greg (phone numbers can be found on the private Contact list). Escalating is never a problem when needed. If you're not sure, escalate.

Decide whether you are going to be the Incident Coordinator. If so, follow the procedures to become IC. From here on out, there are two divergent paths that must work together to resolve the problem. Read up on What the IC does and What everyone else does.

Diagnose the problem

Responsibilities
Incident Coordinator
  • Create a live incident document and fill out the metadata. Post it to the active IRC channel.
  • Decide whether to notify the larger WMF organization (e.g. comms).
  • Decide whether to escalate for further help.
  • Create/maintain the public-facing status page.
  • Create/update the ticket related to the incident.
SRE
  • Review monitoring tools to diagnose the issue.
  • Communicate diagnostic actions to the IC so that they can coordinate efforts.
  • Maintain the bigger picture, avoiding deceptive and unrelated symptoms and root causes.

The Four Golden Signals are Latency (request timing), Traffic (system demand), Errors (rate of failure), and Saturation (constrained resources). Alternatives include the USE method and the RED method, both of which are succinctly detailed and nitpicked in true computer-geek pedantry in this Grafana blog post. While those are good indicators to research, a more general approach is to unearth what the system is doing, determine why the system is behaving that way, and then locate where this is happening. Familiarize yourself with these techniques before an incident to increase your effectiveness when responding.

Other useful starting points include:

  • The server admin log: Most outages are caused by human error, so seek out risky change logs.
  • Alert runbooks and metrics: Much of WMF's monitoring directly tells us the problem and some even link to the runbook. Keep an eye out for these details in the alerts.
  • Superset dashboards can visualize webrequest-sampled logs for traffic inspection.
Remember to keep the IRC channel informed of what you're checking!

Fixing the problem

The workflow for recovery involves:

  1. Formulating an educated solution for immediate remediation or mitigation of the incident;
  2. Notifying (or confirming with, depending on the risk) your team about the upcoming action;
  3. Logging your upcoming action in #wikimedia-operations with !log [message] [ticket], where [ticket] is the ticket the IC created during the incident (see the example after this list);
  4. Applying the action;
  5. Informing the team of the action's progress until completion;
  6. Regrouping with the team for analysis of the action's effect on the system and restarting if necessary.
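
For instance, a log line for step 3 might look like the one below. The message and task number are made up for illustration; use the actual ticket the IC created.

  !log rolling restart of the MediaWiki app servers to flush stale configuration T123456
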
Responsibilities
Incident Coordinator
  • Evaluate risks of proposed actions. Ensure that SREs understand these risks.
  • Keep the status document updated with actions.
  • Update the public-facing status page.
  • Keep other WMF teams updated with the status and estimated timeline of recovery attempts.
SRE
  • Keep the IC up-to-date with proposals to restore services or any needs to progress recovery.
  • Apply technical changes to restore services.
  • Log actions in #wikimedia-operations.

Review

Once services are stabilized and the incident is concluded, it is time to start digging into locating and fixing the underlying issue.

  1. Open Phabricator tasks for any open or follow-up items; when you do, please tag with #SRE-OnFire and any of the following pertinent tags:
    • #wikimedia-incident-actionable
    • #wikimedia-incident-follow-up
    • #wikimedia-incident-prevention
    • #incident-followup
    • #wmde-incident-followup

Finally, while the data is fresh and available, fill out as much as possible of the Incident Metadata section and Incident Scorecard, briefly assessing how things went during the incident.

Practice

There are often no "right answers": the responses below exist only as potential starting points for a meaningful resolution.

Response

You are invited to an event that coincides with your on-call rotation. How could you handle this?

You could plan ahead for this event: Bring a laptop, ensure that you're able to receive pages while at the event, and reserve a quiet place for work in the event of a page.

Alternatively, you could ask a fellow SRE to trade shifts.

Preparing ahead of time relieves your fellow SREs of alert fatigue; getting paged for every incident sucks.

You've acknowledged an incident and have an idea of the problem but are unfamiliar with the technology. Ideas?

You could ask on IRC whether anyone is familiar with the technology. If someone responds, you can still be helpful by assuming the IC role. You could also escalate with Klaxon.

You might feel pretty uneasy at this point, but remember: panicking doesn't help the situation. Take a moment to regain your composure if you have to.

You volunteer to be IC. How do you proceed?

Once the channel has been informed that you're IC, remember your duties: Disengage from troubleshooting and engage in understanding the bigger picture. Start a status document in accordance with the style du jour (as of 2023 it's been changing around a lot). Ask questions: What's happening? Why is this happening (if that's known)? How is this affecting users?

Write down each responder's actions and the times. Keep bugging the responders if they're not conveying information.

If the incident is affecting users, update the status page. If the incident ends up spanning a longer time, keep the status page updated and the incident doc clear (the responders/IC may need to hand off to another person).
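
If it helps, here is a minimal sketch of what a status document can capture; it is not the official template (which, as noted above, keeps changing), just the fields that tend to matter during and after an incident.

  Incident: <short description>
  Status: ongoing / resolved
  Started (UTC): <time>    IC: <name>    Responders: <names>
  User impact: <what users see, and on which wikis/services>
  Timeline / actions taken:
    <UTC time> - <who> - <what they did or found>
  Follow-ups: <tickets, open questions>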

You cannot connect to IRC. It appears to be down. What do you do?

Connect to the #sre-incident-response channel in Slack.

Debugging

A recent deploy has triggered an alert for a spike in 500s. What next?

Deployers are advised to stick around, so reach out on #wikimedia-operations. Depending on the issue at hand, it might be worth rolling back. Remember, your goal here is to restore service as quickly as possible, so in some cases fixing the broken production code forward may be faster than rolling back. Speak up and determine the quickest way forward.

Users are reportedly unable to log in.

If you're unfamiliar with the login flow, figure out what manages user logins by searching Wikitech or asking on IRC. Check Grafana for metrics related to your findings, such as the dashboard for SessionStorage. See if someone on the data-persistence team is around for expertise. Check the SAL to see if someone's been messing with those servers. Search Logstash's homepage for SessionStore and peruse the logs there for clues.

Fixing

Puppet agents are not functional and a change needs deploying

Manually deploying the change might be necessary if services are critical. If you deploy manually:

  • Log your next action in #wikimedia-operations with !log <message>
  • SSH into the cumin host
  • Issue a cumin command that will manually fix the issue (see the sketch after this list)
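
A heavily simplified sketch of what that might look like; the host name, alias, command, and service below are placeholder assumptions, not the fix for any particular incident.

  # Placeholder sketch: pick the real targets and command for your actual failure.
  ssh cumin1002.eqiad.wmnet                                  # any cluster-management host
  sudo cumin --dry-run 'A:cp-text' 'systemctl reload nginx'  # preview which hosts match, no changes
  sudo cumin 'A:cp-text' 'systemctl reload nginx'            # then apply for real
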
Your issue can be solved but will require some time to apply

Your first priority is to restore service, not to fix the underlying issue. If possible, fail over to redundant services to restore service more quickly.

A few IPs have been heavily scraping Wikipedia; Traffic links are maxed out

Your first priority is to restore service. Blocking the offending requests based on unique identifiers (e.g. IP addresses, distinctive user agents) with requestctl is an immediate way to return saturation/utilization to nominal.
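
A rough sketch of the requestctl workflow follows; the object name and exact subcommands are assumptions for illustration, so double-check the requestctl documentation before touching production.

  # Assumed workflow, for illustration only - verify against the requestctl docs.
  sudo requestctl get action                        # review the existing patterns/actions
  sudo requestctl enable cache-text/block_scraper   # enable a (hypothetical) pre-defined block
  sudo requestctl commit                            # push the change out to the edge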

See also