This training aims to give the reader functional, direct information for effective incident response, and to instill the thought patterns that make for a more prepared response. The reader is assumed to be willing (rather than forced) to train, since they will be thrown into the fire eventually.
- Clarify expectations of roles and response.
- Establish a baseline of best practices, tips and tricks during an incident.
- Optimize response efforts by drilling standard operating procedures.
- On-board new on-callers using practical experiences.
Scope of responders
Everyone's priority is to respond to the crisis at hand, but each person's responsibilities will vary depending on the role they assume. Until others arrive, you are also responsible for basic communication. Others typically arrive very soon afterwards, at which point these responsibilities can change.
Other organizations have more granular response roles, but WMF employs two: Incident Coordinator and Site Reliability Engineer.
- Incident Coordinator
- The Incident Coordinator (IC) maintains strong communication between all responders. They ensure all engineers are actively reporting their work; they ask pertinent questions about the incident from a high-level perspective (other engineers are digging deep); they assess the technical capabilities of the team, escalating to other engineers more familiar with the affected systems; and they keep the status document up-to-date, providing a memory of the incident for later review.
- Site Reliability Engineer (SRE)
- This role does not differ much from the usual SRE role at WMF. The only difference is that the IC becomes their coordinator in service of an immediate goal, rather than the general system improvements dictated by managers. The role encompasses technical activities such as forensics, disaster mitigation, and recovery. SREs must keep the Incident Coordinator abreast of current activities so that the incident response team can be coordinated effectively.
There are endless Wikimedia platforms/services but the following are the most pertinent:
- The next generation alerts dashboard. Powered by Karma. On its way to replacing the Icinga web interface. Any day now.
- A term used to mean "Page all SREs". "Batphone" is on-call when no-one else is scheduled.
- A tool for automating tasks across many servers. Part of a larger ecosystem (Spicerack and Cumin have more in-depth explanations but aren't as relevant here). Similar in practice to automation frameworks like Ansible and its playbooks.
- Visualizes metrics. Useful for overviews of what's happening in the WMF stack, be it number of requests sent to specific applications, the CPU usage of a particular server, etc. Compare with OpenSearch.
- Monitoring system for all Wikimedia services and servers. Automatically sends pages out when critical issues are detected. On its way to being replaced by a similar solution, Prometheus/AlertManager and the alerts.wikimedia.org frontend for interactive dashboards.
- Incident Coordinator
- Wrangles team members during an incident. Not responsible for troubleshooting/fixing the problem but instead recording events for later analysis and helping with communication. An ongoing major incident MUST have an incident coordinator. Sometimes known as Incident Commanders in other organizations. Major incidents must only ever be declared by a human, and are considered ongoing until declared closed by the coordinator.
- Incident report
- A technical document detailing the events of the incident for later review. The Incident Coordinator is usually in charge of collecting the information necessary to fill out the report (but may enlist the help of responders). Every incident must have a follow-up incident report.
- The webpage for paging on-call SREs: A way to call for help. If nobody's on call, the Batphone will be paged (see definition).
- Visualizes logs and server events. Useful for seeing more specific details of how the applications are running. Compare with Grafana.
- A web front-end to show all Puppet facts, catalogues, etc. Useful for determining the status of Puppet on various servers.
- A command-line tool to control access and routing of web requests, mostly throttling and blocking of certain request patterns (e.g. bad actors).
- A pre-written set of instructions applicable for a specific type of incident. Useful for accurately executing recovery without so much overhead of thinking through actions. Also known as Playbook in other organizations.
- Server Admin Log
- Page of logs when performing actions worth noting. SREs can browse noteworthy events when troubleshooting.
- Splunk On-Call
- The cloud-service system we use for alerting on-call SREs. Sometimes known as VictorOps (the name of the company before it was acquired). Yes, it's confusing.
- Generally used by non-technical roles at WMF but serves as back-up incident coordination in the event of IRC unavailability (In the
- Useful for analysis of webrequests/live traffic. Good for getting a view of how traffic is impacting WMF.
- The public-facing website where we convey outages or significant user-impacting events.
Incident action guidelines
During an incident, SRE members assume roles and begin work to restore services: Restoring services is more important than fixing the underlying issue. In general, the process is to:
- Start operational communication immediately, deferring the rest;
- Apply speculative or temporary fixes before a full diagnosis is made;
- Defer analysis of root causes until after the site is back up.
To quote the Google SRE book:
Novice pilots are taught that their first responsibility in an emergency is to fly the airplane; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground. This approach is also applicable to computer systems: for example, if a bug is leading to possibly unrecoverable data corruption, freezing the system to prevent further failure may be better than letting this behavior continue.
At the start of a possible incident, no definite assumptions can be made about who's available for what, as incidents can happen at any time, and can be triggered or responded to by anyone.
In the event of insufficient help/backup, call Mark, Faidon, or Greg (phone numbers can be found on the private Contact list). Escalating is never a problem when needed. If you're not sure, escalate.
Decide whether you are going to be the Incident Coordinator. If yes, follow the procedures to become IC. From here on out, there are two divergent paths that must work together to resolve the problem. Read up on What the IC does and What everyone else does.
Diagnose the problem
The Four Golden Signals are Latency (request timing), Traffic (system demand), Errors (rate of failure), and Saturation (constrained resources). Alternatives include the USE method and the RED method, both of which are succinctly detailed and nitpicked in true computer-geek pedantry in this Grafana blog post. While those are good indicators to research, a more general approach is to unearth what the system is doing, determine why it is behaving that way, and then locate where this is happening. Familiarize yourself with these techniques before any incident to increase your effectiveness in response.
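To make the four signals concrete, here is a minimal Python sketch that derives them from a handful of sampled requests. Everything in it is an assumption for illustration: the `Request` record, the `golden_signals` helper, the worker-pool notion of saturation, and the choice of the 95th percentile are hypothetical and do not reflect how WMF dashboards compute these values.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical sampled request: how long it took and its HTTP status."""
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 0-100) of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped so pct=0 returns the minimum.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarize one observation window as the Four Golden Signals."""
    if not requests:
        raise ValueError("need at least one request to summarize")
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: how long requests take (95th percentile here).
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        # Traffic: demand on the system, as requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed server-side.
        "error_rate": server_errors / len(requests),
        # Saturation: how full a constrained resource is (assumed worker pool).
        "saturation": busy_workers / total_workers,
    }


sample = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
signals = golden_signals(sample, window_seconds=2, busy_workers=8, total_workers=10)
for name, value in signals.items():
    print(f"{name}: {value}")
```

During a real incident you would read these values off existing dashboards rather than compute them by hand; the sketch only shows what each signal measures.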
Other useful starting points include:
- The server admin log: Most outages are caused by human error, so seek out risky change logs.
- Alert runbooks and metrics: Much of WMF's monitoring directly tells us the problem and some even link to the runbook. Keep an eye out for these details in the alerts.
- Logs can explain any irregularities in Grafana metrics (WMF does not yet have tracing, sadly).
- Superset dashboards can visualize webrequest-sampled logs for traffic inspection.
Fixing the problem
The workflow for recovery involves:
- Formulating an educated solution for immediate remediation or mitigation of the incident;
- Notifying (or confirming with, depending on the risk) your team about the up-coming action;
- Logging your up-coming action in #wikimedia-operations with !log [message] [ticket], where [ticket] is what the IC has created during the incident;
- Applying the action;
- Informing the team of the action's progress until completion;
- Regrouping with the team for analysis of the action's effect on the system and restarting if necessary.
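The logging step in the workflow above can be sketched as a small helper that builds the `!log [message] [ticket]` line before you paste it into the channel. The function name `format_log_entry`, the whitespace clean-up, and the placeholder ticket ID `T000000` are assumptions made for this example, not part of any WMF tooling.

```python
def format_log_entry(message: str, ticket: str) -> str:
    """Build a '!log [message] [ticket]' line for the operations channel."""
    # Collapse stray whitespace and newlines so the entry stays on one line.
    cleaned = " ".join(message.split())
    if not cleaned:
        raise ValueError("a log entry needs a non-empty message")
    if not ticket.strip():
        raise ValueError("a log entry needs the ticket created by the IC")
    return f"!log {cleaned} {ticket.strip()}"


# Hypothetical action and placeholder ticket ID, for illustration only.
entry = format_log_entry("  restarting  example service on host1 ", "T000000")
print(entry)
```

Composing the line first makes it easy to announce the action to the team and log it with the same wording.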
Are these tags still relevant? Why do many of them sound the same?
Once services are stabilized and the incident is concluded, now is the time to start digging into locating/fixing the underlying issue.
- Open Phabricator tasks for any open or follow-up items; when you do, please tag with #SRE-OnFire and any of the following pertinent tags:
Finally, while the data is still fresh and available, fill out as much as possible of the Incident Metadata section and Incident Scorecard, assessing briefly how things went during the incident.
There are often no "right answers": The responses only exist as potential starting points for a meaningful resolution.
You could plan ahead for this event: Bring a laptop, ensure that you're able to receive pages while at the event, and reserve a quiet place for work in the event of a page.
Alternatively, you could ask a fellow SRE to trade shifts.
Preparing ahead of time relieves your fellow SREs' alert fatigue; getting paged for every incident sucks.
You could ask if someone is familiar with the tech on IRC. If someone responds, you could still be helpful by assuming IC. You could also escalate with Klaxon.
You might feel pretty uneasy at this point but remember, panicking doesn't help the situation. Take a moment to get your composure if you have to.
Once the channel has been informed that you're IC, remember your duties: Disengage from troubleshooting and engage in understanding the bigger picture. Start a status document in accordance with the style du jour (as of 2023 it's been changing around a lot). Ask questions: What's happening? Why is this happening (if that's known)? How is this affecting users?
Write down each responder's actions and the times. Keep bugging the responders if they're not conveying information.
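One way to picture this bookkeeping is a tiny timeline structure that stores who did what and when, then renders the entries in chronological order for the status document. This is only a sketch: the `Timeline` class, its methods, and the responder names are hypothetical, and in practice the record lives in whatever status document format is currently in use.

```python
from datetime import datetime, timezone


class Timeline:
    """Hypothetical in-memory record of responder actions during an incident."""

    def __init__(self):
        self._entries = []

    def record(self, responder, action, when=None):
        # Default to the current UTC time when the IC does not supply one.
        when = when or datetime.now(timezone.utc)
        self._entries.append((when, responder, action))

    def render(self):
        # Sort by timestamp so late-reported actions land in the right place.
        return [
            f"{when:%H:%M} {responder}: {action}"
            for when, responder, action in sorted(self._entries)
        ]


timeline = Timeline()
timeline.record("bob", "rolled back the deployment",
                datetime(2023, 1, 1, 14, 5, tzinfo=timezone.utc))
timeline.record("alice", "acknowledged the page",
                datetime(2023, 1, 1, 14, 2, tzinfo=timezone.utc))
for line in timeline.render():
    print(line)
```

Recording the time alongside each action is what makes the later incident report straightforward to assemble.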
If the incident is affecting users, update the status page. If the incident ends up spanning a longer time, keep the status page updated and the incident doc clear (the responders/IC may need to hand off to another person).
Connect to the #sre-incident-response channel in Slack.
Deployers are advised to stick around, so reach out on #wikimedia-operations. Depending on the issue at hand, it might be worth rolling back. Remember, your goal here is to restore services as quickly as possible, so it might be worth fixing the broken production code. Speak up and determine the quickest way forward.
If unfamiliar, figure out what manages user logins by searching WikiTech or asking on IRC. Check Grafana for metrics related to your findings, such as the dashboard for SessionStorage. Maybe see if someone on the data-persistence team is around for expertise. Check the SAL to see if someone's been messing with those servers. Search Logstash's homepage for SessionStore and peruse the logs there for clues.
Your first priority is to restore services rather than fix the underlying issue. If possible, fail over to redundant services to restore service more quickly.
Your first priority is to restore services. Blocking the offending requests based on unique identifiers (e.g. IP addresses, unique user agents) with requestctl is an immediate way to return saturation/utilization to nominal.
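Before blocking anything, you need to know which identifiers dominate the traffic. The following Python sketch counts sampled request records by a chosen field and reports the values that exceed a given share of the total. The record layout, the `top_talkers` helper, and the 50% threshold are assumptions for illustration (the IPs come from the reserved documentation range); the sketch does not show requestctl syntax, so consult its documentation for the actual blocking step.

```python
from collections import Counter


def top_talkers(records, key, min_share=0.2):
    """Return (value, share) pairs whose share of requests is >= min_share."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    # most_common() yields the heaviest hitters first.
    return [
        (value, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]


# Hypothetical sampled requests; real data would come from traffic logs.
sampled = (
    [{"ip": "192.0.2.10", "user_agent": "example-bot/1.0"}] * 6
    + [{"ip": "192.0.2.20", "user_agent": "Mozilla/5.0"}] * 2
    + [{"ip": "192.0.2.30", "user_agent": "Mozilla/5.0"}] * 2
)

suspects = top_talkers(sampled, key="ip", min_share=0.5)
for value, share in suspects:
    print(f"{value} accounts for {share:.0%} of sampled requests")
```

Running the same function with `key="user_agent"` shows whether a user agent is a better discriminator than the IP address, which matters when the traffic comes from many addresses.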