Incident response

From Wikitech

During an incident, SRE members assume roles and begin work to restore services. Restoring services is more important than fixing the underlying issue. In general, the process is to:

  • Start operational communication immediately, deferring the rest;
  • Apply speculative or temporary fixes before a full diagnosis is made;
  • Defer analysis of root causes until after the site is back up.


As soon as multiple responders are available, team members are assigned roles.

Incident Coordinator
As incidents grow in severity and in the size of the response, communication and coordination become a concern. The Incident Coordinator (IC) coordinates the work of others, asks questions, maintains status documents, and handles communications. Taking the IC role is recommended for a responder when other SREs are more intimately familiar with the failing systems.
ICs ask SREs to perform tasks that are deemed important. They also escalate to other team members when insufficient expertise is represented.
Incidents longer than a few hours typically result in IC handoffs to prevent mental exhaustion.
Site Reliability Engineer
The SRE responds to the Incident Coordinator in order to achieve the immediate goal of resolving the incident. The SRE performs technical activities such as forensics, disaster mitigation, and recovery, and keeps the Incident Coordinator informed of actions and system statuses.

Team communication

Team communication is a vital part of incident response as it eliminates duplication of effort, offers peer-review of actions, helps with incident reports, and eases hand-offs in the event of a multi-hour incident.

General discussion occurs on IRC, in the #wikimedia-operations channel or, for sensitive topics, in #mediawiki_security. Manual actions and interventions are logged to the Server Admin Log using the !log keyword in the #wikimedia-operations IRC channel. For sensitive or security-sensitive steps of an incident response, the !log-private command is used in #mediawiki_security.
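
The logging rule above can be sketched as a small helper that picks the channel and keyword for a manual action. This is only an illustration: the function format_log_line and the LogLine type are hypothetical names, not part of any Wikimedia tooling. The sketch assumes the logged message is simply the keyword followed by free text on a single line. Only the channel names and the !log and !log-private keywords come from the text above.

```python
from dataclasses import dataclass

# Channel names and keywords are taken from the text above.
PUBLIC_CHANNEL = "#wikimedia-operations"
PRIVATE_CHANNEL = "#mediawiki_security"


@dataclass(frozen=True)
class LogLine:
    channel: str  # IRC channel to post in
    text: str     # full line to send, including the keyword


def format_log_line(action: str, sensitive: bool = False) -> LogLine:
    """Hypothetical helper: choose channel and keyword for a manual action."""
    # Assumption: the message is one line of free text, so collapse
    # any newlines or repeated whitespace before sending.
    action = " ".join(action.split())
    if not action:
        raise ValueError("action must not be empty")
    if sensitive:
        return LogLine(PRIVATE_CHANNEL, f"!log-private {action}")
    return LogLine(PUBLIC_CHANNEL, f"!log {action}")


if __name__ == "__main__":
    line = format_log_line("restarting the cache service on one host")
    print(line.channel, line.text)
```

Keeping the channel and the keyword in one return value makes it harder to post a sensitive entry with the public keyword, or the reverse.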

The IC may nag the other team members for clarification in order to keep the status document up to date.