Incident response/In-depth

From Wikitech

If you're in the middle of an incident, see the quick reference instead.

Incident resolution has three elements:

  1. Communicate
  2. Diagnose
  3. Fix

These three can and should be interleaved:

  • Start operational communication immediately, defer the rest.
  • Speculative or temporary fixes can be applied before a full diagnosis is made.
  • Analysis of root causes should generally be done after the site is back up.

Who does what

As the first person arriving (on IRC) to respond to an incident, explicitly announce your presence in the #wikimedia-operations channel.

Until others arrive, you are also responsible for basic communication. Others typically arrive very soon afterwards, at which point these responsibilities can change.

As soon as multiple responders or team members are available, hand off communication and explicitly assign roles before diving into the investigation (see also #When complexity demands explicit coordination: incident coordinators). If you need specific individuals or team members to be paged to help you, ask someone else to page them if possible, so that you can stay focussed on the investigation.

If you are the person receiving a page and deciding to handle it, remember to explicitly acknowledge the Icinga alert. This causes a message to go out and signals to others that the problem is being worked on.

Don't worry about over-escalating. If you're not sure, escalate!

Communicate

  • Operational communication (server admin log, IRC, etc.) must start immediately.
  • Communication with the rest of the organisation and the public should start within 5 minutes.

Peer communication and logging

It's absolutely essential that you communicate your actions to other SREs as you do them. Here are some reasons:

  • It avoids duplication of effort, conflicts over text file edits, etc.
  • It avoids confusing other SREs about the causes of the site changes that they observe. It is difficult enough to diagnose the cause of downtime. If an engineer changes something, and another erroneously attributes the results, then that can significantly slow the diagnosis process.
  • Bus factor. If you say what you are doing, other SREs have a chance of continuing your work should you lose internet connectivity.
  • Sanity review. Responding to site downtime is a high-stress activity and is prone to errors. By writing about your actions and your thoughts, you give others the chance to review and comment on them.
  • It makes post-mortem analysis possible. If actions are unlogged, then reconstructing the order of events becomes very difficult. If you hinder post-mortem analysis, then you make it more likely that the same problem will happen again.

General discussions / synchronisation should occur on IRC, in the channels #wikimedia-operations or (if sensitive) #mediawiki_security. Also use #mediawiki_security for pasting logs that might contain PII (e.g. un-redacted Apache request logs).

Manual actions/interventions should be logged to the Server Admin Log using the !log keyword in IRC, in the #wikimedia-operations channel:

!log Restarted Varnish backend instance on cp1065

For sensitive incidents, or security-sensitive steps of response to any incident, use the !log-private idiom in #mediawiki_security:

!log-private applying router ACL to blackhole foonet

When complexity demands explicit coordination: incident coordinators

As incidents scale in severity and in size of response, the most important concern becomes communication and coordination. One person working alone always knows what they're doing, and two people can communicate with each other, but as soon as three people are working together, it becomes possible to leave someone out.

When three or more people are actively working on an incident, someone should act as an incident coordinator, who SHOULD NOT[1] directly undertake technical measures to help resolve the incident; rather, their responsibility is to coordinate the work of others, ask questions, document what is being done, communicate status with other teams and staff, and ensure that the right people are communicating externally with the community / with the world.

When an automated page fires, many SREs will respond to it and descend on the incident simultaneously, and the response will need to be coordinated, even if the issue turns out to be straightforward. Thus, there should always be an incident coordinator when a page goes off, unless it's resolved right away, or it's immediately clear that it's a false alarm.

How to pick an incident coordinator

Anyone in SRE can step up and volunteer. But if you are an expert in something involved in the outage, it is likely better to let someone else do it – the coordinator can point out rabbit holes for others to investigate, but should not be the one diving into rabbit holes themselves.

In the event that a major incident is ongoing and no one volunteers, the duty falls upon SRE directorship to either perform the duty themselves, or volunteer someone else. In the event of an incident spanning multiple days, SRE directorship should help ensure that the workload of incident coordination is distributed across the team.

When possible, it’s preferable to favor an incident coordinator who is at the start of their day.

In an incident longer than a few hours, the title of coordinator SHOULD be handed off between people. There MUST always be an active coordinator until the incident is sufficiently mitigated to no longer require a large response team.

How to act as an incident coordinator

In general: Stay in touch with everyone who is working on response, and make sure they are actively reporting what they are doing on IRC.

If you think something needs to be done, or needs investigation, ask someone to do it. Don't be afraid to escalate to better-suited parties via phone, or ask others to make calls.

To those ends, a primary responsibility of the IC is to maintain the status document.

Maintaining the status document

Part of the role of coordinator is constructing and maintaining a status document that first and foremost communicates how our users, infrastructure, and services are currently affected, the name of the current incident coordinator, and a last-updated timestamp for those items. These items should be updated at least every half hour, and ideally whenever something major is discovered or fixed. Don't be afraid to nag those who are working on technical measures; this is a useful forcing function for deciding that one avenue of investigation is proving fruitless, or that it's time to come up with other ideas to stem the bleeding.

Beneath those items, the status document then contains the beginnings of a postmortem report. This means a timeline of the outage and remedial work performed (which at this stage is often just copy-and-pasted lines from IRC), any relevant log data or graphs, and a first pass at follow-up action items to be addressed later. It’s important to make sure that all links to dashboards have a fixed timespan set, to ensure they will show the same data when opened later on.

It's not just the coordinator's job to keep the document up-to-date; others working on the incident should help document.

In the case of a sensitive incident this document will live on Google Docs inside our team Drive. Even in the case of non-sensitive incidents, the status document will still be a Google Doc, as it may contain logs that are themselves sensitive/PII. (Some time after resolution, it will be cleaned up and become public at incident documentation.)

In either case, the incident coordinator should set the topic of the relevant IRC channel (#mediawiki_security or #wikimedia-sre) to include a very brief description of the incident, along with a link to the document. Example:

ONGOING INCIDENT: esams unreachable (power outage?):

Communicating with the rest of the organisation

Besides diagnosing and fixing the problem as soon as possible (which is the highest priority), it's very important that for any outages that impact many users, a notification of the outage is sent to other parties within the organization, within the first 5 minutes. At that point, the Communications team, the Community teams and management may start receiving (press/phone/e-mail/social media) inquiries that need to be answered. Although accurate and complete information can be scarce in the early stages of some types of outages, at the very minimum a notification of the ongoing outage should be sent out, along with a brief indication of scope & impact where known. Technical details are not yet important at this point, and can change (drastically) as more information becomes available. Focus on what is known, and how it impacts (which) users. Keep it brief & quick, allowing you to focus on the investigation & fix.

More details on when and how to involve other teams in WMF is on officewiki, since it includes staff members' contact information.

When to notify

This process of notification should be done for severe outages that affect a significant number of people, and need the Communications and Community teams and management to be aware. Examples of this are outages that affect the majority of site users (e.g. all wikis down), the majority of contributors (e.g. editing broken) or a big security breach that needs to be dealt with immediately. Smaller incidents that impact a rather limited number of people or just a small subset of site functionality may not warrant paging other teams. There are no black & white rules for this; in the end it's a matter of judgement whether other teams in the organisation need to be aware and assist with followup. If in doubt, err on the side of yes.

After 15 minutes, if the outage is still ongoing, new update(s) should be sent with additional details, ideally including an ETA for a fix if available.

When service has been restored, a final update should be sent along with a brief description. More technical details will be provided later in the form of an incident report.


Escalating

You have to be able to recognise when the problem is beyond your ability (or the ability of those people so far assembled) to fix alone.

Some issues require a lot of work to fix. For example, it takes a lot of work to recover from a power outage in a data center. In such a case, it makes sense to get everyone online from the outset.

Some issues require special expertise. For example, database crashes need Jaime to be online. Network failures need Faidon or Mark to be online.

Ask anyone online to assist with paging people, as it's typically the fastest way. If no one is available quickly, call Mark or Faidon to help with this.

If the site has been down for 15 minutes or more, it is time to stop working on the technical issues and to get some perspective. If a small team can't get the site back up in this amount of time, it has failed, and it is time to wake more people up, by calling them. Call, don't just text them, so you know whether they've received it or not. Primary phone numbers can be found on the private Office Wiki's Contact list, or, if that is down, in the Icinga configuration for paging in Puppet.


Diagnose

Diagnosis should always start by observing the symptoms.

Ganglia is by far the most useful and important diagnosis tool. Interpreting it is complex but essential. Request rate statistics (e.g. reqstats) are useful to get a feel for the scale of the problem, and to confirm that the user reports are representative and not just confined to a few vocal users. Viewing the site itself is the least useful diagnosis tool, and can often be left out if the user reports are clear and trustworthy.

Shell-based tools such as MySQL "show processlist", strace, tcpdump, etc. are useful for providing more detail than Ganglia. However, they are potential time-wasters. Unfamiliarity with the ordinary output of these tools can lead to misdiagnosis. Complex chains of cause and effect can lead responders on a wild goose chase, especially when they are unfamiliar with the system.

Failure modes

Fast fail

Requests fail quickly. Backend resource utilisation drops by a large factor. Frontend request rate typically drops slightly, due to people going away when they see the error message, instead of following links and generating more requests. Frontend network should drop significantly if the error messages are smaller than the average request size.

Example causes:

  • Someone pushes out a PHP source file with a syntax error in it
  • An essential TCP service fails with an immediate "connection refused"


Slow fail (overload)

This is the most common cause of downtime. Overload occurs when the demand for a resource outstrips the supply. The queue length increases at a rate given by the difference between the demand rate and the supply rate.

The growth of the queue length in this situation is limited by two things:

  • Client disconnections. The client may give up waiting and voluntarily leave the queue.
  • Queue size limits. Once the queue reaches some size, something will happen that stops it from growing further. Ideally, this will be a rapidly-served error message. In the worst case, the limit is when the server runs out of memory and crashes.

As long as the server does not have some pathology at high queue sizes (such as swapping), it is normal for some percentage of requests to be properly served during an overload. However, if queue growth is limited by timeouts, the FIFO nature of a queue means that service times will be very long, approximately equal to the average timeout.
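The arithmetic above can be sketched as a quick back-of-envelope calculation. All the rates and the timeout below are invented for illustration; the point is the shape of the estimate, not the numbers:

```shell
# Back-of-envelope overload arithmetic (all figures invented).
demand=1500   # incoming requests per second
supply=1000   # requests the overloaded backend can actually serve per second
timeout=60    # client timeout in seconds

# The queue grows at the difference between demand and supply.
growth=$((demand - supply))
echo "queue grows by ${growth} req/s"

# With FIFO service and clients abandoning after the timeout, the queue
# settles at roughly one timeout window's worth of arrivals, and requests
# that do get served wait approximately the full timeout.
echo "steady-state depth ~ $((demand * timeout)) requests, service time ~ ${timeout}s"
```

This is why "the site is slow" during an overload really means "everything served takes about one timeout": the FIFO queue delays every request by nearly the abandonment threshold.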

There are two kinds of overload causes:

Increase in demand
For example: news event, JavaScript code change, accidental DoS due to an individual running expensive requests, deliberate DoS.
Reduction in supply
For example: code change causing normal requests to become more expensive, hardware failure, daemon crash and restart, cache clear.

It can be difficult to distinguish between these two kinds of overload.

Note that for whatever reason, successful, deliberate DoS is extremely rare at Wikimedia. If you start with an assumption that the problem is due to stupidity, not malice, you're more likely to find a rapid and successful solution.

Common overload categories

Somewhere in the system, a resource has been exhausted. Problems will extend from the root cause, up through the stack to the user. Low utilisation will extend down through the stack to unrelated services.

For example, if MySQL is slow:

  • Looking up the stack, we will see overloads in MySQL and Apache, and error messages generated in Varnish.
  • Looking down the stack, the overload in Apache will cause a large drop in utilisation of unrelated services such as search.

Varnish connection count

For the Varnish/nginx cache pool, the client is the browser, and disconnections occur both due to humans pressing the "stop" button, and due to automated timeouts. It's rare for any queue size limit to be reached in Varnish, since queue slots are fairly cheap. Varnish's client-side timeouts tend to prevent the queue from becoming too large.

Apache process count

For the Apache pool, the client is Varnish. Varnish typically times out and disconnects after 60 seconds, then it begins serving HTTP 503 responses. However, when Varnish disconnects, the PHP process is not destroyed (ignore_user_abort=true). This helps to maintain database consistency, but the tradeoff is that the apache process pool can become very large, and often requires manual intervention to reset it back to a reasonable size.

An apache process pool overload can easily be detected by looking at the total process count in ganglia.
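The same check can be done from a shell on any apache server. This is a minimal sketch assuming GNU ps; the 256 threshold is an invented figure for illustration, not a real pool limit:

```shell
# Count running apache2 processes and compare against a ceiling.
# NOTE: the threshold of 256 is an invented example value, not a real
# MaxClients setting; compare against your actual configured pool size.
count=$(ps -C apache2 --no-headers 2>/dev/null | wc -l)
threshold=256
if [ "$count" -gt "$threshold" ]; then
  echo "possible process pool overload: ${count} apache processes"
else
  echo "process count ok: ${count}"
fi
```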

Regardless of the root cause, an apache process pool overload should be dealt with by regularly restarting the apache processes using /home/wikipedia/bin/apache-restart-all. In an overload situation, the bulk of the process pool is taken up with long-running requests, so restarting kills more long-running requests than short requests. Regular restarting of apache allows parts of the site which are still fast to continue working.

Regular restarting is somewhat detrimental to database consistency, but the effects of this are relatively minor compared to the site being completely down.

There are two possible reasons for an apache process pool overload:

  • Some resource on the apache server itself has been exhausted, usually CPU.
  • Apache is acting as a client for some backend, and that backend is failing in a slow way.

Apache CPU

If CPU usage on most servers is above 90%, and CPU usage has plateaued (i.e. it has stopped bouncing up and down due to random variations in demand), then you can assume that the problem is an apache CPU overload. Otherwise, the problem is more likely with one of the many remote services that MediaWiki depends on.
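The "plateaued" judgement can be approximated with a toy filter over recent CPU samples. The sample values and both thresholds below are invented; in practice you would eyeball the Ganglia graph rather than script this:

```shell
# Toy plateau check over five invented one-minute CPU samples (percent).
printf '%s\n' 93 95 94 92 96 |
awk 'NR == 1 { min = $1; max = $1 }
     { if ($1 < min) min = $1; if ($1 > max) max = $1 }
     END {
       # "Plateaued" here means: every sample above 90%, and a narrow spread
       # (no big bounces). Both cutoffs are illustrative, not standard values.
       if (min > 90 && max - min < 10)
         print "plateaued: likely CPU overload"
       else
         print "still bouncing: look at backend services instead"
     }'
```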

CPU profiling can be useful to identify the causes of CPU usage, in cases where the relevant profiling section terminates successfully, instead of ending with a timeout or other fatal error. Run /home/wikipedia/bin/clear-profile to reset the counters.

Note that recursive functions such as PPFrame_DOM::expand() are counted multiple times, roughly as many times as the average stack depth, so the numbers for those functions need to be interpreted with caution. Parser::parse() is typically non-recursive, and gives an upper limit for the CPU usage of recursive parser functions.

In cases of severe overload, or other cases where profiling is not useful, it is possible to identify the source of high CPU usage by randomly attaching to apache processes.

All our apache servers should have PHP debug symbols installed. Our custom PHP packages have stripping disabled. So just log in to a random apache, and run top. Pick the first process that seems to be using CPU, and run gdb -p PID to attach to it. Then run bt to get a backtrace. Here's the bottom of a typical backtrace:

#16 0x00007fa1230e85da in php_execute_script (primary_file=0x7fff8387eb10)
    at /tmp/buildd/php5-5.2.4/main/main.c:2003
#17 0x00007fa1231b19e4 in php_handler (r=0x13dd838)
    at /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c:650
#18 0x0000000000437d9a in ap_run_handler ()
#19 0x000000000043b1bc in ap_invoke_handler ()
#20 0x00000000004478ce in ap_process_request ()
#21 0x0000000000444cc8 in ?? ()
#22 0x000000000043eef2 in ap_run_process_connection ()
#23 0x000000000044b6c5 in ?? ()
#24 0x000000000044b975 in ?? ()
#25 0x000000000044c208 in ap_mpm_run ()
#26 0x0000000000425a44 in main ()

The "r" parameter to php_handler has the URL in it, which is extremely useful information. So switch to the relevant frame and print it out:

(gdb) frame 17
#17 0x00007fa1231b19e4 in php_handler (r=0x13dd838)
    at /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c:650
650	/tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c: No such file or directory.
	in /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c
(gdb) print r->hostname
$2 = 0x13df198 ""
(gdb) print r->unparsed_uri
$3 = 0x13dee48 "/wiki/%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EA%B8%B0%EC%97%AC/%EB%A7%98%EB%A7%88"

At the other end of the stack, there is information about what is going on in PHP. See GDB with PHP for some information about using it.

An extension to this idea of profiling by randomly attaching to processes in gdb is Domas's Poor Man's Profiler.

Slow backend service

If the site is down because a service that MediaWiki depends on has become slow, there are a number of tools that can help to identify the service:

  • High load or resource utilisation at the root cause server may be obvious at a glance on ganglia.
  • Run /home/wikipedia/bin/clear-profile, and then observe the highest users of real time.
  • strace: this is useful for the most severe overloads. Log in to a random apache. Run ps -C apache2 -l and pick a process with a suspicious-looking WCHAN. Run lsof -p PID, then attach to it with strace -p PID. With luck (and perhaps some repetition), this will tell you which FD apache is waiting on. Using the lsof output, you can identify the corresponding remote service.

MySQL overload

MySQL overload can often be detected from Ganglia, by looking for an increase in load, or for anomalies in network usage and CPU utilisation.

Slow queries on MySQL typically lead to exhaustion of disk I/O resources. Fast, numerous queries may lead to overload via high CPU and lock contention.

Slow queries can be identified by running SHOW PROCESSLIST. If slow queries are identified as the source of site downtime, the immediate response should be to kill them. To do this, a shell/awk one-liner typically suffices, such as:

mysql -h $server -e 'show processlist' | awk '$0 ~ /...CRITERIA.../ {print "kill", $1, ";"}' | mysql -h $server
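Before pointing a pipeline like this at a live server, the awk stage can be dry-run against faked processlist lines. Everything below is invented (the IDs, hosts, and the "slowmodule" marker standing in for CRITERIA); a real run pipes actual mysql output:

```shell
# Dry run of the kill-statement generator using faked processlist lines.
# "slowmodule", the thread IDs, and the hosts are all invented for this
# illustration; substitute your real matching criteria before using it.
printf '%s\n' \
  '101 wikiuser db1051 enwiki Query 845 Sending data SELECT /* slowmodule */' \
  '102 wikiuser db1051 enwiki Sleep 0' \
  '103 wikiuser db1051 enwiki Query 912 Sending data SELECT /* slowmodule */' |
awk '$0 ~ /slowmodule/ {print "kill", $1, ";"}'
```

This prints one `kill <id> ;` statement per matching thread (the first field of each processlist line is the thread ID), which is exactly what gets piped back into mysql in the real one-liner. Checking the generated statements before executing them is cheap insurance against killing the wrong queries.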

Once this is done and the site is back up, a secondary response can be considered, such as a temporary fix in MediaWiki by disabling the relevant module.

Monitoring of the number of running slow queries can be done with a related shell one liner:

mysql -h $server -e 'show processlist' | grep CRITERIA | wc -l

Important: disabling the source of slow queries in MediaWiki will typically not bring the site back up, if a large number of slow queries are queued in MySQL. That's one of the reasons why it's so important to kill first and patch second. Patching stops new queries from starting, it doesn't stop old queries from running.


Fix

Get your priorities straight. While the site is down, your priority is to get it back up. Do not let curiosity or a desire for a complete and elegant solution distract you from doing this as quickly as possible.

Analysis of root causes can be done after the site is back up, based on logs. If you can't do it using the logs after the fact, then the logs aren't good enough and you should improve them for next time.


Post-mortem

It's often overlooked that our server admin log is on a wiki. A nice way to start a postmortem is to add server admin log entries that were omitted at the time. Once you've reconstructed the order of events, with precise times attached, you can start looking at logs.

It's sometimes useful to test your theories about the root causes of downtime. If your theory about the root cause is incorrect, it means that the real root cause is still out there, waiting to cause more downtime. So there is a strong incentive to be rigorous.


1. ^ You SHOULD NOT, but, see also RFC 6919 section 1 for an insightful parenthetical.
