Incident response/Incident severities

From Wikitech

In order to understand the impact of an incident, SRE classifies incidents with a severity level. There are three severity levels in use by SRE, all prefixed by "Sev", with 1 being the most severe and 3 the least severe. Each level indicates a different degree of impact and a correspondingly different level of warranted response.

When an incident document is opened, the incident coordinator (IC) should attempt to assign a severity level soon after filling in other basic details. The severity level of an incident can change over time, either as a result of mitigation or as a result of reassessment as more information emerges.
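The idea that a severity level is assigned early and may later change can be sketched in code. This is only an illustrative model, not a description of any tooling SRE uses: the `Severity` enum, the `Incident` class, and their methods are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional, Tuple


class Severity(IntEnum):
    """Hypothetical model: a lower number is more severe (Sev1 > Sev2 > Sev3)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3

    def is_more_severe_than(self, other: "Severity") -> bool:
        return self.value < other.value


@dataclass
class Incident:
    """Hypothetical incident record that keeps every severity assessment."""
    title: str
    # Each entry is (severity, reason); the last entry is the current level.
    history: List[Tuple[Severity, str]] = field(default_factory=list)

    def set_severity(self, severity: Severity, reason: str) -> None:
        # Record the change instead of overwriting, so reassessments stay visible.
        self.history.append((severity, reason))

    @property
    def severity(self) -> Optional[Severity]:
        return self.history[-1][0] if self.history else None


incident = Incident("Elevated 5xx rate")
incident.set_severity(Severity.SEV2, "initial assessment")
incident.set_severity(Severity.SEV1, "reassessment: more information emerged")
incident.set_severity(Severity.SEV3, "mitigation applied")
```

Keeping a history rather than a single field reflects the two reasons given above for a change in severity: mitigation and reassessment.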

Sev1

Impact

  • Data is being lost (user edits, uploads)
  • Stored data is completely unavailable or corrupted
  • All wikis are down for all users
  • The active write datacentre is down
  • All wikis are unintentionally read-only
  • All wikis are inaccessible to users
  • Security compromise of or unauthorised access to critical infrastructure
  • Sustained high-impact abuse on-wiki by malicious actors that requires a coordinated response
  • Widespread data or privacy breach

Examples

  • A widespread traffic event that makes the wikis unavailable
  • A primary MariaDB master is down in a core section
  • A severe corruption event has occurred that requires restoring from a backup
  • An attacker is somehow executing code on MediaWiki containers
  • A significant increase in save failures
  • A significant percentage of requests are returning 5xx errors
  • An attacker has gained unauthorised access to user data through an exploit

Response

  • An incident document is created
  • An incident coordinator (IC) is assigned
  • Wikimedia Status is updated unless this is unsuitable.
  • If not yet involved, on-call SREs should be made aware. Depending on working hours and similar factors, all available SREs (and, if required, engineers from outside of SRE) should check in to coordination channels to contribute.
  • Awareness should be spread to those working at VP or director level to help coordinate a cross-foundation response.
  • The IC should communicate outside of SRE if need be, but ideally the response to an incident of this severity should already involve people outside of SRE.
  • If the page is out-of-hours, the SRE managers rotation should be paged as soon as possible

Sev2

Impact

  • The wikis are down for a subset of users (all users served by a specific caching POP, or all sites in a database section)
  • Wiki request volume has dropped below an acceptable level
  • Wiki latency has increased above an acceptable level
  • Wiki error rate has increased above an acceptable level
  • An urgent response is required to a major security vulnerability

Examples

  • An individual caching POP is saturating and causing slowdowns or failures for all or most users in a region
  • A zero-day vulnerability has been disclosed in a component that receives user traffic, requiring a speedy response
  • We are serving an alert-worthy level of 500 errors for a sustained period of time
  • Scrapers are significantly impairing performance of APIs or public-facing endpoints
  • A consistent increase in save failures

Response

  • An incident document is created
  • An incident coordinator is assigned
  • Wikimedia Status is updated (if suitable)
  • If not yet involved, on-call SREs should be made aware
  • Any available help should be made aware of the situation in case it is required.
  • At the discretion of responding ICs and in the case of a long-running or complex Sev2, the additional communications process outlined in Sev1 can be invoked

Sev3

Impact

  • An individual service has a sustained and increased error rate, or is impairing individual features over a significant period of time
  • An internal service is in a degraded state that heightens risk
  • Other minor incidents that warrant an incident process

Examples

  • Thumbnailling of some images is failing
  • Replication via etcd mirroring is failing
  • Wikifeeds has an elevated error rate
  • External factors are causing aberrant traffic patterns that have the potential to become impactful

Response

  • An incident document is created if necessary
  • An incident coordinator is assigned
  • If needed, Wikimedia Status is updated.
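The Response lists for the three levels can be summarised as a lookup from severity to checklist. The following is a minimal sketch under stated assumptions: the step wording paraphrases the lists above, while `RESPONSE_STEPS`, `OUT_OF_HOURS_SEV1_STEP`, and `response_steps` are hypothetical names that do not correspond to any existing SRE tool.

```python
from typing import Dict, List

# Step text paraphrases the Response lists on this page; the structure is an
# illustration only.
RESPONSE_STEPS: Dict[int, List[str]] = {
    1: [
        "Create an incident document",
        "Assign an incident coordinator (IC)",
        "Update Wikimedia Status unless unsuitable",
        "Make on-call SREs aware; available SREs check in to coordination channels",
        "Spread awareness to VP or director level",
        "IC communicates outside of SRE if need be",
    ],
    2: [
        "Create an incident document",
        "Assign an incident coordinator",
        "Update Wikimedia Status if suitable",
        "Make on-call SREs aware",
        "Make any available help aware of the situation",
        "Optionally invoke the Sev1 communications process (IC discretion)",
    ],
    3: [
        "Create an incident document if necessary",
        "Assign an incident coordinator",
        "Update Wikimedia Status if needed",
    ],
}

# Only Sev1 lists an extra step for pages that arrive out-of-hours.
OUT_OF_HOURS_SEV1_STEP = "Page the SRE managers rotation as soon as possible"


def response_steps(sev: int, out_of_hours: bool = False) -> List[str]:
    """Return a copy of the checklist for the given severity number (1 to 3)."""
    if sev not in RESPONSE_STEPS:
        raise ValueError(f"unknown severity: Sev{sev}")
    steps = list(RESPONSE_STEPS[sev])
    if sev == 1 and out_of_hours:
        steps.append(OUT_OF_HOURS_SEV1_STEP)
    return steps


for step in response_steps(1, out_of_hours=True):
    print("-", step)
```

Returning a copy keeps the shared table unchanged when a caller appends or removes steps for a particular incident.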