Incident response/Incident severities
To convey the impact of an incident, SRE assigns each incident a severity level. SRE uses three severity levels, each prefixed with "Sev": Sev1 is the most severe and Sev3 the least severe. Each level reflects a different degree of impact and warrants a different level of response.
When an incident document is opened, the IC should attempt to assign a severity level soon after filling in other basic details. The severity level of an incident can change over time, either as a result of mitigation or as a result of reassessment of the incident as more information emerges.
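The idea that a severity is assigned early and may later be raised or lowered can be sketched as a small data structure. This is only an illustrative sketch, not an existing SRE tool: the names `Severity`, `Incident`, and `reassess`, and the example incident, are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching Sev1..Sev3."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Incident:
    """Hypothetical incident record; not part of any real tooling."""
    title: str
    severity: Severity
    # Each entry is (timestamp, severity, reason), so changes stay auditable.
    history: list = field(default_factory=list)

    def __post_init__(self):
        self._record(self.severity, "initial assessment")

    def _record(self, severity, reason):
        self.history.append((datetime.now(timezone.utc), severity, reason))

    def reassess(self, severity, reason):
        """Change severity after mitigation or when new information emerges."""
        self.severity = severity
        self._record(severity, reason)


# Made-up example: an incident escalated on new information, then lowered
# after mitigation.
incident = Incident("Elevated 5xx rate", Severity.SEV2)
incident.reassess(Severity.SEV1, "new information: user edits are being lost")
incident.reassess(Severity.SEV3, "mitigated: residual errors on one service")
```

Keeping the full history, rather than only the current level, preserves the reasoning behind each change for later review.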
Sev1
Impact
- Data is being lost (user edits, uploads)
- Stored data is completely unavailable or corrupted
- All wikis are down for all users
- The active write datacentre is down
- All wikis are unintentionally read-only
- All wikis are inaccessible to users
- Security compromise of or unauthorised access to critical infrastructure
- Sustained high-impact abuse on-wiki by malicious actors that requires a coordinated response
- Widespread data or privacy breach
Examples
- A widespread traffic event that makes the wikis unavailable
- A primary MariaDB master is down in a core section
- A severe corruption event has occurred that requires restoring from a backup
- An attacker is somehow executing code on MediaWiki containers
- A significant increase in save failures
- A significant percentage of requests are returning 5xx errors
- An attacker has gained unauthorised access to user data through an exploit
Response
- An incident document is created
- An incident coordinator (IC) is assigned
- Wikimedia Status is updated unless this is unsuitable.
- If not yet involved, on-call SREs should be made aware. Depending on working hours and similar constraints, all available SREs (and, if required, engineers from outside of SRE) should check in to coordination channels to contribute.
- Awareness should be spread to those working at VP or director level to help coordinate a cross-foundation response.
- The IC should communicate outside of SRE if need be, but ideally communications for an incident of this severity should be supported by responders outside of SRE.
- If the page occurs out of hours, the SRE managers rotation should be paged as soon as possible
Sev2
Impact
- The wikis are down for a subset of users (all users served by a specific caching POP, or all sites in a database section)
- Wiki request volume has dropped below an acceptable level
- Wiki latency has increased above an acceptable level
- Wiki error rate has increased above an acceptable level
- Urgent response required to major security vulnerability
Examples
- An individual caching POP is saturating and causing slowdowns or failures for all or most users in a region
- A zero-day vulnerability has been disclosed for a component that receives user traffic, requiring a speedy response
- We are serving an alert-worthy level of 500 errors for a sustained period of time
- Scrapers are significantly impairing the performance of APIs or public-facing endpoints
- A consistent increase in save failures
Response
- An incident document is created
- An incident coordinator is assigned
- Wikimedia Status is updated (if suitable)
- If not yet involved, on-call SREs should be made aware
- Anyone available to help should be made aware of the situation in case they are needed
- At the discretion of responding ICs and in the case of a long-running or complex Sev2, the additional communications process outlined in Sev1 can be invoked
Sev3
Impact
- An individual service has a sustained and increased error rate, or is impairing individual features over a significant period of time
- An internal service is in a degraded state that heightens risk
- Other minor incidents that warrant an incident process
Examples
- Thumbnailling of some images is failing
- Replication via etcd mirroring is failing
- Wikifeeds has an elevated error rate
- External factors causing aberrant traffic patterns that have the potential to become impactful
Response
- An incident document is created if necessary
- If a document is created, an incident coordinator is assigned
- Wikimedia Status is updated if needed
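The response lists for the three levels can be summarised as a lookup table, which may be useful for something like a checklist generator. The sketch below is hypothetical: `RESPONSE_STEPS` and `response_checklist` are illustrative names, and the step wording paraphrases the lists above rather than quoting any official tooling.

```python
# Hypothetical mapping from severity label to paraphrased response steps.
RESPONSE_STEPS = {
    "Sev1": [
        "Create an incident document",
        "Assign an incident coordinator (IC)",
        "Update Wikimedia Status unless unsuitable",
        "Make on-call SREs aware; available SREs check in to coordination channels",
        "Spread awareness to VP or director level",
        "Communicate outside of SRE as needed",
        "If out of hours, page the SRE managers rotation",
    ],
    "Sev2": [
        "Create an incident document",
        "Assign an incident coordinator (IC)",
        "Update Wikimedia Status if suitable",
        "Make on-call SREs aware",
        "Make available helpers aware in case they are needed",
        "Optionally invoke the Sev1 communications process (IC discretion)",
    ],
    "Sev3": [
        "Create an incident document if necessary",
        "Assign an incident coordinator (IC)",
        "Update Wikimedia Status if needed",
    ],
}


def response_checklist(severity: str) -> list[str]:
    """Return a copy of the steps for a severity label such as 'Sev2'."""
    if severity not in RESPONSE_STEPS:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(RESPONSE_STEPS[severity])


for step in response_checklist("Sev3"):
    print(f"[ ] {step}")
```

Returning a copy keeps callers from accidentally mutating the shared table when they tick off or reorder steps.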