Incident response/Process improvement/Definition of an incident
Definitions
Alert
An alert is any notification from Icinga or another automated monitoring tool. It may simply be reported on IRC, or it may file a ticket in Phabricator. It might also be a page.
The word alarm is usually synonymous with alert in this context; to avoid confusion, it should not be used.
Page
A page is an alert deemed serious enough for the automation to potentially wake up humans (currently implemented as sending SMS).
Incident
An incident is an outage, security issue, or other operational issue whose severity demands a human response. Every incident MUST have a follow-up incident report.
Incidents will not always begin with a page, but a page MUST always open an incident.
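The relationship between alerts, pages, and incidents can be sketched in code. This is only an illustration of the definitions above, not a description of any existing tooling: the names `Channel`, `Alert`, `Incident`, and `handle_alert` are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Channel(Enum):
    """How an alert is delivered (hypothetical names for this sketch)."""
    IRC = "irc"
    TICKET = "ticket"
    PAGE = "page"


@dataclass
class Alert:
    source: str       # the monitoring tool that raised it, e.g. "icinga"
    summary: str
    channel: Channel

    @property
    def is_page(self) -> bool:
        # A page is an alert serious enough to potentially wake up humans.
        return self.channel is Channel.PAGE


@dataclass
class Incident:
    summary: str
    opened_by: str              # a monitoring tool or a person
    report_filed: bool = False  # every incident MUST get a follow-up report


def handle_alert(alert: Alert, incidents: List[Incident]) -> Optional[Incident]:
    """Open an incident for every page; other alerts open nothing by themselves."""
    if alert.is_page:
        incident = Incident(summary=alert.summary, opened_by=alert.source)
        incidents.append(incident)
        return incident
    return None
```

The rule only runs in one direction: a person can still append an `Incident` to the list directly, because incidents do not always begin with a page.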
Major incident
A major incident is an incident so severe that it demands "all hands on deck": a DoS attack, attempts to hack into production, wikis being inaccessible in Europe, etc.
An ongoing major incident MUST have an incident coordinator.
Major incidents MUST only ever be declared by a human, and are considered ongoing until declared closed by the coordinator.
Incident Coordinator
As incidents scale in severity and in size of response, the most important concern becomes communication and coordination. This is the responsibility of the incident coordinator, who SHOULD NOT directly undertake technical measures to help resolve the incident. Instead, the coordinator's job is to coordinate the work of others, ask questions, document what is being done, communicate status to other teams and staff, and ensure that the right people are communicating externally with the community and with the world.
In an incident longer than a few hours, the title of coordinator SHOULD be handed off between people. There MUST always be an active coordinator until the incident is sufficiently mitigated to no longer require a large response team.
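The coordinator rules for a major incident can be read as a small state machine. The following is a minimal sketch under stated assumptions: `MajorIncident`, `declare_major_incident`, and the `declared_by_human` flag are hypothetical names invented for this example, and people are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MajorIncident:
    summary: str
    coordinator: str  # there MUST always be an active coordinator
    past_coordinators: List[str] = field(default_factory=list)
    closed: bool = False

    def hand_off(self, new_coordinator: str) -> None:
        """Pass the coordinator role on, e.g. after a few hours."""
        if self.closed:
            raise ValueError("incident is already closed")
        if not new_coordinator:
            raise ValueError("an ongoing major incident must have a coordinator")
        self.past_coordinators.append(self.coordinator)
        self.coordinator = new_coordinator

    def close(self, closed_by: str) -> None:
        """The incident stays ongoing until the current coordinator closes it."""
        if closed_by != self.coordinator:
            raise PermissionError("only the current coordinator may close the incident")
        self.closed = True


def declare_major_incident(
    summary: str, declared_by_human: bool, coordinator: str
) -> MajorIncident:
    """Major incidents are only ever declared by a human, never by automation."""
    if not declared_by_human:
        raise PermissionError("automation must not declare a major incident")
    if not coordinator:
        raise ValueError("a major incident must have a coordinator")
    return MajorIncident(summary=summary, coordinator=coordinator)
```

For example, `declare_major_incident("wikis inaccessible in Europe", True, "alice")` followed by `hand_off("bob")` leaves `bob` as the only person who can call `close`.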
(TODO: link to incident severity list)
Guidelines
Anyone on the SRE team, or anyone otherwise involved with the operations of WMF sites (e.g. Security team, Performance team, MediaWiki deployers, etc.), may declare an incident if they believe they have discovered a problem severe enough to merit one. They should then pull in others to assist in responding, including manager(s).