User:BCornwall/Incident action items risk assessment

From Wikitech

Wikimedia must maintain control of, and visibility into, open and closed incident follow-ups to reduce the risk of open items going unaddressed. Risk scoring for unaddressed incident follow-up items and periodic risk review are part of the incident review ritual.

This will encourage us to:

  • Follow up on forgotten action items
  • Raise concerns over action item neglect and the damage that may result
  • Promote accountability in ownership of action items
  • Assign previously-unclaimed action items

Risk factors

Severity
The amount of damage caused to systems if the item is not addressed.
  1. Marginal - Risks may cause minor damage but little overall effect
    • Minor performance/error concerns that would remain within SLA/budget
    • Non-user-impacting degradation of service
    • Negligible impact on systems (e.g. unhelpful log messages)
  2. Serious - Risks may cause major damage
    • User-facing service degradation/outages
    • Major performance/error concerns that exceed budgets
  3. Catastrophic - Risks will cause extensive damage and long-term effects to systems
    • Extended user-facing service outage
    • Systems breach
    • Leak of sensitive data
Probability
The likelihood that the related incident could occur again if the item is not addressed.
  1. Possible - Not expected to occur
  2. Probable - May occur
  3. Certain - Expected to occur eventually

Risk assessment systems often have around five rankings of Severity/Probability; however, to limit subjective variance in scoring (see Limitations of risk matrices below), we use three. Three rankings give us the flexibility to prioritize action items without getting lost in semantics or differences of opinion.

Risk matrix

Probability \ Severity   Marginal   Serious   Catastrophic
Certain                  Medium     High      Unbreak Now!
Probable                 Low        Medium    High
Possible                 Low        Low       Medium
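
The matrix can be read as a simple lookup from a (Probability, Severity) pair to a rating. The following is a minimal sketch of that lookup, not part of any existing Wikimedia tooling; the names Severity, Probability, RISK_MATRIX and risk_rating are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity rankings as numbered in the Severity list (hypothetical names)."""
    MARGINAL = 1
    SERIOUS = 2
    CATASTROPHIC = 3


class Probability(IntEnum):
    """Probability rankings as numbered in the Probability list (hypothetical names)."""
    POSSIBLE = 1
    PROBABLE = 2
    CERTAIN = 3


# One row per probability; columns are ordered Marginal, Serious, Catastrophic,
# copied from the risk matrix on this page.
RISK_MATRIX = {
    Probability.CERTAIN: ("Medium", "High", "Unbreak Now!"),
    Probability.PROBABLE: ("Low", "Medium", "High"),
    Probability.POSSIBLE: ("Low", "Low", "Medium"),
}


def risk_rating(probability: Probability, severity: Severity) -> str:
    """Return the matrix rating for a single action item."""
    # Severity values start at 1, tuple indices start at 0.
    return RISK_MATRIX[probability][severity - 1]
```

For example, risk_rating(Probability.PROBABLE, Severity.CATASTROPHIC) returns "High". Keeping the table as data rather than as arithmetic on the two scores means the ratings can be adjusted cell by cell if the matrix is ever revised.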

Limitations of risk matrices

From "What's wrong with risk matrices?" by Louis Anthony Cox Jr.:

Categorizations of severity cannot be made objectively for uncertain consequences. Inputs to risk matrices (e.g., frequency and severity categorizations) and resulting outputs (i.e., risk ratings) require subjective interpretation, and different users may obtain opposite ratings of the same quantitative risks. These limitations suggest that risk matrices should be used with caution, and only with careful explanations of embedded judgments.

Risk scoring matrices are a tool for somewhat-standardized evaluation of at-risk action items, but the resulting ratings still require scrutiny when deciding how to prioritize an engineer's time.