Wikimedia Cloud Services team/EnhancementProposals/Decision record T310598 Team oncall alerting schedules and processes

Date of the decision: 2022-06-29

People in the decision meeting (alphabetical order):

Decision taken

Option 1 was chosen, using the 3:00 UTC to 15:00 UTC time bands. It was also decided to move Bryan to the last-resort rotation only, and to create shadow rotations so that newcomers can learn and start helping with the oncall.

Problem

During periods of infrastructure instability, the current oncall schedules and processes, which optimize for breadth of attention, end up disturbing many people without giving those disturbed a clear course of action, relying instead on informal communication and channels.

Constraints and risks

We should try to wake up people as little as possible
We should try to interrupt people as little as possible

Options

Option 1

Create an "oncall duty" rotating role, where one team member for each time zone has the following responsibilities during their rotation:

they are the first to get paged
their main responsibility is to deal with interruptions (pages, alerts, broken things...)
their second responsibility is to improve the oncall/alerting setup (improve alerts, improve runbooks, cookbooks, add stability features to the system, ...)

This could be paired with the current "clinic duty".

Proposed schedule and zones:

Zone1: From 15:00 UTC to 3:00 UTC

Members: Andrew, Nicholas?, RooK

Zone2: From 3:00 UTC to 15:00 UTC

Members: Arturo, David

Rotating every week on Wednesday (team meeting day).

Alert duty gets the first page; 10 min after, the rest of the Zone gets paged; 10 min after that, the other Zone gets paged.
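
The zones and escalation order above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the `on_duty` mapping are hypothetical, the member lists are copied from the proposal (Nicholas is listed there with a question mark), and the real paging is done by the paging service, not by code like this.

```python
from datetime import datetime, timezone

# Members per zone, as listed in the proposal (Nicholas is tentative there).
ZONE_MEMBERS = {
    "Zone1": ["Andrew", "Nicholas", "RooK"],  # 15:00 UTC to 3:00 UTC
    "Zone2": ["Arturo", "David"],             # 3:00 UTC to 15:00 UTC
}

ESCALATION_STEP_MINUTES = 10


def zone_for(moment: datetime) -> str:
    """Return the zone covering a timezone-aware datetime."""
    hour = moment.astimezone(timezone.utc).hour
    return "Zone2" if 3 <= hour < 15 else "Zone1"


def escalation_plan(moment: datetime, on_duty: dict) -> list:
    """Return (delay in minutes, people to page) steps for a page at `moment`.

    `on_duty` is a hypothetical mapping of zone name to the person holding
    the oncall duty role for the current week.
    """
    active = zone_for(moment)
    other = "Zone1" if active == "Zone2" else "Zone2"
    duty = on_duty[active]
    rest_of_zone = [m for m in ZONE_MEMBERS[active] if m != duty]
    return [
        (0, [duty]),
        (ESCALATION_STEP_MINUTES, rest_of_zone),
        (2 * ESCALATION_STEP_MINUTES, ZONE_MEMBERS[other]),
    ]


if __name__ == "__main__":
    page_time = datetime(2022, 6, 29, 4, 0, tzinfo=timezone.utc)
    for delay, people in escalation_plan(page_time, {"Zone1": "Andrew", "Zone2": "David"}):
        print(f"+{delay} min: {', '.join(people)}")
```

Keeping the zone boundaries and the 10 minute step as plain data makes it easy to adjust them if the team changes.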

Create a set of alerting best practices and move current alerts to follow it:

If it needs immediate attention -> page+task+email+irc
If it needs attention, but not immediate -> task+email+irc
If it does not need attention (essentially, for debugging/knowing the system status) -> irc or remove
Always create a runbook when you create an alert
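
As a rough sketch, the rules above could be encoded as a small lookup that also enforces the runbook requirement. The names `Urgency`, `CHANNELS` and `channels_for` are hypothetical and not part of any existing tooling.

```python
from enum import Enum


class Urgency(Enum):
    IMMEDIATE = "needs immediate attention"
    NON_IMMEDIATE = "needs attention, but not immediate"
    INFORMATIONAL = "does not need attention"


# Notification channels per urgency level, following the list above.
CHANNELS = {
    Urgency.IMMEDIATE: ("page", "task", "email", "irc"),
    Urgency.NON_IMMEDIATE: ("task", "email", "irc"),
    Urgency.INFORMATIONAL: ("irc",),  # or remove the alert altogether
}


def channels_for(urgency: Urgency, runbook_url: str) -> tuple:
    """Return the channels for an alert, refusing alerts without a runbook."""
    if not runbook_url:
        raise ValueError("every alert needs a runbook")
    return CHANNELS[urgency]
```

A check like this could run when an alert definition is reviewed, so that alerts without a runbook are caught before they are deployed.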

Also create a common process for page/alert handling:

When a page happens:
- If you don't have a laptop around yet and can act on it -> acknowledge on splunk/victorops
- If you have a laptop around and can act on it -> silence it on the source (for now; eventually the source will always be alertmanager)
  - If the alert has `source=icinga` on alertmanager -> ack on icinga; click on the "X hours ago" text on alertmanager to see the link (the wikimedia.org one)
  - Otherwise ack on alertmanager (the tick button, or a silence that starts with 'ACK! ')

- If you can't act on it, let it page other people
- Do anything that needs doing to get rid of the urgency
  - If you need help, ping people you think might help; if not sure, ping another teammate
- Make sure it has an associated task and populate it with what you found/did and next steps
- If the alert is not gone yet, ack the alert on alertmanager (not icinga) and attach the task id
- If you acked on victorops, resolve on victorops too (otherwise it will page again after 1d)
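
The first decision in the steps above (where to acknowledge a page) can be written down as a small function. This is a minimal sketch: `ack_target` and its return values are hypothetical labels for illustration, not calls into splunk/victorops, icinga or alertmanager.

```python
def ack_target(can_act: bool, has_laptop: bool, source: str) -> str:
    """Return where to acknowledge a page, following the process above.

    `source` is the value of the alert's `source` label on alertmanager.
    """
    if not can_act:
        # Let it page other people.
        return "none"
    if not has_laptop:
        # Acknowledge on splunk/victorops; remember to resolve it there later.
        return "victorops"
    if source == "icinga":
        return "icinga"
    return "alertmanager"


if __name__ == "__main__":
    print(ack_target(can_act=True, has_laptop=True, source="icinga"))
```

Whatever the first acknowledgement target is, the later steps still apply: the page needs a task, and an alert that is still firing gets acked on alertmanager with the task id attached.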

When an alert without a page happens:
- Make sure it has an associated task (this can be automated for alertmanager tasks)
- Ack on alertmanager (not icinga, only alertmanager) and add the task id to the ack comment

(this can be automated with a cookbook, something like cookbook wmcs.ack_page "subject of the page" or show a list of active alerts to choose or such)
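
A sketch of the two pieces such a cookbook would need, matching a page subject against the active alerts and building the ack comment, might look like the following. Everything here is an assumption for illustration: the function names, the `summary` key of each alert, and the exact comment layout after the 'ACK! ' prefix are not taken from existing tooling.

```python
def find_alerts(subject: str, active_alerts: list) -> list:
    """Return the active alerts whose summary contains the page subject.

    Each alert is assumed to be a dict with a "summary" key.
    """
    needle = subject.casefold()
    return [a for a in active_alerts if needle in a["summary"].casefold()]


def ack_comment(task_id: str, note: str = "") -> str:
    """Build an ack comment that starts with 'ACK! ' and carries the task id."""
    if not (task_id.startswith("T") and task_id[1:].isdigit()):
        raise ValueError(f"not a task id: {task_id!r}")
    return f"ACK! {task_id} {note}".rstrip()


if __name__ == "__main__":
    alerts = [
        {"summary": "Ceph cluster health is WARN"},
        {"summary": "Puppet failing on cloudvirt host"},
    ]
    for alert in find_alerts("ceph", alerts):
        print(alert["summary"], "->", ack_comment("T310598", "investigating"))
```

If more than one alert matches, the cookbook could show the matches and let the person choose, as suggested above.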

Pros-cons

Pros:

This minimizes the interruptions for the rest of the team
This makes sure we invest in stabilizing/operationalizing the current systems
Every team member will have (limited) exposure to parts of the infrastructure they don't usually work on, increasing knowledge sharing
Lowers the total amount of out-of-hour pages
Makes it clearer where to look and how to communicate in case of a page

Cons:

The current size of the team means that we would sometimes be oncall half of the days (though compared to all days, as it is now, I think it's an improvement)
Sometimes (especially at the beginning) the person paged might not know how to handle the page

Option 2

Do nothing.

Pros:

No new effort is needed

Cons:

Nothing improves; in practice, things deteriorate (alert fatigue, false alerts, false pages, ...)

Rationale

The optimal state would be one where alerts only page the right people for real problems. Given that this state is never fully reachable, option 1 is the only one that gets us closer to it, in three ways:

Giving an extra level of paging to avoid paging the wrong people
Ensuring that there's some time and effort put into improving the current alarms/processes (by setting the oncall role)
Simplifying the ways that an alert has to be handled (by centralizing and moving them to alertmanager or creating wrapper scripts/tools, by means of the extra time dedicated by the oncall role)

Currently I'm the only one in the UTC3 zone, but being paged only during the day is also an improvement over the current status. So we go with it for a couple of months with me alone, until the team grows again (a newcomer is already joining, and someone is coming back from leave).