VictorOps is the paging/notification/engagement solution used by SRE, WMCS and others (starting ~June 2020).
Set up as a new user
You will have received an invitation from VictorOps (VO for short). At the invitation stage you will be asked for a few details: your VO username, password and “displayed name”. You will also be asked for a phone number for SMS; this can safely be skipped and added later.
Set up personal paging policies
Each user can configure their preferred notification methods by clicking their username at the top right and selecting “your profile”. The “primary paging policy” defaults to the email address you used at registration time, plus the phone number if you provided one.
[SRE team]: Add yourself to the batphone
The current SRE paging model is referred to as "batphone" and its schedule can be found under Teams -> SRE -> On-Call Schedule. Follow the steps below to add yourself to the batphone:
- Get any existing member of the team to make you a team admin.
- Navigate to SRE rotations
- For the "batphone" rotation, expand by clicking the caret on the right, select "add a shift" (bottom left) and pick "partial day" from the dropdown
- In the next form, "shift name" is your Full name (one shift per person)
- Click "monday through friday" and select all days of the week. Pick the desired hours (e.g. based on Icinga "awake hours"). Note that these times are relative to the "time zone above" in the form, not necessarily your own timezone.
- Click "save shift"
- You’ll be shown the rotation with the new empty shift added. Click the leftmost icon to "manage members" for the shift and add your username.
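Because the shift hours are entered relative to the form's time zone, it can help to convert your local hours before filling in the form. The following is a minimal sketch only: the function name `to_form_zone` and the example zones (`Europe/Rome` for the engineer, `UTC` for the form) are assumptions for illustration, not values taken from the actual VictorOps configuration.

```python
from datetime import date, datetime, time
from zoneinfo import ZoneInfo  # Python 3.9+


def to_form_zone(local: time, on: date, my_zone: str, form_zone: str) -> time:
    """Convert a local wall-clock time on a given date to the form's time zone.

    The date matters because daylight saving time changes the offset.
    """
    aware = datetime.combine(on, local, tzinfo=ZoneInfo(my_zone))
    return aware.astimezone(ZoneInfo(form_zone)).time()


# Hypothetical example: awake hours of 08:00-22:00 in Europe/Rome,
# entered into a form whose time zone is UTC.
summer = date(2020, 7, 1)
shift_start = to_form_zone(time(8, 0), summer, "Europe/Rome", "UTC")
shift_end = to_form_zone(time(22, 0), summer, "Europe/Rome", "UTC")
print(shift_start, shift_end)
```

Since the offset between two zones can change with daylight saving time, hours converted this way may need to be revisited when the clocks change.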
More information can be found at the VictorOps knowledge base.
Set yourself on vacation
If you are in one of the "batphone" rotations (aka "everyone awake gets a page") and are going on vacation, follow these steps to set yourself on vacation:
- Schedule an override, either from the app (calendar tab) or SRE scheduled overrides (or your team's scheduled overrides)
- Once the override is set, navigate to the scheduled overrides page mentioned above and expand your newly added override
- Set the "devnull" contact as the overriding person; all alerts to that contact are effectively blackholed.
Make sure you hear the notifications even if your phone is in do not disturb mode
Invite a new user (VO admins)
At user onboarding time, you (an admin on VO) will receive a request to invite a new user. Navigate to https://portal.victorops.com/dash/wikimedia#/users and hit "invite user", using the user's full wikimedia.org email address for the invitation. After the invitation has been sent, the user needs to be added to a team. To do so, navigate to https://portal.victorops.com/dash/wikimedia#/team-schedules and pick a team, then invite the newly created user to the team. For most teams the user will need to be a team admin as well: to do this, hit the pencil button in the user's row and hit confirm.
The Cloud Services team uses a separate set of rotations and gets paged in somewhat different ways due to the size of the group and tech involved. The focus is on ensuring alerts reach the most prepared people to resolve them at times that are least disruptive to daily life where possible. This was deemed necessary partly because Cloud Services has a lot of systems that merit paging, those systems should only alert the WMCS team, and some of the alerts are fairly easy to trip and hard to disable during changes. There are three "rotations" defined for each team member:
- Working hours: This is the engineer's primary working schedule on weekdays. This is when most of our pages come in, due to the higher rate of changes on the systems. It ensures that people who are working can take care of things without disturbing engineers who are outside working hours in their own timezone.
- Awake hours: 6am to 10pm in the local timezone.
- All hours: 24x7
The "Working hours" rotation pages immediately, so that those who are on duty and most ready to help are informed of the issue. If no one acknowledges an incident within 10 minutes, the alert is escalated to the "awake hours" rotations. This implies a 10-minute delay for paging on weekends, when no one is in a "working hours" rotation, but emails go out instantly. If the alert is still unacknowledged after 15 minutes, the "all hours" escalation is triggered, which pages the entire team. Because email alerts go out right away, people already at a computer can intercept and help out more quickly. This makes sure that someone will always be paged, one way or another, while letting us simulate some of the best parts of a "follow the sun" support model without actually needing people everywhere.
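The rotation and escalation rules above can be sketched in a few lines of Python. This is an illustrative model only, not how VictorOps implements escalation: the function names, the example 09:00-17:00 working schedule, and the reading of both the 10-minute and 15-minute thresholds as measured from the start of the incident are all assumptions.

```python
from datetime import datetime, time, timedelta

# Escalation steps from the text. Measuring both delays from the start
# of the incident is an assumption; the source does not say.
ESCALATION_STEPS = [
    (timedelta(minutes=0), "working hours"),
    (timedelta(minutes=10), "awake hours"),
    (timedelta(minutes=15), "all hours"),
]


def active_rotations(local_now: datetime, work_start: time, work_end: time) -> set:
    """Rotations one engineer is in at local_now (in their own timezone)."""
    active = {"all hours"}  # 24x7
    if time(6, 0) <= local_now.time() < time(22, 0):
        active.add("awake hours")  # 6am to 10pm local time
    # Monday is weekday 0, so weekdays are 0..4.
    if local_now.weekday() < 5 and work_start <= local_now.time() < work_end:
        active.add("working hours")
    return active


def rotations_paged(unacknowledged_for: timedelta) -> list:
    """Rotations that have been paged once an alert is this old."""
    return [name for delay, name in ESCALATION_STEPS if unacknowledged_for >= delay]


def is_paged(local_now: datetime, work_start: time, work_end: time,
             unacknowledged_for: timedelta) -> bool:
    """True if the engineer is in any rotation that has been paged so far."""
    mine = active_rotations(local_now, work_start, work_end)
    return any(name in mine for name in rotations_paged(unacknowledged_for))


# Hypothetical engineer working 09:00-17:00; 2020-06-06 was a Saturday.
saturday_morning = datetime(2020, 6, 6, 11, 0)
print(is_paged(saturday_morning, time(9), time(17), timedelta(minutes=0)))
print(is_paged(saturday_morning, time(9), time(17), timedelta(minutes=10)))
```

In this model the weekend case falls out naturally: on a Saturday morning the engineer is only in the "awake hours" and "all hours" rotations, so the first page reaches them at the 10-minute mark.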
The devnull user should work for WMCS overrides/vacations as well.
How we use it
FIXME: TODO - Document what stuff makes it into VictorOps, and how or through what (Icinga? Prometheus? Puppet?)