Splunk On-Call

Splunk On-Call (formerly VictorOps) is the paging/notification/engagement solution used by SRE, WMCS and others (since June 2020).

How to

Page all SREs (Batphone)

Create Incident Splunk On Call

For emergencies, there are a few ways to page SREs and escalate. Klaxon is your shortest path to getting help from someone on call. However, there are cases where you may need to page all SREs regardless of on-call status. This requires Splunk On-Call access and a valid WMF email account.

create incident example for SRE batphone
  • Log in with
  • Click the Create Incident button.
  • Fill out the form, selecting the Batphone escalation policy.
  • Click Create Incident to submit the form.

Set up as a new user

You have received an invitation from Splunk On-Call. At the invitation stage you will be asked for a few pieces of information: your VO username, password and “displayed name”. You will also be asked for a phone number for SMS messages; this can safely be skipped and added later.

Logging in with Google SSO

Sign out and select “sign in via SSO” on the login page.

  • Next, you will be prompted to enter your Org Slug; enter ‘wikimedia’.
  • From this page you will be redirected to sign in using your wikimedia.org Google credentials.
  • After entering Google credentials, you will be asked to enter your current Splunk On-Call (VictorOps) username/password. Note: You will only need to enter your Splunk On-Call username and password once to link the account with SSO, then it will not be asked for again.

Additional information about Splunk On-Call SSO can be found at https://help.victorops.com/knowledge-base/single-sign-sso/

Set up personal paging policies

Each user can configure their preferred notification methods by clicking the username on top right and “your profile”. The “primary paging policy” will default to the email you used at registration time, and optionally the phone number if provided.

Make sure you hear the notifications even if your phone is in do not disturb mode

Android: https://help.victorops.com/knowledge-base/android-devices-victorops/

Team specific details

Please review the team specific details and setup steps in their respective sections below, and perform those which apply to you.

VO Administration

Invite a new user

At user onboarding time, you (an admin on VO) will receive a request to invite a new user (usually via phab task).

  1. Navigate to https://portal.victorops.com/dash/wikimedia#/users and hit "invite user", using the user's full wikimedia.org email address for invitation.
  2. After the invitation has been sent, the user needs to be added to a team. Therefore navigate to https://portal.victorops.com/dash/wikimedia#/team-schedules and pick a team, then invite the newly-created user to the team.
  3. Give the user "Team Admin" privileges for relevant teams. To do this hit the pencil button for the user's row and hit confirm.

Removing a user

To remove a user, first remove them from any rotations or escalation policies they may be a part of. This typically can be done by removing them from any teams they have been added to.

Note: If the above step is not done, you may see an error to the effect of "We were unable to delete the user, please try again or contact support".

SRE Team Usage

Adding yourself to the batphone

The SRE "all hands on deck" model is referred to as "batphone" and its schedule can be found under Teams -> SRE -> On-Call Schedule. During onboarding please follow the steps below to add yourself to the batphone:

Note: if you run into any permission errors in the process, please confirm with a VO admin that you have "team admin" permission.

  1. Navigate to SRE rotations
  2. For the "batphone" rotation, expand by clicking the caret on the right, select "add a shift" (bottom left) and pick "partial day" from the dropdown
  3. In the next form, "shift name" is your full name (one shift per person).
  4. Click "monday through friday" and select all seven days of the week. Pick the desired hours (e.g. based on Icinga "awake hours"). Note that these times are relative to the "time zone above" field in the form.
  5. Click "save shift"
  6. You’ll be shown the rotation with the new empty shift added. Click the leftmost icon to "manage members" for the shift and add your username.
  7. Done!

More information can be found at the VictorOps knowledge base.

Business Hours Pager Shift

Business hours paging is configured under Teams -> SRE -> Rotations.

There are two "Business hours" rotations defined, one for each region (EMEA and AMER), with two "pools" per region. These pools (region-day-pool1 and region-day-pool2) contain the same people within each region, but their ordering is staggered so that the roster rotates evenly and automatically.

Note: There is no notion of primary or secondary between region-day-pool1 and region-day-pool2; both are treated with equal priority. Pages are routed to all pools simultaneously.
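The staggering described above can be pictured with a small sketch. The roster names, the half-roster offset and the weekly step are all assumptions made for illustration; the real ordering is whatever is configured in the Splunk On-Call rotation settings.

```python
def on_duty(roster, week, offset):
    """Return the person on duty in a pool for a given week index.

    Each pool walks through the same roster, starting `offset` places in.
    """
    return roster[(week + offset) % len(roster)]


def weekly_pairs(roster, weeks):
    """Return (pool1, pool2) on-duty pairs for each week.

    pool2 is staggered by half the roster; this offset is an assumption,
    not a value taken from the real configuration.
    """
    stagger = len(roster) // 2
    return [
        (on_duty(roster, week, 0), on_duty(roster, week, stagger))
        for week in range(weeks)
    ]


# Hypothetical four-person regional roster.
roster = ["alice", "bob", "carol", "dave"]
for week, (pool1, pool2) in enumerate(weekly_pairs(roster, 4)):
    print(f"week {week}: pool1={pool1} pool2={pool2}")
```

With an offset like this, nobody is paired with themselves and each person appears once in each pool over a full cycle of the roster.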

Viewing the business hours pager schedule

There are three methods to view the upcoming pager schedule:

Starting your shift

  1. Open the Splunk On-Call app on your mobile device and ensure your authentication is active; this will speed up acknowledgement of alerts.
  2. Ensure the time zone and business hours for the pool you are representing this week reflect your current local time zone and hours. Under Teams -> SRE -> Rotations, expand your business hours region, then identify the pool which reflects your pager shift for the upcoming week (your name will appear in the time bar unless you've swapped shifts, in which case the name of the person originally on duty will appear) and click the pencil to edit the shift.
  3. Within the edit window, double check that the time zone and hours reflect the business hours for your location, specifically the "Time Zone" and "Each user is on duty" fields. Adjust as necessary, then click "save shift". You can check under the "On-Call Schedule" tab that the time is now set correctly (and that any overrides have worked, if you swapped shifts). Note that it's not possible to set a start time in the past: if your current time is 08:31 you can't set a start time of 08:30; you need to set it to 09:00.
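The "no start time in the past" rule from the last step can be worked out with a short sketch. It assumes the form offers start times in 30-minute steps, which is inferred from the 08:30 / 09:00 example and not confirmed by the Splunk On-Call documentation; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def earliest_start(now, slot_minutes=30):
    """Return the first slot boundary that is not earlier than `now`.

    `slot_minutes` is an assumed granularity for the start-time field.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    slot = timedelta(minutes=slot_minutes)
    elapsed = now - midnight
    # Ceiling division: timedelta // timedelta floors, so negate twice.
    slots_needed = -(-elapsed // slot)
    return midnight + slots_needed * slot


# At 08:31 the 08:30 slot is already in the past, so the next one is used.
print(earliest_start(datetime(2024, 1, 1, 8, 31)).strftime("%H:%M"))
```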

Acknowledging a page

You can acknowledge a page from the web interface, the application, or from IRC.

To acknowledge a page from IRC, in the #wikimedia-sre channel:

# Get the list of incidents from sirenbot
# Acknowledge the incident
!ack $incident_id

Escalating a page

If you receive a page and need help, do not hesitate to escalate it to the batphone.

In VictorOps / Splunk on-call this type of escalation is done with the "add responders" feature.

  1. In the VictorOps / Splunk on-call interface, navigate to the incident (alert) that you wish to escalate.
  2. Find the "Responders" section, and click "add responder" (or the + icon on mobile)
  3. From "Escalation Policies" choose "Batphone" from under the "SRE" heading, and click next.
  4. Review for accuracy and press save.

The system will now trigger the batphone paging policy, paging the broader SRE team for assistance.

Scheduling an override (out of office, on vacation, etc)

If you will be unavailable during a scheduled pager shift, here's what to do:

  1. Schedule an override, either from the app (calendar tab) or SRE scheduled overrides (or your team's scheduled overrides)
  2. Once the override is set, navigate to the scheduled overrides link above and expand your newly added override. You will see a breakdown of the pager rotations and escalations needing coverage.
  3. Populate the user field to reflect who will be taking on the affected pager shifts:
    • SRE Business hours shifts:
      • Choose the person who will be taking the shift for you. Note: it is preferred to arrange coverage ahead of time when possible, however leaving this field blank will trigger an "unassigned overrides" notification to VO admins prompting managers to fill in coverage.
      • Note that you need to set the override for both your regional shift (Business Hours EMEA or Business Hours Americas) and SRE Business Hours (Escalation).
      • If you're on vacation when you don't have a Business hours shift, use "Dev Null" for the override.
    • SRE Batphone
      • If you will be unavailable (out of office, on vacation, etc.) choose "devnull" as the overriding person for the batphone shift; all alerts to that contact are effectively blackholed.
      • If you will be available, but are arranging alternate coverage for the business hours pager, choose yourself in order to continue receiving batphone alerts.

Automatic unpaged alert re-routing (klaxon/victorops.py escalate_unpaged)

VO escalation policies currently lack an "if no members are presently on call in the current step, immediately proceed to the next step" option. To work around this, a klaxon/victorops.py script has been deployed to poll the VictorOps / Splunk On-Call API for alerts which have not yet paged anyone, and immediately re-route them to the batphone pager rotation.

This script runs every 15s from a systemd timer called 'vo-escalate' on the alerting hosts, e.g. alert1001 (included via profile::klaxon)
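The selection logic behind such a re-routing pass might look roughly like the sketch below. This is not the deployed klaxon code: the incident field names (`id`, `phase`, `paged_users`), the `UNACKED` phase value and the `batphone` target string are all assumptions for illustration, and no real API call is made.

```python
# Hypothetical target name; the real policy identifier is not given here.
BATPHONE_TARGET = "batphone"


def find_unpaged(incidents):
    """Return the IDs of open incidents that have not paged any user yet."""
    return [
        incident["id"]
        for incident in incidents
        if incident["phase"] == "UNACKED" and not incident["paged_users"]
    ]


def plan_reroutes(incidents, target=BATPHONE_TARGET):
    """Pair every unpaged incident with the rotation it should be re-routed to."""
    return [(incident_id, target) for incident_id in find_unpaged(incidents)]


# Illustrative data only; the field names are assumptions, not the API schema.
sample = [
    {"id": "101", "phase": "UNACKED", "paged_users": []},
    {"id": "102", "phase": "UNACKED", "paged_users": ["alice"]},
    {"id": "103", "phase": "RESOLVED", "paged_users": []},
]
print(plan_reroutes(sample))
```

A real poller would fetch the incident list from the API on each timer run and submit the planned re-routes; keeping the selection step as a pure function makes it easy to test without network access.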

WMCS Usage

The Cloud Services team uses a separate set of rotations and gets paged in somewhat different ways due to the size of the group and tech involved. The focus is on ensuring alerts reach the most prepared people to resolve them at times that are least disruptive to daily life where possible. This was deemed necessary partly because Cloud Services has a lot of systems that merit paging, those systems should only alert the WMCS team, and some of the alerts are fairly easy to trip and hard to disable during changes.

We do two 12h shifts: zone1 from 03:00 UTC to 15:00 UTC, and a complementary zone2.

There are three "rotations" defined:

  • Oncall: This rotation is split in two shifts (zone1 and zone2), with two pools of engineers per shift and only one of the engineers oncall at a time. The engineer oncall for both shifts changes on Wednesday. If that engineer does not ack the alert, it escalates to the next rotation.
  • Everyone awake in the zone: This rotation has one shift per engineer, where the engineer is oncall only for the zone they decide to be in. This effectively pages all the engineers in the active timezone when the alert happens. If none of the engineers acks the alert, it escalates to the next rotation.
  • Everyone: This rotation includes all WMCS engineers in 24/7 shifts. If an alert reaches this rotation after escalating from the previous rotations, every WMCS engineer will receive a page regardless of their chosen zone.

The flow is: a single engineer for the current timezone is paged first; if they don't ack, all the engineers from that zone get paged; and if nobody acks, all engineers get the page.
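The three-step flow can be sketched as follows. The engineer names and function names are hypothetical, and treating 03:00 UTC as the first hour of zone1 and 15:00 UTC as the first hour of zone2 is an assumption about how the shift boundaries are handled.

```python
def active_zone(hour_utc):
    """Return the zone covering a given UTC hour (zone1 is 03:00 to 15:00 UTC)."""
    return "zone1" if 3 <= hour_utc < 15 else "zone2"


def escalation_steps(hour_utc, oncall, zone_members):
    """Return who is paged at each step, in escalation order.

    oncall: maps each zone to its single oncall engineer.
    zone_members: maps each zone to the engineers who chose that zone.
    """
    zone = active_zone(hour_utc)
    everyone = sorted({name for members in zone_members.values() for name in members})
    return [
        [oncall[zone]],              # step 1: the oncall engineer for the zone
        sorted(zone_members[zone]),  # step 2: everyone awake in the zone
        everyone,                    # step 3: every engineer
    ]


# Hypothetical team layout.
oncall = {"zone1": "ana", "zone2": "raj"}
zone_members = {"zone1": ["ana", "ben"], "zone2": ["raj", "sue"]}
for step, paged in enumerate(escalation_steps(10, oncall, zone_members), start=1):
    print(f"step {step}: {paged}")
```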

The rationale for this setup is described at Decision_record_T310598_Team_oncall_alerting_schedules_and_processes

Note: The devnull user should work for WMCS overrides/vacations as well.

How we use it

FIXME: TODO - Document what stuff makes it into VictorOps, and how or through what (Icinga? Prometheus? Puppet?)

Codesearch: [vV]ictor[oO]ps