SRE/Oncall

Being on-call means you may be paged anytime during your shift to investigate and fix issues in our production environment. You are expected to respond to issues within a reasonable time (less than five minutes) during your business hours. Pages that go unacknowledged for longer than five minutes proceed to the “bat phone” rotation, so that issues are still triaged promptly if no on-call engineer can respond.

During this time, you will be expected to take whatever actions are necessary to triage, coordinate with subject matter experts, restore services to an operational state, and bring the incident to resolution.

When on call, you will play one of two roles: incident responder or incident coordinator. You can read more about the Incident Coordinator and Responder roles at Incident response.

The goal of this exercise is to reduce overall page noise for the larger group and to minimize widespread disruption as far as reasonably possible (if you need help, please escalate as required). You have agency in executing your shift and coordinating your work; please make reasonable adjustments with this goal in mind.

Preparing for your shift

  1. Have your laptop and an Internet connection with you (office, home, wifi, etc.). No additional accommodations should be required outside your regular work environment, as this on-call rotation happens during business hours.
  2. If you have a compatible phone, you can install the Splunk On-Call mobile app.
  3. Be prepared: your environment is set up, a current working copy of the necessary repositories is local and functioning, and you have configured and tested your credentials for third-party services and production servers.
  4. See who else will be sharing the on-call shift with you via the Splunk On-Call Calendar.
  5. Read our incident response documentation to understand how we handle serious incidents, the different roles during an incident, methods of communication available, etc.
  6. Browse recent incidents and familiarize yourself with what recent problems have occurred.  
  7. Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.

When your shift begins, ask yourself:

  • Who is my on-call partner?
  • Who are the people in the previous shift?
  • Are there any active incidents?
  • Are there any prior incidents that may reoccur?

On-call acknowledgement

In terms of waking hours, the most common setting is 8 am to midnight local time; however, some elect to participate in wider notification windows (such as 24 hours).

Pages are routed first to the business hours on-call rotation. Batphone is the nickname given to our page-everyone-who-is-awake rotation. Pages that arrive outside of business hours, or that go unacknowledged for more than five minutes, are automatically escalated to the batphone for immediate triage.
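The routing rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not the actual Splunk On-Call escalation policy configuration; the function name `route_page` and the rotation labels are hypothetical.

```python
from datetime import timedelta

# From the text: pages unacknowledged for more than five minutes escalate.
ACK_TIMEOUT = timedelta(minutes=5)


def route_page(in_business_hours: bool, unacked_for: timedelta) -> str:
    """Return the rotation (hypothetical label) that should hold the page."""
    if not in_business_hours:
        # Outside business hours, pages go straight to the batphone.
        return "batphone"
    if unacked_for > ACK_TIMEOUT:
        # Nobody on the business hours rotation acknowledged in time.
        return "batphone"
    return "business-hours"
```

For example, a page that has been waiting two minutes during business hours stays with the business hours rotation, while the same page after six minutes is with the batphone.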

Should you find yourself unable to locate your on-call partner, please contact your manager to help coordinate someone to assist you.

If you need to be unavailable for a short period during on-call and your manager has not identified someone to help, please reach out to SRE and ask someone to “cover” for you and create an override. If that is not possible, please create an empty override for your unavailability. This will trigger a notification for managers and Leo to address.

Handoff

When your on-call shift ends, let the next on-call team know about:

  • Any issues that have not been resolved yet and any notable incidents or pages during your shift. This will help the next team prepare.
  • Any changes impacting production or needing monitoring. Raise these with the next on-call region and document how to supervise these changes.
  • Any current pages and information shared.

Business hours

From the Wikipedia article on Business hours:

Business hours are the hours during the day in which business is commonly conducted. Typical business hours vary widely by country. By observing common informal standards for business hours, workers may communicate with each other more easily and find a convenient divide between work life and home life.

Typically, this means roughly 7 am to 5 pm (give or take 1-2 hours at each end), Monday through Friday or Sunday through Thursday depending on your country of employment, in the timezone where you are located during your on-call shift.

Business hours are country-specific, so your country may differ from other countries’ practices. Business hours may vary between countries in the days of the week, the start and end times, or the length of a business day; please ensure you are adhering to your country’s direction on business hours.

You may notice that in Splunk On-Call (formerly VictorOps) we have initially configured the on-call business hours schedule to roughly 7 am - 5 pm (UTC in EMEA and Eastern Time for the Americas). This setting is only an initial calendaring template; please update your hours in Splunk On-Call to match your timezone and business hours.
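To see how a business hours window depends on timezone and working week, here is a minimal Python sketch. It assumes the 7 am - 5 pm template and a Monday through Friday week as defaults; the function name `in_business_hours` and its parameters are hypothetical and do not correspond to any Splunk On-Call setting.

```python
from datetime import datetime, time, tzinfo

# Python's weekday(): Monday is 0 ... Sunday is 6.
MON_TO_FRI = frozenset({0, 1, 2, 3, 4})
SUN_TO_THU = frozenset({6, 0, 1, 2, 3})


def in_business_hours(
    moment: datetime,
    tz: tzinfo,
    start: time = time(7, 0),    # template start from the text
    end: time = time(17, 0),     # template end from the text
    workdays: frozenset = MON_TO_FRI,
) -> bool:
    """Check whether an aware datetime falls inside local business hours."""
    # Convert to the engineer's local time before comparing.
    local = moment.astimezone(tz)
    return local.weekday() in workdays and start <= local.time() < end
```

Any `tzinfo` works here, for instance `zoneinfo.ZoneInfo("America/New_York")` for Eastern Time; pass `workdays=SUN_TO_THU` for a Sunday through Thursday week.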

Your business hours may not match your on-call partner’s business hours precisely, but they will largely overlap; please schedule any errands during the overlap with your on-call partner to ensure coverage.

Your work schedule may also be slightly different from traditional business hours; in those cases, please discuss the best course of action with your manager.

Change of working hours

If you're on the road or otherwise altering your usual working patterns, don’t panic! The goal of on-call paging is to reduce noise (within reason) for the broader team and provide coverage as much as possible during business hours.

If your business hours change a bit, that's fine; please communicate that with your manager and your on-call partner.

If you switch continents or countries, please work with your manager to create an override for your shift and coordinate a swap. Depending on the length of your stay in the new location, a switch to the on-call pool for your continent may also be coordinated.

Being on call doesn’t mean you need to tether yourself to your computer. It does mean that during these weeks, you should plan on being interrupted to respond to and address issues as they arise. There will be good weeks, and there will certainly be bad weeks, but don't be scared: you’re not alone, and there is an entire team to support you.

Vacations, holidays, and sick days

You have agency in executing your shift and coordinating your work. If you need to be out due to life events (planned or not), please work with your manager (or reach out to User:LMata) to help coordinate someone to swap shifts. If it is short notice, please also notify your on-call partner so they are aware of the situation.

No one expects you to work while sick. If you are sick before or during your shift, please ask your manager to find someone to take over your shift, and let your on-call partner know for awareness. If everyone in a rotation happens to be indisposed, the batphone paging rotation will be used as a fallback.

Creating an empty override in Splunk On-Call with your “unavailable time window” will automatically generate a notification.

Communication when IRC is down

If for any reason Libera, or whichever IRC network we are using at the moment, is unavailable or down, please discuss in the #sre-incident-response Slack channel.

See also

Incident response