SRE/Oncall
Being on-call means you may be paged anytime during your shift to investigate and fix issues in our production environment. You are expected to be able to respond to issues with a reasonable response time (less than 30 minutes) during your shift. Pages that go unacknowledged for longer than 30 minutes will proceed to the “bat phone” rotation to ensure issues are triaged promptly if all on-call engineers cannot respond.
During this time, you will be expected to take whatever actions are necessary to triage issues, coordinate with subject matter experts, restore services to an operational state, and bring the incident to resolution.
When on call, you will play one of two roles: incident responder or incident coordinator. You can read more about the Incident Coordinator and Responder roles at Incident response.
The goal is to reduce overall page noise to the larger group and to minimize widespread disruption as far as reasonably possible (if you need help, please escalate as required). You have agency in executing your shift and coordinating your work; please make reasonable adjustments with this goal in mind.
Preparing for your shift
- Have your laptop and an Internet connection with you (office, home, wifi, etc.)
- If you have a compatible phone, you can install the app:
- Be prepared (environment is set up, a current working copy of the necessary repositories is local and functioning, you have configured and tested your credentials for third-party services and production servers, etc.)
- See who else will be sharing the on-call shift with you via the Oncall-optimizer Schedule.
- Read our incident response documentation to understand how we handle serious incidents, the different roles during an incident, methods of communication available, etc.
- Browse recent incidents and familiarize yourself with what recent problems have occurred.
- Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.
When your shift begins, ask yourself:
- Who is my on-call partner?
- Who are the people in the previous shift?
- Are there any active incidents?
- Are there any prior incidents that may reoccur?
On-call acknowledgement
Pages are routed first to the 24/7 on-call rotation. Pages that go unacknowledged for more than 30 minutes are sent to the batphone for triage. Batphone is the nickname given to our page-everyone-who-is-awake rotation.
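The routing rule above can be sketched in a few lines of Python. This is only an illustration of the 30-minute threshold described in the text, not how the paging system is actually configured; the function name `page_target` and the labels `"on-call"` and `"batphone"` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the text: pages unacknowledged for more than 30 minutes escalate.
ESCALATION_THRESHOLD = timedelta(minutes=30)


def page_target(paged_at: datetime, now: datetime, acknowledged: bool) -> Optional[str]:
    """Return the rotation that should currently be notified (hypothetical helper).

    Returns None once the page has been acknowledged.
    """
    if acknowledged:
        return None
    if now - paged_at > ESCALATION_THRESHOLD:
        # Nobody acknowledged in time: fall back to the batphone rotation.
        return "batphone"
    return "on-call"


if __name__ == "__main__":
    sent = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(page_target(sent, sent + timedelta(minutes=45), acknowledged=False))
```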
Should you find yourself unable to locate your on-call partner, please contact your manager to help coordinate someone to assist you.
If you need to be unavailable for a short period during your on-call shift and your manager has not identified someone to help, please create a swap request for the time range in question. If it is urgent, reach out to SRE, ask someone to “cover” for you, and make arrangements to reassign part of your shift. If that is not possible, please create an empty override for the period you are unavailable and contact your manager. This will trigger a notification for managers.
Handoff
When your on-call shift ends, let the next on-call team know about:
- Any issues that have not been resolved yet and any notable incidents or pages during your shift. This will help the next team prepare.
- Any changes impacting production or needing monitoring. Raise these with the next on-call region and document how to supervise these changes.
- Any current pages and information shared.
- Any incidents documented as per the incident runbook.
24/7 Shifts
WMF 24/7 on-call policy is covered by the SRE/Management agreement document.
Schedule Swaps
Typically (given reasonable lead time), on-call shift swaps can be self-served using the request swap feature within oncall-optimizer.
In cases where a swap is needed more quickly, reach out to your manager.
Schedule
There are two people on-call at any given time.
Shift cycles run from 04:00 UTC Thursday to 04:00 UTC the following Thursday in the European winter (shifts follow the London time zone). Based on current staff numbers, the shifts are:
20-04 UTC (Americas staff)
04-12 UTC (EET staff; CET staff)
12-20 UTC (WET staff; CET staff; one Americas staff member)
These times move with European daylight saving time (London specifically) to keep the same wall-clock time, i.e. during the European DST period they are:
19-03 UTC (Americas staff)
03-11 UTC (EET staff; CET staff)
11-19 UTC (WET staff; CET staff; one Americas staff member)
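As a rough illustration of how the UTC windows follow London wall-clock time, the sketch below works out which shift covers a given instant. It is an assumption-laden example, not part of oncall-optimizer: the names `current_shift`, `london_is_dst` and `SHIFTS_LONDON` are hypothetical, and the UK daylight saving rule (last Sunday of March to last Sunday of October, switching at 01:00 UTC) is hard-coded to keep the example free of time zone database dependencies.

```python
from datetime import datetime, timedelta, timezone

# Shift windows in London wall-clock hours, taken from the schedule above.
SHIFTS_LONDON = [
    (20, 4, "Americas staff"),
    (4, 12, "EET staff; CET staff"),
    (12, 20, "WET staff; CET staff; one Americas staff member"),
]


def _last_sunday(year: int, month: int) -> datetime:
    """Midnight UTC on the last Sunday of the given month."""
    if month == 12:
        first_of_next = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        first_of_next = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    last_day = first_of_next - timedelta(days=1)
    # weekday(): Monday is 0 and Sunday is 6, so step back to the nearest Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


def london_is_dst(now_utc: datetime) -> bool:
    """True while London observes daylight saving time (hard-coded UK rule)."""
    start = _last_sunday(now_utc.year, 3).replace(hour=1)
    end = _last_sunday(now_utc.year, 10).replace(hour=1)
    return start <= now_utc < end


def current_shift(now: datetime) -> tuple:
    """Return (start_hour_utc, end_hour_utc, staff) for a timezone-aware datetime."""
    now_utc = now.astimezone(timezone.utc)
    offset = 1 if london_is_dst(now_utc) else 0
    london_hour = (now_utc.hour + offset) % 24
    for start, end, staff in SHIFTS_LONDON:
        if start < end:
            covered = start <= london_hour < end
        else:
            # The window wraps past midnight (20-04).
            covered = london_hour >= start or london_hour < end
        if covered:
            return ((start - offset) % 24, (end - offset) % 24, staff)
    raise AssertionError("shift windows should cover all 24 hours")


if __name__ == "__main__":
    print(current_shift(datetime(2024, 7, 15, 3, 30, tzinfo=timezone.utc)))
```

Defining the windows once in London time and deriving the UTC hours means the winter and summer tables cannot drift apart.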
With current staffing levels, this means that each staff member will be on-call about 8 times a year (roughly every 6.5 weeks). The distribution will be reviewed annually in line with staff distribution.
Vacations, holidays, and sick days
You have agency in executing your shift and coordinating your work. If you need to be out due to life events (planned or not), please work with your manager to help coordinate someone to swap shifts. If it is short notice, please also notify your on-call partner so they are aware of the situation.
No one expects you to work while sick. If you are sick before or during your shift, please communicate with your manager to find someone to take over your shift, and let your on-call partner know for awareness. If all people in a rotation happen to be indisposed, the batphone paging rotation will be used as a fallback.
After your shift
As per the on-call agreement, staff who provide out-of-hours shift coverage are entitled to compensation. The agreement states that:
- The base compensation for a standard shift is 1.5 days of time off. For non-standard shifts, the time off is calculated at 33% of the total number of hours covered outside of working hours.
- Staff responding to incidents outside of working hours are entitled to 2 hours of time off per hour worked or part thereof.
- Staff who are on-call during a public holiday in their country of residence receive a whole day off per day on-call.
For more specifics of how hours are calculated, please see the agreement or ask your manager.
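For a rough feel of the numbers, the rules above can be written as small Python helpers. This is a sketch only: the function names are hypothetical, each rule is shown in isolation, and how the rules combine or how hours convert to days is defined by the agreement, not by this example.

```python
import math

# Base compensation for a standard shift, in days of time off (from the list above).
STANDARD_SHIFT_TIME_OFF_DAYS = 1.5


def nonstandard_shift_time_off_hours(out_of_hours_covered: float) -> float:
    """Time off for a non-standard shift: 33% of the hours covered outside working hours."""
    return 0.33 * out_of_hours_covered


def incident_time_off_hours(hours_worked: float) -> int:
    """Time off for incident response outside working hours.

    Two hours of time off per hour worked "or part thereof", so partial
    hours are rounded up before doubling.
    """
    return 2 * math.ceil(hours_worked)


def public_holiday_time_off_days(days_on_call: int) -> int:
    """One whole day off per day on-call during a public holiday."""
    return days_on_call


if __name__ == "__main__":
    # An incident that took 1 hour 15 minutes counts as 2 hours worked.
    print(incident_time_off_hours(1.25))
```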
To ensure that the impact of on-call is properly addressed by management and ICs, please remember to fill in the post-shift check-in form after every shift.
Communication when IRC is down
If for any reason Libera, or whichever IRC network we are using at the time, is unavailable or down, please discuss in the #sre-incident-response Slack channel.