SLO/Template instructions
A service level objective (SLO) is an agreement, among everyone who works on a service, about how reliable that service needs to be.
Without such an agreement, different people may have different implicit notions of how much latency is tolerable, how many errors justify getting someone out of bed, or how much instability justifies rolling back a release. By defining a quantitative, measurable set of targets, and then measuring our performance against those targets, we ensure that we're on the same page.
At the same time, an SLO allows teams to reason about the services they depend on. Suppose a frontend promises 100 ms responses at the 95th percentile, but it passes each request to a backend promising 500 ms at the 95th. The SLOs are incompatible -- the frontend can't reasonably promise to respond faster than the backend! But in the absence of explicitly agreed SLOs, such mismatches in implicit expectations are common.
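A mismatch like this one can be checked mechanically once both targets are written down. The following is a minimal sketch, assuming each SLO is recorded as a percentile plus a latency threshold in milliseconds, and that the frontend waits on the backend for every request. The names `LatencySLO` and `is_compatible` are hypothetical, chosen for illustration, and are not part of the template or of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """A latency target: `percentile` of requests finish within `threshold_ms`."""
    service: str
    percentile: float    # e.g. 95.0 for the 95th percentile
    threshold_ms: float


def is_compatible(frontend: LatencySLO, backend: LatencySLO) -> bool:
    """Return True if the frontend's promise is no tighter than the backend's.

    Assumes the frontend calls the backend on every request, so it cannot
    respond faster than the backend does. Only SLOs stated at the same
    percentile are compared; anything else raises ValueError.
    """
    if frontend.percentile != backend.percentile:
        raise ValueError("SLOs must use the same percentile to be compared")
    return frontend.threshold_ms >= backend.threshold_ms


# The example from the text: 100 ms at p95 in front of 500 ms at p95.
frontend = LatencySLO("frontend", percentile=95.0, threshold_ms=100.0)
backend = LatencySLO("backend", percentile=95.0, threshold_ms=500.0)
print(is_compatible(frontend, backend))  # False
```

The check is deliberately conservative: a backend may well perform better than its promise in practice, but the frontend can only rely on what the backend has actually committed to.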
Thus an SLO also mediates the relationship between services. It's an agreement among the multiple teams invested in a service, but also a promise to the teams that depend on it. That means it's important that each SLO is consistently met. We should only miss an SLO due to truly unexpected circumstances -- and if that happens, we take it seriously and prioritize reliability fixes over other work, so that it doesn't happen again.
An SLO is only valuable if it can be meaningfully relied on; an SLO that's regularly not upheld is worse than none at all. Therefore, in writing your SLO, err on the side of caution. Set unambitious targets that you know you'll be able to meet right away, and plan to tighten them later if you like -- rather than failing to meet your initial goals and loosening them until they're achievable.
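One way to choose a target you can meet right away is to start from what the service has already achieved. The following is a minimal sketch, assuming you have a history of measured success ratios, one per past reporting period. The function name, the sample figures, and the safety margin are all hypothetical; the right margin is a matter for discussion with your team, not something a formula decides.

```python
def cautious_initial_target(history: list[float], margin: float = 0.001) -> float:
    """Suggest a starting target slightly below the worst period observed.

    `history` holds one success ratio (0.0 to 1.0) per past period.
    `margin` is an illustrative safety buffer, not a recommended value.
    """
    if not history:
        raise ValueError("need at least one measured period")
    return max(0.0, min(history) - margin)


# Made-up success ratios for four past periods.
past = [0.9993, 0.9987, 0.9996, 0.9990]
target = cautious_initial_target(past)
print(f"{target:.4f}")
```

Starting below the worst period you have seen means the first target is one the service has already met in every period on record, which leaves room to tighten it later.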
Getting started
Make a copy of SLO/Template for your service, and fill it out section by section.
The next pages in this guide will walk you through addressing each of the questions in the template. Each section will take some research, discussion, and sometimes negotiation. Throughout, when you see the 📝 symbol, that's the prompt to return to your SLO document and write up your findings.
There are seven parts, to complete in order:
- Organizational
- Architectural
- Client-facing
- Service level indicators
- Operational
- Service level objectives
- Finalizing
(You might like to skim them all before getting started.)
Now, move on to the first section.
References
- Jones, Wilkes, and Murphy with Smith, "Service Level Objectives" in Site Reliability Engineering, O'Reilly 2016 (free online)
- Alex Hidalgo, Implementing Service Level Objectives, O'Reilly 2020 (WMF Tech copy)