SLO/Runbook

Establishing a new SLO

Make a copy of the template and fill it in, following the instructions.

Updating an SLO

Modifying an SLO has similar coordination needs to writing a new one, in that it's important that affected teams be aware of the change, but the actual update is straightforward.

Tightening

Tightening an SLO means making it more restrictive, like raising an availability target or shortening a latency deadline. Adding a new SLI is also an example of tightening an SLO. Anyone who previously relied on your service will still be satisfied, but the new higher standard may require more effort to meet.

  • Get approval from all teams responsible for supporting the SLO. A tighter SLO means that engineers may be committing to do more response work, or deprioritize other goals in favor of operating to the new higher standard. They should have the opportunity to agree to that commitment, and they should also have a chance to concur with your assessment that the stricter SLO is feasible.
  • Check the SLOs of your service's dependencies, and ensure they can support the new SLO. Even if one of your dependencies has historically exceeded its written target, beware of assuming that will be true forever—unless their SLO is updated too.
  • For dependencies with no published SLO yet, ensure your new target is feasible by consulting with the team and reviewing the dependency's past performance.
  • Finally, edit the published SLO on Wikitech. If you've decided the change won't take effect immediately, leave the existing values in place as well; label the new values as aspirational and/or post the date on which they take effect. It's often simplest to make SLO changes effective at reporting period boundaries: the first day of March, June, September, or December.
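
For a sense of scale, the sketch below shows how much a seemingly small change to an availability target shrinks the error budget. The 91-day reporting period and the 99.5% and 99.9% targets are illustrative assumptions, not values taken from any published SLO.

```python
# Illustrative only: hypothetical reporting period and availability targets,
# showing how the error budget shrinks when the target is tightened.

SECONDS_PER_DAY = 86_400

def error_budget_hours(target: float, period_days: int = 91) -> float:
    """Hours of full unavailability allowed per reporting period."""
    return (1.0 - target) * period_days * SECONDS_PER_DAY / 3600

for target in (0.995, 0.999):
    print(f"{target:.3%} availability -> {error_budget_hours(target):.1f} "
          f"hours of error budget per period")
```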

Loosening

By contrast, loosening an SLO means making it less restrictive, like lowering an availability target or lengthening a latency deadline. Removing an SLI completely is also an example of loosening an SLO. Your service no longer promises to uphold its previous standard along some dimension (even though it may now be more reliable in other ways).

  • Take this step with care and only when necessary. Remember that other teams may have designed their software around the assumption that yours will function as published; if they can't rely on your service as expected, they may have to do substantial engineering work to accommodate the change without adversely impacting their own users.
  • It may not be feasible to get approval from all the teams depending on you, but do inform them as early as possible, using any mailing lists or other means used to announce major system changes. Give them enough prior notice to make arrangements if needed.
  • Finally, edit the published SLO here on Wikitech. If you've decided the change won't take effect immediately, leave the existing values as well, and post the effective date of the change. It's often simplest to make SLO changes effective at reporting period boundaries: the first day of March, June, September, or December.

Other changes

Some SLI changes are neither a tightening nor a loosening, exactly: maybe you used to promise 300 ms latency at the 98th percentile, and now instead you promise 500 ms at the 99th (thus covering more requests but with a longer deadline). Or maybe you add a brand-new freshness guarantee at the expense of latency.
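
To see why such a change is neither strictly tighter nor looser, the sketch below checks a synthetic latency sample against both versions of the SLI from the example above; depending on the shape of the latency distribution, a service can satisfy one and miss the other. The sample and its parameters are invented for illustration.

```python
# Illustrative only: a synthetic latency sample checked against two SLI
# definitions that are neither strictly tighter nor looser than each other.
import random

random.seed(42)
latencies_ms = [random.lognormvariate(4.8, 0.55) for _ in range(10_000)]

def fraction_within(deadline_ms: float) -> float:
    return sum(1 for v in latencies_ms if v <= deadline_ms) / len(latencies_ms)

for deadline_ms, required in ((300, 0.98), (500, 0.99)):
    observed = fraction_within(deadline_ms)
    verdict = "met" if observed >= required else "missed"
    print(f"{observed:.2%} of requests within {deadline_ms} ms "
          f"(need {required:.0%}): {verdict}")
```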

Use your best judgment; if you're not sure, it's safest to follow the coordination steps of both tightening and loosening the SLO, ensuring everyone upstream and downstream of your service is on board with the change.

List of responsible teams

The teams responsible for an SLO are any teams that might have to plan their work in order to meet its targets. Examples may include, but aren't limited to:

  • SRE teams that respond to alerts when the service is having problems.
  • Engineering teams (in Technology or Product, or both) that need to devote their time to meeting the reliability or performance demands of the SLO.
  • Product teams that need to reprioritize planned efforts when the SLO is violated and corrective action is needed.
  • Release Engineering teams that need to manage deployments without impacting the SLO and execute rollbacks when a bug in a new version threatens it.

All the responsible teams should be represented when a new SLO is agreed upon, so that they can agree to the commitments they're being signed up for.

When an SLO is missed

The SLO was written with the service’s expected performance in mind, so when it's violated, something unexpected must have happened, and corrective action is necessary.

Typically, the SLO violation was driven by one of two things:

  • The SLO wasn't met because of one or more major outages. Service was disrupted at a particular, identifiable time.
  • The SLO wasn't met because of a steady or semi-regular level of errors or latency that exceeds the error budget.

In the case of discrete outages, it's critical that the causal factors be identified. The SRE team usually accomplishes this by writing and discussing an incident report—this is how we collectively come to understand what went wrong, and why.

Major action items usually follow naturally: for example, the site was unavailable when a malformed configuration was rolled out, so more comprehensive automated tests should be added to the configuration system. Or the site was unavailable because a load spike was focused on a single point of failure, so that part of the architecture should be redesigned to eliminate the SPOF.

In the case of an elevated background level of errors, we acknowledge that this still represents some underlying change. No service ever performs at 100%, but if the normal level of errors has increased to the point where it exceeds the error budget on its own, then something has changed from its state at the time the SLO was written—a backend became flakier, or an occasional failure became commonplace, or a resource allocation is no longer sufficient—and whatever the cause, action items can be derived just as they would be after a sudden outage.
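
As a concrete illustration with hypothetical figures, the sketch below shows how a steady background error rate can consume the entire budget without any discrete outage.

```python
# Hypothetical figures: a steady background error rate measured against the
# error budget implied by the availability target.
target_availability = 0.995                 # assumed quarterly target
error_budget = 1.0 - target_availability    # 0.5% of requests may fail

background_error_rate = 0.007               # 0.7% of requests failing steadily

if background_error_rate > error_budget:
    print("Background errors alone exceed the budget; corrective action is needed.")
else:
    slack = error_budget - background_error_rate
    print(f"{slack:.2%} of the budget remains for outages and risky changes.")
```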

Either way, the effect of the SLO is to prioritize those action items. Often, a backlog of reliability improvements has sat waiting behind other work; the SLO represents a commitment to prioritize those tasks when it becomes necessary, and the SLO violation represents a signal that it has become necessary.

Even if it means delaying other planned improvements, our existing commitments to our users require that we divert at least some of our effort to accomplishing at least some of the action items highlighted by the SLO violation. The SLO reporting period ends a month before the calendar quarter does, so teams have time to react, agree on what needs to be done, and adjust their planned work for the next quarter.

The other effect, when an SLO violation is observed before the end of the reporting period, is that the error budget has been completely exhausted. There's normally some allowance for imperfect deployments, risky production changes, and other sources of occasional errors, but now that margin has been consumed. Thus, for the remainder of the SLO quarter, and for as long thereafter as the underlying problem continues to threaten future performance, the service should be operated extremely cautiously and conservatively; for example, it may be necessary to postpone risky maintenance or even freeze feature deployments.
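
The budget-exhaustion arithmetic can be illustrated with a short sketch; the target, period length, and downtime figure below are hypothetical.

```python
# Hypothetical figures: how much error budget remains partway through a
# reporting period, and the conservative-operation trigger described above.
MINUTES_PER_DAY = 24 * 60

def remaining_budget_minutes(target: float, period_days: int,
                             downtime_minutes_so_far: float) -> float:
    total = (1.0 - target) * period_days * MINUTES_PER_DAY
    return total - downtime_minutes_so_far

remaining = remaining_budget_minutes(target=0.999, period_days=91,
                                     downtime_minutes_so_far=150.0)
if remaining <= 0:
    print("Error budget exhausted: postpone risky maintenance and deployments.")
else:
    print(f"{remaining:.0f} minutes of error budget remain this period.")
```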

Finally, even though the SLO was written with the intent that it should be met every quarter, an SLO violation carries no particular moral valence—it's nobody's fault. Just as an incident analysis is designed to understand and address the causes of an outage without assigning blame for them, here too we can identify and address the causal factors that led us to miss a shared cross-team objective, rather than assigning fault to any individual or team.

Aspirational SLOs

Once an SLO is published, we're committed to meeting it: users, and client services, expect our services to perform up to the standard we've announced, and if they don't, they expect us to fix it.

Sometimes we're not ready for that yet. We know how reliable we want the service to be, but pending changes—engineering work, staffing, hardware procurement, or others—prevent us from committing to the new SLO for another few months. Other times, there's no specific blocking task, but we want to get more production experience after a major architecture change in order to build operational confidence with the new system.

When this happens, we can still publicize the new values as an aspirational SLO.

These values are for information only. We'll try our best to meet them, but we expect we might fall short, and that's okay. We keep an eye on our performance relative to the aspirational targets—that is, it's good to know whether we would have met the prospective SLO or not—but when we miss, we don't necessarily take corrective action.

SRE teams might, or might not, set paging thresholds based on aspirational SLOs. When the plan is to make it official soon, SREs may choose to switch on the pager early, in order to improve operational awareness—but if staffing constraints are the limiting factor, it may not be practical to commit to incident response. In either case, aspirational SLOs are clearly differentiated, and the major difference is that corrective action is not required when they're “violated.”
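
A minimal sketch of that distinction, using hypothetical service names and targets rather than any real SLO or reporting tool:

```python
# Hypothetical names and values: how a report might treat an official SLO
# differently from an aspirational one when the target is missed.
from dataclasses import dataclass

@dataclass
class SloTarget:
    name: str
    target: float        # e.g. 0.999 for 99.9% availability
    aspirational: bool

def review(slo: SloTarget, measured: float) -> str:
    if measured >= slo.target:
        return f"{slo.name}: met ({measured:.3%} vs. {slo.target:.3%})"
    if slo.aspirational:
        return f"{slo.name}: missed the aspirational target; noted, no corrective action required"
    return f"{slo.name}: violated; corrective action needs to be planned"

print(review(SloTarget("frontend availability", 0.999, aspirational=False), 0.9985))
print(review(SloTarget("new-backend availability", 0.9995, aspirational=True), 0.9987))
```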

Eventually, an aspirational SLO becomes official (optionally after changes are made) or it's rolled back and removed.

Things we don't do: Intentionally burning error budget

As of this writing, we don’t intentionally burn error budget at the Wikimedia Foundation.

For most SLOs, the error budget is not to be exceeded. If an ordinary service targets 99.5% availability, but actually provides 99.8% in some quarter, that’s just fine! If it consistently over-delivers, it may mean that we could divert some resources to more important work, or deploy new features more aggressively, trading off that extra availability for improved velocity. But when we get lucky, that doesn’t constitute a problem to be solved.

But for some services, the story is different. Certain kinds of infrastructure should only ever be used as a soft dependency: in their absence, some functionality may be degraded, but the user experience shouldn’t fail completely. A good example is etcd: it is well suited to storing global configuration because its design chooses strong consistency over high availability. If etcd is unavailable, we can’t update those configuration values, but the cached values persist, and MediaWiki should still be able to serve wiki pages without needing to read them from etcd on every request.
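
A minimal sketch of what treating the store as a soft dependency can look like in code, using a hypothetical client and cache rather than the actual MediaWiki/etcd integration: when a fresh read fails, serve the last cached value (or a default) instead of failing the request.

```python
# Hypothetical client and cache, not the actual MediaWiki/etcd integration:
# treat the configuration store as a soft dependency by serving the last
# cached value (or a default) when a fresh read fails.
import time

_cache: dict[str, tuple[float, str]] = {}    # key -> (fetched_at, value)

def get_config(key: str, client, default: str | None = None) -> str | None:
    """Prefer a fresh read; tolerate store outages by serving cached values."""
    try:
        value = client.get(key)               # may raise if the store is down
        _cache[key] = (time.monotonic(), value)
        return value
    except Exception:
        cached = _cache.get(key)
        if cached is not None:
            _fetched_at, value = cached
            return value                      # possibly stale, but usable
        return default                        # degrade gracefully
```

With this pattern, an outage of the store degrades configuration freshness rather than page availability, which is what keeps the dependency soft.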

In that sense, etcd can be an “attractive nuisance.” An engineer might decide to use etcd for something critical, not fully understanding its reliability characteristics, and so inadvertently introduce a hard dependency on a service that can’t support it. Worse, if etcd were to typically overperform its SLO, the situation could go unnoticed for a long time, but it’s a time bomb: eventually an etcd outage will come along—unsurprisingly, as reflected by the SLO—and create unanticipated levels of user impact.

In order to prevent this situation, Google’s Chubby SRE team (running a service functionally analogous to etcd) famously turns the service off briefly near the end of each quarter, burning off any remaining error budget in order to exactly hit the target value. This ensures nobody can depend on global Chubby’s high availability without soon discovering their mistake.

At the Foundation, we may eventually introduce something like this, but we have no plans to do so in the near or medium term. Consider it an “advanced use case” of SLOs; at a minimum, it relies on a more fully fleshed-out network of published SLOs and their dependencies, and on a more experienced culture of servicing and maintaining SLOs over a number of years.