SLO/Template instructions/Dashboards and alerts
SLO Windows
Experience shows that dashboards and alerts need to be configured from day one, so that events affecting an SLO's error budget are properly followed up. Google's SRE Workbook dedicates a section to choosing the right time window and describes two broad categories:
- Rolling window - A dynamic "rolling" time range from X days ago to 'now'.
- Fixed window - A "fixed" calendar date range from Day X hour 00:00:00 to Day Z hour 23:59:59.
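The difference between the two categories can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the fixed window is assumed to be a quarter starting in January, April, July or October (which may not match the reporting quarters actually in use), and the end of the fixed window is expressed as an exclusive bound rather than 23:59:59.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(now: datetime, days: int) -> tuple[datetime, datetime]:
    """Return (start, end) for a window covering the last `days` days up to `now`."""
    return now - timedelta(days=days), now


def calendar_quarter_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the quarter containing `now`.

    Assumption: quarters start in January, April, July and October.
    The end is exclusive: it is the first instant of the following quarter.
    """
    first_month = 3 * ((now.month - 1) // 3) + 1
    start = datetime(now.year, first_month, 1, tzinfo=now.tzinfo)
    if first_month == 10:
        end = datetime(now.year + 1, 1, 1, tzinfo=now.tzinfo)
    else:
        end = datetime(now.year, first_month + 3, 1, tzinfo=now.tzinfo)
    return start, end


now = datetime(2024, 5, 15, 12, 0, tzinfo=timezone.utc)
print(rolling_window(now, 28))       # both bounds move forward as `now` advances
print(calendar_quarter_window(now))  # bounds stay the same for the whole quarter
```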
What are practical use cases for these different types of windows? From the Google SRE Workbook:
SLOs can be defined over various time intervals, and can use either a rolling window or a calendar-aligned window (e.g., a month). There are several factors you need to account for when choosing the window.
Rolling windows are more closely aligned with user experience: if you have a large outage on the final day of a month, your user doesn’t suddenly forget about it on the first day of the following month.
Calendar windows are more closely aligned with business planning and project work. For example, you might evaluate your SLOs every quarter to determine where to focus the next quarter’s project headcount. Calendar windows also introduce some element of uncertainty: in the middle of the quarter, it is impossible to know how many requests you will receive for the rest of the quarter. Therefore, decisions made mid-quarter must speculate as to how much error budget you’ll spend in the remainder of the quarter.
Shorter time windows allow you to make decisions more quickly: if you missed your SLO for the previous week, then small course corrections—prioritizing relevant bugs, for example—can help avoid SLO violations in future weeks.
Longer time periods are better for more strategic decisions: for example, if you could choose only one of three large projects, would you be better off moving to a high-availability distributed database, automating your rollout and rollback procedure, or deploying a duplicate stack in another zone? You need more than a week’s worth of data to evaluate large multiquarter projects; the amount of data required is roughly commensurate with the amount of engineering work being proposed to fix it.
We have found a four-week rolling window to be a good general-purpose interval. We complement this time frame with weekly summaries for task prioritization and quarterly summarized reports for project planning.
At the WMF we offer both.
SLO Alerting & Operations - The Rolling Window View
How to monitor an SLO?
The Google SRE Workbook outlines a method called burn rate alerting (for more information, please read its "Alert on Burn Rate" section).
The idea is to have alerts that look back over a recent period ending at the current moment, measure the rate at which the error budget is being burned, and project that rate over the same amount of time into the future. Depending on how fast the error budget is being burned, an alert may be raised.
For example, imagine a time window of 3 hours (1.5h of look-back and 1.5h of prediction). A burn rate of 1 means the error budget is consumed exactly by the end of the window. If the burn rate is 0.7, then 70% of the error budget will be gone at the end of the window, and no alert needs to be raised. If the burn rate is 1.4, then 140% of the error budget would be gone by the end of the window, and an alert needs to be raised.
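The arithmetic behind this example can be sketched as follows. The function names, the 99% target and the request counts are assumptions chosen only to reproduce the 0.7 and 1.4 burn rates; they are not taken from any real SLO.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the look-back period, nothing burned
    allowed_error_ratio = 1.0 - slo_target
    return (errors / total) / allowed_error_ratio


def should_alert(rate: float) -> bool:
    """A sustained burn rate above 1 spends more than the whole budget by the end of the window."""
    return rate > 1.0


# Hypothetical look-back data: 10,000 requests against a 99% target.
for errors in (70, 140):
    rate = burn_rate(errors, total=10_000, slo_target=0.99)
    print(f"burn rate {rate:.1f}: {rate:.0%} of the budget spent "
          f"by the end of the window, alert={should_alert(rate)}")
```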
SRE provides multiple standard alerts that apply this concept to the rolling window and take it a step further. The approach is called Multi-burn rate alerting.
Multi-burn rate alerting applies the example above across multiple time windows to implement a range of alert severities. For example, if the budget would be exhausted in 15 minutes, this is a critical issue. If the budget would be exhausted in 15 days, this is a warning.
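A minimal sketch of the idea, assuming a constant burn rate and a full budget at the start. The rule table below is illustrative only: its thresholds follow the example given in the SRE Workbook for a 30-day SLO and are not the values configured for any actual SLO, and the function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Illustrative rules: (look-back window, burn-rate threshold, severity).
RULES = [
    (timedelta(hours=1), 14.4, "critical"),
    (timedelta(hours=6), 6.0, "critical"),
    (timedelta(days=3), 1.0, "warning"),
]


def time_to_exhaustion(slo_window: timedelta, rate: float) -> Optional[timedelta]:
    """Time for a full budget to run out at a constant burn rate."""
    if rate <= 0:
        return None  # nothing is being burned
    return slo_window / rate


def evaluate(burn_rates: dict) -> Optional[str]:
    """Return the most severe alert whose threshold is reached, if any.

    `burn_rates` maps a look-back window (timedelta) to the measured burn rate.
    """
    fired = [severity for window, threshold, severity in RULES
             if burn_rates.get(window, 0.0) >= threshold]
    if "critical" in fired:
        return "critical"
    return fired[0] if fired else None


# With a four-week window, a burn rate of 2688 exhausts the budget in 15 minutes.
print(time_to_exhaustion(timedelta(weeks=4), 2688))  # 0:15:00
print(evaluate({timedelta(hours=1): 20.0}))          # critical
print(evaluate({timedelta(days=3): 1.5}))            # warning
```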
What happens if an event burns the entire error budget of the rolling window (or a sizeable chunk of it)? The rolling window is only an approximation: it is shorter than the quarterly fixed window, so its error budget is proportionally smaller than the overall one. For SLO purposes, always consider the error budget remaining in the quarterly window, and use the rolling window's alerts as a day-to-day operational indication of how error budget consumption is trending.
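To make the proportionality concrete, here is a small sketch that expresses the budget as minutes of full unavailability. It assumes a time-based SLO, a hypothetical 99.9% target, a 28-day rolling window and a 91-day quarter; none of these figures come from a real SLO.

```python
def budget_minutes(window_days: int, slo_target: float) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1.0 - slo_target)


rolling = budget_minutes(28, 0.999)    # about 40 minutes
quarterly = budget_minutes(91, 0.999)  # about 131 minutes

outage = 30  # hypothetical 30-minute outage
print(f"rolling budget spent:   {outage / rolling:.0%}")    # 74%
print(f"quarterly budget spent: {outage / quarterly:.0%}")  # 23%
```

The same outage consumes most of the rolling window's budget but less than a quarter of the quarterly one, which is why the quarterly figure is the one to consider for SLO purposes.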
Rolling Window Tools
The open source project Pyrra is our chosen tool for technical SLO definition, rolling window SLO views, and management of SLO alerting rules. It powers the https://slo.wikimedia.org portal, which provides rolling window dashboard visualizations for all SLOs. When an SLO burn rate alert fires, the alert text links to these visualizations.
Furthermore, there are:
- Centralized SLO definitions and alerts as code, currently via the Puppet repository.
- A set of configurable SLO burn rate alerts to monitor the error budget of the rolling window. These are implemented using Alertmanager.
- A metrics feed, rendered by Grafana, for longer-term fixed-window views.
SLO Reporting - The Fixed Window View
To provide high-level views of SLO performance over time, calendar-based fixed window dashboards are available in Grafana.
These dashboards provide (by default) a visualization of the current and recent quarters, and serve as the interface for SLO reporting and longer-term review.
Behind the scenes, this reporting view brings together the results of the multiple shorter windows (e.g. 4 weeks) described above. This is done primarily for two reasons:
- First, for practical purposes it is not advisable to create alerts or perform daily operations based on fixed time windows. Operators need a real-time view to reason about SLO-impacting events that are happening up to the current moment.
- Second, computing SLOs directly over long periods of time (e.g. quarters) has proven to be computationally very expensive, putting significant load on the Prometheus infrastructure and causing performance and reliability issues. It is more efficient to compute a shorter window and combine the results when displaying longer time spans.
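One way such a combination could work is sketched below. This is an assumption for illustration, not a description of the actual Grafana or Pyrra pipeline: it supposes that each short window is non-overlapping and exposes its event counts, and the `WindowResult` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowResult:
    """Event counts for one short, non-overlapping window (hypothetical shape)."""
    total: int
    errors: int


def combined_availability(windows: list) -> float:
    """Availability over the whole span, weighting each window by its traffic."""
    total = sum(w.total for w in windows)
    errors = sum(w.errors for w in windows)
    return 1.0 if total == 0 else 1 - errors / total


def combined_budget_remaining(windows: list, slo_target: float) -> float:
    """Fraction of the error budget left over the combined span."""
    allowed = sum(w.total for w in windows) * (1 - slo_target)
    errors = sum(w.errors for w in windows)
    return 1.0 if allowed == 0 else 1 - errors / allowed


# Three hypothetical windows making up a longer reporting period.
windows = [
    WindowResult(total=1_000_000, errors=500),
    WindowResult(total=2_000_000, errors=100),
    WindowResult(total=1_000_000, errors=1_400),
]
print(combined_availability(windows))             # close to 0.9995
print(combined_budget_remaining(windows, 0.999))  # close to 0.5
```

Summing counts rather than averaging per-window ratios keeps busy windows from being underweighted; the cost is that each short window must expose counts and not only a percentage.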
For day-to-day operations, we approximate a one-month window with a single "rolling" one. We then apply alerts that tell us the rate at which the error budget is being burned, so that an incident is promptly recognized and worked on.
Fixed Window Tools
Grafana hosts a pair of dashboards (SLO list, SLO details) which provide fixed-window SLO reporting.
- The SLO Quarterly Review dashboard, which provides a high-level overview.
- The SLO Quarterly Drilldown dashboard, a dedicated per-SLO view that represents the calendar window for a given SLO.
Pyrra provides the metrics feed that powers the Grafana dashboards above, allowing customizable fixed-window views.
Caveats
This section collects known caveats and frequent doubts that SRE had while drafting this page, together with their explanations.
Rolling window overlapping two quarters
What happens with a one-month rolling window during the first month of the calendar window? On day two, for example, the alerts are calculated from the last 28 days, most of which fall in the previous quarter and may include past outages that no longer count against our error budget. This can be read in several ways. One of them is that there is no clean cut between the past and the present quarter: if outages burned a lot of error budget in the previous window, we should remain mindful in the new one as well, even though we are entirely free to spend the new error budget as we prefer. An alert would be raised to inform all stakeholders about the situation, and in this case it could be ignored because of a deliberate decision to burn the error budget. The good aspect is that the outage would not go unnoticed!
Big outages disappearing from the rolling window
Let's imagine a rolling window of one month, and a big outage at the beginning of the calendar window that burns 80% of the error budget. At some point during the quarter, the rolling window will no longer include the big outage and will show a much healthier error budget. Which value should we care about? On one side, we have the error budget of the calendar window, which spans 3 months and counts the big outage. On the other side, we have a smaller window of one month that looks mostly healthy. The two views may seem to contradict each other, but they are complementary and represent different points of view on the same problem. The calendar window is the target of the SLA, so every team commits to do work based on the error budget trend over a quarter. In our example, the big outage needs to be taken into consideration when approaching any risky work during the rest of the quarter (like deploying new features, performing upgrades, etc.). The rolling window only tells you how your monthly error budget is going (and remember, it is proportional to the calendar one, just a slice of it), to assist your judgment in day-to-day operations (small bug fixes, service restarts for security upgrades, etc.) and to promptly alert you if an outage is in progress.
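The scenario can be reproduced with a small simulation. Everything in it is hypothetical: a 91-day quarter, a 28-day rolling window, constant traffic, a 99.9% target, a single outage on the third day that burns 80% of the quarterly budget, and an error-free previous quarter.

```python
DAYS_IN_QUARTER = 91      # assumed quarter length
ROLLING_DAYS = 28         # four-week rolling window
ALLOWED_PER_DAY = 1_000   # errors allowed per day (1,000,000 requests at 99.9%)

# Daily error counts: one big outage on day index 2, nothing else.
errors = [0] * DAYS_IN_QUARTER
errors[2] = ALLOWED_PER_DAY * DAYS_IN_QUARTER * 8 // 10  # 80% of the quarterly budget


def remaining(first_day: int, last_day: int, window_days: int) -> float:
    """Fraction of a window's budget left, given errors on days first_day..last_day."""
    budget = ALLOWED_PER_DAY * window_days
    burned = sum(errors[max(first_day, 0):last_day + 1])
    return 1 - burned / budget


def rolling_remaining(day: int) -> float:
    """Budget left in the rolling window ending on `day` (inclusive)."""
    return remaining(day - ROLLING_DAYS + 1, day, ROLLING_DAYS)


def calendar_remaining(day: int) -> float:
    """Budget left in the quarter, counting everything up to `day` (inclusive)."""
    return remaining(0, day, DAYS_IN_QUARTER)


for day in (10, 40):
    print(f"day {day}: rolling {rolling_remaining(day):.0%}, "
          f"calendar {calendar_remaining(day):.0%}")
# day 10: rolling -160%, calendar 20%
# day 40: rolling 100%, calendar 20%
```

On day 10 the rolling budget is far below zero because the outage is larger than a whole rolling budget; by day 40 the outage has left the rolling window, while the calendar view still shows only 20% left.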
What error budget should I pay attention to?
The most important difference between the calendar and rolling windows is that we chose to set the SLA based on the calendar window. This means that the rolling window is exclusively for day-to-day operations, whereas the calendar view should be used to make decisions about the next quarters.
The underlying idea, though, is that both the rolling window's error budget and the calendar window's should be used when making decisions about a given SLO. To help clarify doubts that may arise, here are some common situations that an SLO owner may encounter at any point during a quarter:
| Rolling Window Budget | Calendar Window Budget | Operational Mode | Recommended Actions |
|---|---|---|---|
| High (> 50%) | High (> 50%) | Normal Operations | |
| Low (< 50%) | High (> 50%) | Cautious Operations - Recent Reliability Issues | |
| High (> 50%) | Low (< 50%) | Cautious Operations - Plan Reliability Improvements | |
| Low (< 50%) | Low (< 50%) | Emergency Mode - Prioritize Reliability | |
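The table can also be read as a small decision function. The sketch below is hypothetical: it takes the remaining budgets as fractions, and because the table does not say how to treat a budget of exactly 50%, it assumes that value counts as "Low".

```python
def operational_mode(rolling_budget: float, calendar_budget: float) -> str:
    """Map remaining error budgets (fractions, e.g. 0.8 for 80%) to an operational mode."""
    rolling_high = rolling_budget > 0.5    # assumption: exactly 50% counts as Low
    calendar_high = calendar_budget > 0.5
    if rolling_high and calendar_high:
        return "Normal Operations"
    if calendar_high:
        return "Cautious Operations - Recent Reliability Issues"
    if rolling_high:
        return "Cautious Operations - Plan Reliability Improvements"
    return "Emergency Mode - Prioritize Reliability"


# A healthy rolling window late in a quarter that started with a big outage.
print(operational_mode(rolling_budget=1.0, calendar_budget=0.2))
```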
How to read a Pyrra dashboard
Please check SLO/FAQ#How to read a Pyrra dashboard?
Approvals & Contact information
For help and support at any point in the lifecycle of an SLO (e.g. onboarding, steady-state, changes, offboarding) please create a task for the SRE-SLO project in Phabricator.