Pyrra
Pyrra is a Prometheus-based SLO management and visualization tool. The upstream project is hosted at https://pyrra.dev
Our Pyrra deployment is accessible at https://slo.wikimedia.org, with redirects configured for https://slos.wikimedia.org and https://pyrra.wikimedia.org as well.
Architecture
Pyrra has three components:
- The UI displays SLOs, error budgets, burn rates, etc.
- The API delivers information about SLOs from a backend (like Kubernetes) to the UI.
- A backend watches for new SLO objects and then creates Prometheus recording rules for each.
  - For Kubernetes, there is a Kubernetes Operator available.
  - For everything else, there is a filesystem-based Operator available.
Components
Today we make use of the UI/API and Filesystem components. These are installed using an in-house Debian package (https://gitlab.wikimedia.org/repos/sre/pyrra) and deployed/configured via Puppet to the "titan" (Thanos) hosts.
UI/API component
The UI/API component runs as pyrra-api on the "titan" (Thanos) hosts and serves the web UI seen at https://slo.wikimedia.org
Somewhat confusingly, the API service also connects to API endpoints provided by the backing operator, in our case the filesystem operator.
Filesystem operator component
The filesystem operator runs as pyrra-filesystem on the "titan" (Thanos) hosts. This service processes SLO configuration files placed in /etc/pyrra/configs and outputs recording/alerting rules to /etc/pyrra/output-rules. /etc/pyrra/output-rules in turn is read by the Thanos rule service on the active Thanos rule hosts. The filesystem service also provides an API which is used by the UI component.
Site local specifics
Grouping workaround
Pyrra has a grouping feature, where one SLO definition is duplicated for each label found in the "group by" setting. However, using this setting disables the metrics that Pyrra itself emits about configured SLOs, and those metrics are in turn used to populate the detailed Grafana SLO dashboards.
As a workaround, the grouping is performed in Puppet instead: Puppet outputs a separate SLO config for each of the desired labels.
This has the added benefit of allowing us to explicitly include/exclude labels from the SLO, whereas the Pyrra grouping feature would automatically deploy new SLOs as soon as a new label is found, which is likely not the behavior we want.
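The effect of the workaround can be sketched outside Puppet. The following Python snippet is only an illustration of the idea, not the actual Puppet code: the function name expand_slo, the ALLOWED_SITES list, and the trimmed-down template are hypothetical. It shows one SLO definition being expanded into one independent config per explicitly allowed site, so a newly appearing site gets no SLO until someone adds it to the list.

```python
from copy import deepcopy

# Hypothetical explicit allowlist: a new site gets no SLO until it is added here.
ALLOWED_SITES = ["eqiad", "codfw"]


def expand_slo(name, template, sites):
    """Return {filename: slo_config}, one independent SLO config per site."""
    configs = {}
    for site in sites:
        slo = deepcopy(template)  # avoid sharing nested dicts between sites
        slo["metadata"]["labels"]["pyrra.dev/site"] = site
        configs[f"{name}-{site}.yaml"] = slo
    return configs


# Trimmed-down, illustrative template (the real definition carries a full spec).
template = {"metadata": {"name": "etcd-requests", "labels": {}}}
configs = expand_slo("etcd-requests", template, ALLOWED_SITES)
print(sorted(configs))
```

Because each generated config is a plain, ungrouped SLO, Pyrra treats every site as its own objective, which is what keeps its own metrics available.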
Grafana dashboard for longer range and detail views
Grafana dashboards are deployed to the Pyrra folder and provide overview and detailed views.
These dashboards are based on Prometheus metrics that the Pyrra service itself emits and that are scraped from it.
https://grafana.wikimedia.org/dashboards/f/ea7f9f57-f24f-4f54-ae90-dd88ec7150a1/pyrra
Onboarding a new SLO with Pyrra
To add a new SLO to Pyrra:
Create a Puppet patch. Pyrra SLOs are configured in the profile::pyrra::filesystem::slos puppet module.
Let's look at the Etcd SLO as an example:
# Etcd requests/errors SLO
$datacenters.each |$datacenter| {
  # Etcd is eqiad/codfw only
  if $datacenter in [ 'eqiad', 'codfw' ] {
    pyrra::filesystem::config { "etcd-requests-${datacenter}.yaml":
      content => to_yaml({
        'apiVersion' => 'pyrra.dev/v1alpha1',
        'kind' => 'ServiceLevelObjective',
        'metadata' => {
          'name' => 'etcd-requests',
          'namespace' => 'pyrra-o11y-pilot',
          'labels' => {
            'pyrra.dev/team' => 'serviceops',
            'pyrra.dev/service' => 'etcd',
            'pyrra.dev/site' => "${datacenter}", #lint:ignore:only_variable_string
          },
        },
        'spec' => {
          'target' => '99.9',
          'window' => '12w',
          'indicator' => {
            'ratio' => {
              'errors' => {
                'metric' => "etcd_http_failed_total{code=~\"5..\",site=\"${datacenter}\"}",
              },
              'total' => {
                'metric' => "etcd_http_received_total{site=\"${datacenter}\"}",
              },
            },
          },
        },
      })
    }
  }
}
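To make the 'target' and 'window' values concrete, here is a minimal Python sketch of the error budget arithmetic for a ratio SLO. It is not how Pyrra computes its recording rules internally; the function name error_budget and the request counts are hypothetical, chosen only to illustrate what a 99.9 target means over the 12w window.

```python
from decimal import Decimal


def error_budget(target_percent, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_remaining) for a ratio SLO."""
    # A 99.9 target leaves 0.1% of requests as the error budget.
    allowed_ratio = 1 - Decimal(target_percent) / 100
    allowed_failures = allowed_ratio * total_requests
    remaining = 1 - Decimal(failed_requests) / allowed_failures
    return allowed_failures, remaining


# Hypothetical traffic over the 12w (84-day) window: 1,000,000 requests, 250 of them 5xx.
allowed, remaining = error_budget("99.9", 1_000_000, 250)
print(allowed, remaining)
```

With these made-up numbers the budget is 1000 failed requests, of which 250 are used, leaving 75% of the budget. Decimal is used so that the string target from the config ('99.9') is handled without floating-point rounding.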
A few high-level pointers about SLO definitions:
We loop over labels like datacenter/site, cluster, etc. in Puppet to work around a current shortcoming in Pyrra (described in the Grouping workaround section above). This workaround keeps the metrics that Pyrra emits about each SLO available, which in turn populate the Grafana dashboards that give customized views into our defined SLOs.
The indicator in our example is "ratio"; however, other indicators such as "latency" are supported as well. For example, the Etcd latency SLO is defined with:
'indicator' => {
  'latency' => {
    'success' => {
      'metric' => "etcd_http_successful_duration_seconds_bucket{le=\"0.032\",site=\"${datacenter}\"}",
    },
    'total' => {
      'metric' => "etcd_http_successful_duration_seconds_count{site=\"${datacenter}\"}",
    },
  },
},
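Note the difference from the "ratio" indicator: "ratio" counts the bad events (errors over total), while "latency" counts the good ones (requests in the le="0.032" bucket, i.e. completed within 32 ms, over all requests). A rough Python sketch of that calculation follows. The function name latency_sli, the counts, and the 0.99 target are hypothetical and not taken from our configs.

```python
def latency_sli(within_threshold, total):
    """Fraction of requests that completed within the latency threshold."""
    if total == 0:
        return None  # no traffic in the window, nothing to measure
    return within_threshold / total


# Hypothetical counts over the SLO window:
# within_threshold ~ increase of the le="0.032" bucket, total ~ increase of the _count series.
sli = latency_sli(within_threshold=9_950, total=10_000)

# 0.99 is an illustrative target only.
meets_target = sli >= 0.99
print(sli, meets_target)
```

Since histogram buckets are cumulative, the le="0.032" bucket already includes every faster request, so a single bucket is enough for the "success" side.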