Pyrra

Pyrra is a Prometheus-based SLO management and visualization tool. The upstream project is hosted at https://pyrra.dev

Our Pyrra deployment is accessible at https://slo.wikimedia.org, with redirects configured for https://slos.wikimedia.org and https://pyrra.wikimedia.org as well.

Architecture

Pyrra has three components:

The UI displays SLOs, error budgets, burn rates, etc.
The API delivers information about SLOs from a backend (like Kubernetes) to the UI.
A backend watches for new SLO objects and then creates Prometheus recording rules for each.
- For Kubernetes, there is a Kubernetes Operator available
- For everything else, there is a filesystem-based Operator available

Components

Today we use the UI/API and filesystem components. They are installed from an in-house Debian package (https://gitlab.wikimedia.org/repos/sre/pyrra) and deployed and configured via Puppet on the "titan" (Thanos) hosts.

UI/API component

The UI/API component runs as pyrra-api on "titan" (Thanos) hosts, and serves the web UI seen at https://slo.wikimedia.org

Somewhat confusingly, the API service also connects to API endpoints provided by the backing operator, in our case the filesystem operator.

Filesystem operator component

The filesystem operator runs as pyrra-filesystem on the "titan" (Thanos) hosts. This service processes SLO configuration files placed in /etc/pyrra/configs and writes recording and alerting rules to /etc/pyrra/output-rules. That directory is in turn read by the Thanos rule service on the active Thanos rule hosts. The filesystem service also provides an API, which the UI component uses.

Site local specifics

Grouping workaround

Pyrra has a grouping feature, where one SLO definition is duplicated for each label found in the "group by" setting. However, using this setting disables the metrics that Pyrra itself emits about the configured SLOs, and those metrics are what populate the detailed SLO dashboards in Grafana.

A workaround for this has been set up in Puppet: the grouping is performed in Puppet, which outputs one SLO config for each of the desired labels.

This has the added benefit of letting us explicitly include or exclude labels from the SLO, whereas the Pyrra grouping feature would automatically deploy new SLOs as soon as a new label is found, which is likely not a behavior we want.
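
The expansion can be pictured with a small Python sketch. This is an illustration only, not the Puppet code itself: the function name `expand_slo`, the trimmed-down config shape, and the site `other-dc` are all made up for the example.

```python
def expand_slo(name, label, values, exclude=frozenset()):
    """Return one SLO config per allowed label value, keyed by config file name."""
    configs = {}
    for value in values:
        if value in exclude:
            continue  # explicitly excluded: no SLO is generated for this value
        configs[f"{name}-{value}.yaml"] = {
            "metadata": {
                "name": name,
                "labels": {f"pyrra.dev/{label}": value},
            },
        }
    return configs


# "other-dc" is a made-up site used only to show the exclusion.
configs = expand_slo(
    "etcd-requests", "site", ["eqiad", "codfw", "other-dc"], exclude={"other-dc"}
)
print(sorted(configs))
```

Because the list of values is written out explicitly, a label value that newly appears in the metrics does not produce an SLO until someone adds it to the list.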

Grafana dashboard for longer range and detail views

Grafana dashboards are deployed to the Pyrra folder and provide overview and detailed views.

These dashboards are based on Prometheus metrics that are emitted by, and scraped from, the Pyrra service itself.

https://grafana.wikimedia.org/dashboards/f/ea7f9f57-f24f-4f54-ae90-dd88ec7150a1/pyrra

Onboarding a new SLO with Pyrra

To add a new SLO to Pyrra:

Create a Puppet patch. Pyrra SLOs are configured in the profile::pyrra::filesystem::slos Puppet module.

Let's look at the Etcd SLO as an example:

    # Etcd requests/errors SLO
    $datacenters.each |$datacenter| {
        # Etcd is eqiad/codfw only
        if $datacenter in [ 'eqiad', 'codfw' ] {
            pyrra::filesystem::config { "etcd-requests-${datacenter}.yaml":
                content => to_yaml({
                    'apiVersion' => 'pyrra.dev/v1alpha1',
                    'kind' => 'ServiceLevelObjective',
                    'metadata' => {
                        'name' => 'etcd-requests',
                        'namespace' => 'pyrra-o11y-pilot',
                        'labels' => {
                            'pyrra.dev/team' => 'serviceops',
                            'pyrra.dev/service' => 'etcd',
                            'pyrra.dev/site' => "${datacenter}", #lint:ignore:only_variable_string
                        },
                    },
                    'spec' => {
                        'target' => '99.9',
                        'window' => '12w',
                        'indicator' => {
                            'ratio' => {
                                'errors' => {
                                    'metric' => "etcd_http_failed_total{code=~\"5..\",site=\"${datacenter}\"}",
                                },
                                'total' => {
                                    'metric' => "etcd_http_received_total{site=\"${datacenter}\"}",
                                },
                            },
                        },
                    },
                })
            }
        }
    }
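
To get a feel for what a target of 99.9 means for a ratio indicator, here is a small Python sketch of the conventional error budget arithmetic. It is not taken from Pyrra's code, the function name is hypothetical, and the request counts are invented.

```python
def error_budget_remaining(target_percent, errors, total):
    """Fraction of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_error_ratio = 1.0 - target_percent / 100.0
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    observed_error_ratio = errors / total
    return 1.0 - observed_error_ratio / allowed_error_ratio


# Invented counts over one window: 250 failed requests out of 1,000,000.
remaining = error_budget_remaining(99.9, errors=250, total=1_000_000)
print(f"{remaining:.2%}")
```

With a 99.9 target, 0.1% of requests may fail over the window. In this made-up case 0.025% failed, so roughly three quarters of the budget is left.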

A few high level pointers about SLO definitions:

We loop over labels like datacenter/site, cluster, etc. in Puppet to work around a current shortcoming in Pyrra (described in the Grouping workaround section above). This workaround keeps Pyrra emitting the metrics that the Grafana dashboards use for customized views into our defined SLOs.

The indicator in our example is "ratio", but other indicators such as "latency" are supported as well. For example, the Etcd latency SLO is defined with:

    'indicator' => {
        'latency' => {
            'success' => {
                'metric' => "etcd_http_successful_duration_seconds_bucket{le=\"0.032\",site=\"${datacenter}\"}",
            },
            'total' => {
                'metric' => "etcd_http_successful_duration_seconds_count{site=\"${datacenter}\"}",
            },
        },
    },
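
In the latency indicator above, "success" is the histogram bucket counting requests that completed within 0.032 seconds, and "total" is the count of all requests. A minimal Python sketch of the resulting ratio follows. The bucket values and the function name are invented for illustration.

```python
import math


def latency_success_ratio(cumulative_buckets, threshold_seconds):
    """Fraction of requests at or below the threshold, from cumulative histogram buckets."""
    # In a Prometheus histogram the +Inf bucket equals the _count series.
    total = cumulative_buckets[math.inf]
    if total == 0:
        return 1.0  # no requests observed, so none were slow
    return cumulative_buckets[threshold_seconds] / total


# Invented cumulative bucket counts, keyed by the "le" upper bound in seconds.
buckets = {0.008: 700, 0.032: 950, 0.128: 990, math.inf: 1000}
ratio = latency_success_ratio(buckets, 0.032)
print(f"{ratio:.1%}")
```

The threshold has to match an existing bucket boundary (an "le" value) of the histogram, since the bucket series is the only place that count is recorded.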