
Sloth

From Wikitech
(Redirected from Pyrra)
https://sloth.dev

Sloth is the tool that SRE uses to manage SLO (service level objective) metrics, dashboards, and alerts. It is a Prometheus-based SLO management tool. The upstream project is hosted at https://sloth.dev.

Our Sloth deployment is accessible in two ways:

  • Grafana, which exposes two dashboards: Sloth SLO Detail and Sloth SLO High Level.
  • A GitLab repository in which SLO manifests are managed as code: https://gitlab.wikimedia.org/repos/sre/slothslos

Architecture

There are three components to our Sloth deployment:

  • A GitOps repository, which provides the configuration interface and validation for SLO changes.
  • The Sloth software, which runs on the Thanos (titan) hosts. It translates SLO manifests into Prometheus recording rules and deploys them to the metrics and alerting infrastructure.
  • A set of Grafana dashboards, which provide the user-facing SLO UI and display SLOs, error budgets, burn rates, etc.
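To make the translation step more concrete, the sketch below mimics how the {{.window}} placeholder used in SLO manifest queries could be filled in once per rate window. It is an illustration, not Sloth's actual code: the WINDOWS list and the render_query helper are assumptions made for this example, and the real set of windows is chosen by Sloth.

```python
# Illustrative only: mimics how a "{{.window}}" placeholder in an SLI query
# could be filled in once per rate window. This is not Sloth's actual code.

# Assumed window list for illustration; the real set is chosen by Sloth.
WINDOWS = ["5m", "30m", "1h", "6h"]


def render_query(template: str, window: str) -> str:
    """Replace the {{.window}} placeholder with a concrete range."""
    return template.replace("{{.window}}", window)


# Query template in the same style as the SLO manifests in this repository.
total_query = 'sum(rate(etcd_http_received_total{site="eqiad"}[{{.window}}]))'

# One concrete PromQL expression per window.
rendered = {w: render_query(total_query, w) for w in WINDOWS}
for window, query in rendered.items():
    print(f"{window}: {query}")
```

Each rendered expression could then back one recording rule, which is why a manifest only needs to state the query once.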

Site-local conventions

Revisions

We've implemented support for a revision label through a site-local SLI plugin. This lets us distinguish between different versions of the metric expressions that form an SLO as they change over time.

When an SLO is first onboarded, its revision is 1. Then, as SLO-impacting changes are made, the revision label may be incremented in the patchset to signify a new "formula" behind the SLO. The Grafana dashboards automatically detect the available revision label values and present them as a drop-down.
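As a rough illustration of what detecting the available revision values means, the sketch below collects the distinct revision values from a set of series labels. The available_revisions helper and the sample label sets are hypothetical; the dashboards do this through Grafana itself, not through this code.

```python
def available_revisions(series_labels):
    """Return the distinct revision label values, lowest first."""
    values = {
        labels["revision"] for labels in series_labels if "revision" in labels
    }
    # Label values are strings, so sort them numerically rather than lexically.
    return sorted(values, key=int)


# Hypothetical label sets, as a metrics query might return them.
series = [
    {"service": "etcd", "revision": "1"},
    {"service": "etcd", "revision": "2"},
    {"service": "etcd", "revision": "1"},
]

print(available_revisions(series))
```

Sorting numerically keeps revision 10 after revision 9, which a plain string sort would not.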

Name convention

We aim to prevent overlap between the service name and the SLO name (the name field of each slos entry), because Sloth combines them into a Sloth ID of the form $service-$slosname.

CI will notify you if the service name overlaps with the SLO name. For instance, it would complain about service: etcd with slos name: etcd-latency-eqiad, since the resulting ID would be etcd-etcd-latency-eqiad. Changing this to service: etcd with slos name: latency-eqiad resolves the condition.
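The naming rule can be sketched as follows. This is a minimal illustration, assuming that "overlap" means the SLO name equals the service name or starts with it followed by a hyphen; the actual CI check may be implemented differently, and both function names are hypothetical.

```python
def sloth_id(service: str, slo_name: str) -> str:
    """Combine service and SLO name in the $service-$slosname form."""
    return f"{service}-{slo_name}"


def name_overlaps(service: str, slo_name: str) -> bool:
    """Hypothetical check: does the SLO name repeat the service name?"""
    return slo_name == service or slo_name.startswith(f"{service}-")


# The overlapping name from the example above produces a doubled prefix.
print(name_overlaps("etcd", "etcd-latency-eqiad"), sloth_id("etcd", "etcd-latency-eqiad"))

# The corrected name does not overlap and yields a clean ID.
print(name_overlaps("etcd", "latency-eqiad"), sloth_id("etcd", "latency-eqiad"))
```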

Onboarding a new SLO with Sloth

To add a new SLO to Sloth, upload a slothslos patch via GitLab.

Let's look at the etcd SLO as an example:

# slothslos/etcd/etcd.yml

version: "prometheus/v1"
service: "etcd"
labels:
  team: "serviceops"
  revision: 1
slos:
  - name: "latency-eqiad"
    objective: 99.8
    description: "etcd latency SLO - 32ms threshold - eqiad"
    labels:
      site: "eqiad"
    sli:
      events:
        error_query: |
          (
            sum(rate(etcd_http_successful_duration_seconds_count{site="eqiad"}[{{.window}}]))
            -
            sum(rate(etcd_http_successful_duration_seconds_bucket{site="eqiad",le="0.032"}[{{.window}}]))
          )
        total_query: sum(rate(etcd_http_successful_duration_seconds_count{site="eqiad"}[{{.window}}]))
    alerting:
      name: "SLOBudgetBurn"
      labels:
        service: "etcd"
        site: "eqiad"
      annotations:
        summary: "etcd latency is below 99.8% target (>32ms) in eqiad"
      page_alert:
        disable: false
        labels:
          severity: warning  #TODO: increase to critical in the future
      ticket_alert:
        disable: false
        labels:
          severity: task

  - name: "latency-codfw"
    objective: 99.8
    description: "etcd latency SLO - 32ms threshold - codfw"
    labels:
      site: "codfw"
    sli:
      events:
        error_query: |
          (
            sum(rate(etcd_http_successful_duration_seconds_count{site="codfw"}[{{.window}}]))
            -
            sum(rate(etcd_http_successful_duration_seconds_bucket{site="codfw",le="0.032"}[{{.window}}]))
          )
        total_query: sum(rate(etcd_http_successful_duration_seconds_count{site="codfw"}[{{.window}}]))
    alerting:
      name: "SLOBudgetBurn"
      labels:
        service: "etcd"
        site: "codfw"
      annotations:
        summary: "etcd latency is below 99.8% target (>32ms) in codfw"
      page_alert:
        disable: false
        labels:
          severity: warning
      ticket_alert:
        disable: false
        labels:
          severity: task

  - name: "requests-eqiad"
    objective: 99.9
    description: "etcd request success rate SLO - eqiad site"
    labels:
      site: "eqiad"
    sli:
      events:
        error_query: sum(rate(etcd_http_failed_total{code=~"5..",site="eqiad"}[{{.window}}]))
        total_query: sum(rate(etcd_http_received_total{site="eqiad"}[{{.window}}]))
    alerting:
      name: "SLOBudgetBurn"
      labels:
        service: "etcd"
        site: "eqiad"
      annotations:
        summary: "etcd request success rate is below 99.9% target in eqiad"
      page_alert:
        disable: false
        labels:
          severity: warning
      ticket_alert:
        disable: false
        labels:
          severity: task

  - name: "requests-codfw"
    objective: 99.9
    description: "etcd request success rate SLO - codfw site"
    labels:
      site: "codfw"
    sli:
      events:
        error_query: sum(rate(etcd_http_failed_total{code=~"5..",site="codfw"}[{{.window}}]))
        total_query: sum(rate(etcd_http_received_total{site="codfw"}[{{.window}}]))
    alerting:
      name: "SLOBudgetBurn"
      labels:
        service: "etcd"
        site: "codfw"
      annotations:
        summary: "etcd request success rate is below 99.9% target in codfw"
      page_alert:
        disable: false
        labels:
          severity: warning
      ticket_alert:
        disable: false
        labels:
          severity: task
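
To put the objective values in the manifest above into perspective, the sketch below works through the error budget arithmetic. It assumes a 30-day SLO window, which is not stated in the manifest, and the function names are made up for this example. Under that assumption a 99.8% objective leaves a 0.2% error budget, or about 86.4 minutes of complete failure per window, and a 99.9% objective leaves about 43.2 minutes.

```python
def error_budget_ratio(objective_percent: float) -> float:
    """Fraction of events allowed to be bad, e.g. 99.8 -> roughly 0.002."""
    return (100.0 - objective_percent) / 100.0


def budget_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Minutes of total failure that would use up the whole budget.

    window_days defaults to 30, which is an assumption for this sketch.
    """
    return error_budget_ratio(objective_percent) * window_days * 24 * 60


def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / error_budget_ratio(objective_percent)


# Objectives taken from the etcd manifest: latency (99.8) and requests (99.9).
for objective in (99.8, 99.9):
    print(objective, round(budget_minutes(objective), 1))

# A sustained 0.4% error ratio against the 99.8% objective spends the
# budget at about twice the sustainable pace.
print(round(burn_rate(0.004, 99.8), 2))
```

A burn rate above 1.0 sustained over the whole window exhausts the budget before the window ends, which is the condition that burn-rate alerts such as SLOBudgetBurn are meant to catch early.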