Alertmanager

From Wikitech

What is it?

Alertmanager is the service (and software) in charge of collecting, de-duplicating and sending notifications for alerts across WMF infrastructure. It is part of the Prometheus ecosystem, and Prometheus itself therefore has native support to act as an Alertmanager client. The alerts dashboard, implemented by Karma, can be reached at https://alerts.wikimedia.org/. As of Jan 2021 the dashboard is available to SSO users only (nda and wmf LDAP groups), though a read-only version could be implemented as well.

Alertmanager is being progressively rolled out as the central place where all alerts are sent; the rollout proceeds in phases according to the alerting infrastructure roadmap. As of Jan 2021 LibreNMS has been fully migrated, with more services to come.

Alertmanager production deployment in Jan 2021

User guide

Onboard

This section guides you through onboarding onto Alertmanager. The first step is deciding what you'd like to happen to incoming alerts (what Alertmanager sends out for an alert is called a notification in AM parlance). In other words, alerts are routed according to their team and severity labels. Consider the following routing examples for alerts with a fictional team=a-team label:

  • Alerts with label severity=critical will notify #a-team on IRC, and email a-team@
  • Alerts with label severity=warning will notify #a-team on IRC
  • Alerts with label severity=task will create tasks in the #a-team Phabricator project

Alertmanager configuration is held in Puppet file modules/alertmanager/templates/alertmanager.yml.erb.

You'll have a different receiver depending on the notifications you'd like to send out. Each receiver tells Alertmanager what to do with the alert; for the example above we would have:

    - name: 'a-ircmail'
      webhook_configs:
        - url: 'http://.../a-team'
      email_configs:
        - to: 'a-team@...'
    - name: 'a-irc'
      webhook_configs:
        - url: 'http://.../a-team'
    - name: 'a-task'
      webhook_configs:
        - url: 'http://.../alerts?phid=<phabricator_project_id>'

The resulting routing configuration first matches on team and then routes according to severity to select a receiver:

    # A-team routing
    - match:
        team: a-team
      routes:
        - match:
            severity: critical
          receiver: a-ircmail
        - match:
            severity: warning
          receiver: a-irc
        - match:
            severity: task
          receiver: a-task


The routing tree can be explored and tested using the online routing tree editor. The routing configuration is managed by Puppet and changes are relatively infrequent: on-/off-boarding teams, changing emails, and so on. For a practical example see the patch to add Traffic team alerts.

Create alerts

With alert routing set up, you can start creating alerts for Alertmanager to handle. Alerts are defined as Prometheus alerting rules: the alert's metric expression is evaluated periodically and all metrics matching the expression are turned into alerts. Alert names must be in CamelCase. Consider the following example alert on etcd request latencies:

groups:
- name: etcd
  rules:
  - alert: HighRequestLatency
    expr: instance_operation:etcd_request_latencies_summary:avg5m > 50000
    for: 5m
    labels:
      severity: critical
      team: sre
    annotations:
      summary: "etcd request {{ $labels.operation }} high latency"
      description: "etcd is experiencing high average five minutes latency for {{ $labels.operation }}: {{ $value }}ms"
      dashboard: https://...
      runbook: https://...

The example defines an alert named HighRequestLatency based on the instance_operation:etcd_request_latencies_summary:avg5m metric. When the expression yields results for more than five minutes, an alert fires for each metric returned by the expression. Each alert carries the rule's labels in addition to the metric's own labels; these labels are used to route the alert. The alert's annotations provide guidance to the humans handling the alert, by convention using the following:

summary
Short description of the problem, used where brevity is needed (e.g. IRC)
description
A more extensive description of the symptom and its possible causes. This field will be likely read before jumping to the runbook.
dashboard
A link to the dashboard for the service/problem/etc. TODO if not available yet.
runbook
A link to the service's runbook to follow, ideally linking to the specific alert. TODO if not available yet.

Annotations and labels can be templated as showcased above; the $labels.foo variable lets you access the metric's label foo value and thus make full use of Prometheus' multi-dimensional data model. The Prometheus template examples and template reference are good documents to get you started.

It is worth noting at this point one key difference between Icinga's alert model and Prometheus': Icinga knows about all possible alerts that might fire, whereas Prometheus evaluates an expression. The evaluation might result in one or more alerts, depending on the expression's results; there's no explicit list of all possible labels combinations for all alerts. Also note that alert names don't have to be unique: the same alert can come from different expressions with different labels, for example to implement warning vs critical thresholds. Similarly, different systems can all raise the same alert (e.g. a generic HTTPErrorsHigh alert may be fired from any HTTP handling service if you so choose).

Alerting rules are committed to the operations/alerts repository and, once merged, are deployed automatically at the next Puppet run (see alerts::deploy::prometheus for more information on deployment). When writing alerting rules, make sure to include unit tests as per the Prometheus documentation: unit tests are run automatically by CI, or locally via tox (refer to the repo's README for installation instructions). To test an alert's expression you can also evaluate it at https://thanos.wikimedia.org; make sure to read the section below about the "site" label to learn more about local and global alerts.
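As a sketch of what such a unit test could look like for the HighRequestLatency example above (the file names and series values are hypothetical; check the repo's README for the exact layout):

```yaml
# test_etcd.yaml -- promtool unit test file (hypothetical name)
rule_files:
  - etcd_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    # latency stays at 60000 (above the 50000 threshold) for 10 minutes
    input_series:
      - series: 'instance_operation:etcd_request_latencies_summary:avg5m{operation="GET"}'
        values: '60000x10'
    # make "site" available as an external label (see the "site" label section)
    external_labels:
      site: eqiad
    alert_rule_test:
      - eval_time: 10m
        alertname: HighRequestLatency
        exp_alerts:
          - exp_labels:
              operation: GET
              severity: critical
              team: sre
```

Such a file can be run locally with promtool test rules test_etcd.yaml, which is what the repo's tox setup wraps.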

The "site" label

This section documents usage of the site label in alerts. The label is attached automatically to metrics belonging to a particular site (sometimes known as data center), you will see the label in metrics when running queries from Thanos. By default alerts are evaluated independently in each site for reliability reasons, and the site label is available as a so-called "external label" (i.e injected by Prometheus when talking to other systems like Alertmanager or Thanos).

What does this mean in practice when writing alerts and tests?

  • The site label can be accessed with {{ $externalLabels.site }} in string expansion (e.g. alert description or summary)
  • Make sure to include the following snippet in each of your alerts tests to make site available as an external label (eqiad in this case)
external_labels:
  site: eqiad

If your alerts require a global view (e.g. alerting on the metrics difference between eqiad and codfw) see the #Local and global alerts section below. Further, do not hesitate to reach out to SRE Observability.

Grafana alerts

It is possible to send Grafana notifications to Alertmanager and get notified accordingly. While using Grafana for alerting is supported, the recommended way (assuming your metrics are in Prometheus and not Graphite) to manage your alerts is to commit Prometheus alerting rules to git as mentioned in the section above.

To configure a new alert follow the instructions below:

  1. Edit the panel and select the "Alert" tab (dashboards with template variables are not supported in alerts as per upstream issue) then "create alert".
  2. Fill in the rule name; this is the alert's name showing up on the alerts dashboard: alerts with the same name and different labels will be grouped together. A useful convention for alert names is to be symptom-oriented and CamelCased without spaces; see also the examples above.
  3. The "evaluate every" field must be set to "1m" to get Alertmanager "alert liveness" logic to work, while the "for" field indicates for how long a threshold must be breached before the alert fires.
  4. Select the conditions for the alert to fire, see also Grafana's create alert documentation
  5. In the notifications section, add "AlertManager". The "message" text area corresponds to the alert's description annotation and is used as a short but indicative text about the alert. In this field you can use templated variables from the alert's expression as per Grafana documentation.
  6. Add the alert's tags: these must contain at least team and severity for proper routing by Alertmanager (see also section above for a detailed description). The dashboard's panel will be linked automatically as the alert's "source" and is available both e.g. in email notifications and on the alerts dashboard.

Silences & acknowledgements

In Alertmanager a silence is used to mute notifications for all alerts matching the silence's labels. Unlike Icinga, silences exist independently of the alerts they match: you can create a silence for alerts that have yet to fire (useful, for example, when turning up hosts and/or services not yet in production).

To create a new silence, select the crossed bell at the top right of https://alerts.wikimedia.org to bring up the silence creation form. Then add the matching label names and their values, the silence's duration (hint: you can use the mouse wheel to change the duration's hour/day), and a comment, then hit preview. Any firing alerts that match will be displayed in the preview; finally, hit submit. At the next interface refresh the alert will be gone from the list of active alerts.

The silence form is also available pre-filled via each alert group's three vertical dots and via the alert's duration dropdown, as illustrated below. When using the pre-filled silence form, make sure to check the labels and add/remove them as intended.

Alertmanager alert silence dropdown
Alertmanager group silence dropdown



Within Alertmanager there is no concept of acknowledgement per se; however, any silence whose comment starts with ACK! is treated as an acknowledgement. Such silences are periodically checked and their expiration extended until no matching alerts are firing anymore. The acknowledgement functionality is also available from the UI via the "tick mark" next to each alert group; clicking the button acknowledges the whole alert group. For more information see https://github.com/prymitive/kthxbye#current-acknowledgment-workflow-with-alertmanager.

Finally, by default https://alerts.wikimedia.org will show only active alerts (@state=active filter). To show the suppressed (acked, silenced) alerts use @state=suppressed dashboard filter. To list the current silences: click on the "crossed bell" icon on top right, then click on the "browse" tab.

FAQ

I'm part of a new team that needs onboarding to Alertmanager, what do I need to do?

Broadly speaking, the steps to be onboarded to AM are the following:

  1. Pick a name for your team; this is the team label value to be used in your alerts. A short but identifiable name is recommended. Also note that your production hosts might already have a team defined: check /var/lib/prometheus/node.d/role_owner.prom. If that is the case, you already have a team name assigned.
  2. Decide how different alert severities will reach your team (e.g. critical alerts should go to IRC channel #team and email team@). This is achieved by routing alerts as described in the #Onboard section.
  3. Start sending alerts to Alertmanager! Depending on your preferred method, you can create Prometheus-based alerts and/or send alerts from Grafana.

What does the "team" label mean in this context?

Every alert must have a "team" label for proper tracking of ownership. In this context a team is considered the same as a team at the organizational level (e.g. SRE, Reading Web, Performance, Observability). Occasionally there are "teams" in a broader sense that don't map to the organization (e.g. "netops", "noc"); these are to be avoided where possible, and teams should exist at the org level.

Can I display less information on the alerts dashboard?

Yes. To show only a list of alert groups do the following:

  1. From preferences (cog menu, top right) set "default alert group display" to "always collapsed"
  2. Set "minimal alert group width" to 800 pixels, this will limit the groups displayed per line to one (or two, on very wide browser windows)
  3. Reload the page, you'll see the "severity" multi grid collapsed
  4. Expand each severity grid line by clicking on the right hand side caret (^). The grid will stay expanded when karma itself refreshes, you'll need to expand again on page reload though.

How can I browse alerts history?

All firing and resolved alerts are logged through a webhook into logstash. Individual alert labels/annotations are logged as separate fields for easier filtering/reporting. Make sure to check out the alerts logstash dashboard to browse history.

How do I access the API?

If you'd like to talk to Alertmanager's API (swagger spec for V2), make sure your hostname is in profile::alertmanager::api::rw or profile::alertmanager::api::ro depending on the level of access required, and use one of the http://alertmanager-<site>.wikimedia.org endpoints, where site is either eqiad or codfw; the two endpoints are clustered and replicated.
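As a minimal sketch of talking to the v2 API from an allowed host (the endpoint host and silence details below are illustrative, not prescriptive), the JSON body for creating a silence via POST /api/v2/silences can be built like this:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(matchers, hours, author, comment):
    """Build the JSON body expected by POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex}
            for name, value, is_regex in matchers
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

# Example: silence HighRequestLatency for two hours
body = build_silence([("alertname", "HighRequestLatency", False)], 2,
                     "alice", "planned maintenance")

# On a host listed in profile::alertmanager::api::rw this request would
# create the silence; left commented out here on purpose.
req = urllib.request.Request(
    "http://alertmanager-eqiad.wikimedia.org/api/v2/silences",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment on an allowed host
```

For most one-off operations amtool (see the Operations section) is simpler; the API is mainly useful for automation.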

Do alert names have to be unique?

No; in fact it is encouraged to re-use the same alert name across multiple services and rules (e.g. the same alert name with varying severity and thresholds, or re-using a generic "JVMTrashing" alert name across different services/jobs/etc.).

I just +2'd and merged my alerts.git change, what's next?

After the merge your alerts.git change will be auto-deployed at the next Puppet run. If you have permissions and are in a hurry, you can force a deploy with cumin:

sudo cumin C:alerts 'run-puppet-agent'

How do I get an alert to open a task?

The tl;dr is that you'll need an appropriate route configured in Alertmanager with a suitable receiver for your Phabricator project. Once that is done, your team's alerts with severity=task will open a task as needed.

Refer to the Alertmanager#Onboard section on how to get your team set up in Alertmanager. The Alertmanager configuration contains examples for existing teams you can use as a template. Note that tasks opened as a result of alerts will not be closed once the alert stops firing.

My new alert should be firing but it isn't

Try at least the following:

  • Evaluate the alert's expression at https://thanos.wikimedia.org and verify it returns results; keep in mind the difference between local and global alerts (see the #Local and global alerts section).
  • Check that the file's deploy-tag (and deploy-site, if present) targets the Prometheus instance(s) that actually have the metric, and look for pint_problem metrics reporting linting failures.

If none of the above gives any clue, ask for help.

The IRC bot has quit my channel and hasn't joined yet, what's up?

The IRC bot (jinxer-wm) joins its configured channels (set via Alertmanager routing) on demand, when an alert first fires, and then stays in the channel until the bot is restarted. Re-joins therefore happen automatically and on demand. If you know of an alert that should have notified IRC but didn't, please reach out to SRE Observability.

Local and global alerts

Alerting rules in alerts.git are evaluated on each site-local Prometheus (main sites and PoPs included) by default, for reliability reasons (e.g. the alert semantics don't depend on a site being reachable by the others). In the majority of cases local (i.e. per-site) alerts are what we want, and they make it easy to e.g. silence all alerts from a given site. In some cases, however, a global view is needed, for example when monitoring traffic levels across the whole infrastructure.

There are significant risks to think about when deploying global alerts, mostly centered around reliability: global alerts rely on Thanos for querying all Prometheus hosts, and query failures may happen, so it is important to consider what that means for the alert. When evaluating rules, Thanos aborts the query on partial responses by default; in other words the alert might be missing some of the symptoms when that happens. Contrast this with local alerts, which are evaluated by the local Prometheus and are therefore as reliable as Prometheus itself.

deploy-tag

This facility allows you to select a specific Prometheus instance (local alerts) or Thanos (global alerts). The tag is mandatory; place it as a comment at the beginning of your alert file. For example:

# deploy-tag: ops

will deploy the alerts within the file to the ops instance only. Shell-style globs are supported, so the following targets all k8s instances:

# deploy-tag: k8s*

Additionally, you can use a comma-separated list of instances to deploy to. Using this tag also enables linting of your alerts via pint (for example, checking that the metrics didn't disappear). Linting failures are reported in the pint_problem metric; see also https://phabricator.wikimedia.org/T309182 for more context on this work.
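For example (assuming the comma-separated form takes instance names exactly as the single-instance form does), a file targeting both the ops and k8s instances could use:

```
# deploy-tag: ops,k8s
```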

If you are not sure which instance(s) to deploy to, and your metrics are already live, you can search for them on https://thanos.wikimedia.org and look for the value of the prometheus label.

Deploy alerts to specific sites

Using deploy-site you can restrict the sites to which the alerts file is deployed. The feature is useful both stand-alone (e.g. partial rollouts of alerts) and in combination with absent() (e.g. when the targeted service is deployed only in a subset of sites). To select the sites use the following syntax:

# deploy-site: <comma-separated list of sites>
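For example, to deploy an alerts file only to the two main sites:

```
# deploy-site: eqiad,codfw
```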

The site label

The site label is one of the notable differences between local and global alerts: for local alerts the label is not present in the metrics themselves (e.g. "by (site)" grouping isn't possible), though it is available as a fixed external label (in other words, attached by Prometheus itself; see also Prometheus configuration and alerting rules) and can be used in annotation templates with {{ $externalLabels.site }}. For global alerts, site is a label like any other and can be used normally, e.g. for grouping.

To deploy a global alert, include the following at the beginning of your alerting rule file. The file will be deployed to the Thanos rule component; make sure to also read about the caveats of global alerts in the section above!

# deploy-tag: global
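As a sketch of a global rule that uses site as a regular label (the alert name and recording rule below are hypothetical):

```yaml
# deploy-tag: global
groups:
  - name: traffic_global
    rules:
      - alert: SiteTrafficShareTooHigh
        # under Thanos (global) evaluation "site" is a regular metric
        # label, so grouping with by (site) is possible
        expr: |
          sum by (site) (job:requests:rate5m)
            / ignoring (site) group_left
          sum (job:requests:rate5m)
          > 0.9
        for: 15m
        labels:
          team: sre
          severity: warning
        annotations:
          summary: "site {{ $labels.site }} serves more than 90% of traffic"
```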

Software stack

When talking about the Alertmanager stack as a whole it is useful to list its components as deployed at the Wikimedia Foundation:

  • Alertmanager: the daemon actually in charge of handling alerts and sending out notifications
  • alertmanager-irc-relay: forwards alerts to IRC channels
  • Karma: the dashboard/UI for Alertmanager alerts; it powers https://alerts.wikimedia.org
  • kthxbye: implements the "acknowledgement" functionality for alerts
  • phalerts: a simple service implementing the Alertmanager webhook receiver API that creates/updates Phabricator tasks based on alert notifications from Alertmanager
  • prometheus-icinga-exporter: a compatibility shim that forwards active Icinga alerts to Alertmanager and also provides Prometheus-style metrics for Icinga

Upgrading phalerts

Ensure your git remotes look like this:

origin	git@gitlab.wikimedia.org:repos/sre/phalerts.git (fetch)
origin	git@gitlab.wikimedia.org:repos/sre/phalerts.git (push)
upstream	https://github.com/knyar/phalerts.git (fetch)
upstream	https://github.com/knyar/phalerts.git (push)

Make sure you have fetched the latest changes from both "upstream" and "origin" repositories using the following commands:

git fetch upstream
git fetch origin

Find the commits that are unique to each repository:

  1. Commits in "upstream" but not in "origin": git log --oneline upstream/master ^origin/master
  2. Commits in "origin" but not in "upstream": git log --oneline origin/master ^upstream/master

Create a new tag based on the latest upstream commit:

upstream_branch=upstream/master
new=$(git log --date=format:%Y%m%d --pretty=0+git%cd.%h -1 $upstream_branch)
git checkout packaging-wikimedia
git tag upstream/$new $upstream_branch
git push origin upstream/$new
git merge --no-ff upstream/$new
dch -v "${new}-1" -m "New upstream release"
# check debian/changelog for correctness, then:
git add debian/changelog
git commit -m "New upstream release ${new}-1"
git push origin

Notifications

As of Jan 2021, Alertmanager supports the following notification methods:

  • email - sent by Alertmanager itself
  • IRC - via the jinxer-wm bot on Libera.chat
  • phabricator - through the @phaultfinder user
  • pages - sent via Splunk Oncall (formerly known as VictorOps)

Notification preferences are set per-team and are based on the alert's severity (the team and severity labels attached to the alert, respectively).

Operations

Alerts

Cluster configuration is out of sync

The condition might happen if the Alertmanager cluster hosts can't communicate with each other, or the connection between them is flapping/lossy. Check the logs of the prometheus-alertmanager unit on the alert hosts for hints about what's wrong.

Alert linting found problems

We run pint to find problems in alerting rules, both offline (in CI) and online (against a live Prometheus instance). The exact fix depends on the nature of the problem. A few common problems are the following:

Prometheus didn't have any series (promql/series)

The metric in question can't be found on the Prometheus instance. Perhaps the metric disappeared (e.g. it was renamed by a software upgrade) and the alert must be adjusted accordingly. If this is a new alert, it is also possible that the selected Prometheus instance (ops, k8s, etc.) doesn't have the metric and isn't supposed to; in that case adjust the instance selector (the "deploy-tag" comment at the beginning of the alert file) and/or the site (the "deploy-site" comment).

rate() should be used with counters but <foo> is a gauge (promql/rate)

The metric's metadata reports it as a gauge, yet the alert expression applies rate() to it. Most likely the metric has been mis-typed as a gauge whereas in reality it is a counter (i.e. never decreasing), in which case the underlying software should be fixed; if the metric really is a gauge, consider using delta() or deriv() instead.

Add a silence via CLI

The amtool command provides easy access to the Alertmanager API from the command line. For example, to add a silence (e.g. on alert1001):

amtool silence add --duration 2d --comment 'Known alert' 'alertname=~foobar.*'

Fire a test alert via CLI

You can also use amtool to fire alerts. This is a blunt tool, to be used with caution! Make sure to always indicate somewhere that the alert is a test and not a real issue!

 amtool alert add TestAlert team=sre severity=warning job=testjob key=value --annotation=runbook=lol --annotation=description='this is a test, please ignore' --annotation=dashboard=no

The alert will auto-resolve after ~20 minutes; if you want to keep it firing, repeat the call above before the timeout expires.

Continuous Integration (CI)

CI for the alerts repo uses tox and can be run manually; you will first need to install pint and promtool. See the README.md file in the alerts repo for full instructions.