Portal:Toolforge/Admin/Prometheus

This page contains administrator information about the Toolforge Prometheus setup and how to manage it.

Setup

There should be a couple of VMs with the puppet role role::wmcs::toolforge::prometheus. The VM instances should be big enough to hold all the metrics we collect (usually more than 100 GB of data).

All the configuration (and metrics data) is stored in /srv/prometheus/tools/.
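
For example, to check how much disk space the collected metrics currently use (this is the same path the service stores its TSDB under, see below):

$ sudo du -sh /srv/prometheus/tools/metrics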

Among other things, this sets up a systemd service which is the main thing running on these servers:

aborrero@tools-prometheus-03:~$ sudo systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static; vendor preset: enabled)
   Active: active (running) since Thu 2020-02-13 18:15:02 UTC; 15h ago
 Main PID: 1517 (prometheus)
    Tasks: 21 (limit: 4915)
   Memory: 9.4G
   CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
           └─1517 /usr/bin/prometheus --storage.tsdb.path /srv/prometheus/tools/metrics --web.listen-addr
[..]
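
The logs of this service can be inspected with journalctl, for example:

$ sudo journalctl -u prometheus@tools.service --since today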

Query targets are defined in the profile::toolforge::prometheus puppet profile, in a very long inline YAML data structure.
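
The generated configuration follows the standard Prometheus scrape_configs format. A minimal sketch of what one such target definition looks like (the job name and target host below are illustrative, not the actual values from the profile):

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - tools-example-01.tools.eqiad1.wikimedia.cloud:9100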

NOTE: there is no relationship between these Prometheus servers and the cloudmetrics systems. They collect different metrics and use different Grafana instances to display them.

Redundancy

The HA approach is active/cold-standby. Both VMs collect exactly the same metrics, so there is no need for any specific sync between the VMs for data redundancy.

There is a web proxy with the name tools-prometheus.wmflabs.org, created using Horizon, pointing to the active VM. This proxy can be used to inspect the status of Prometheus.

Worth noting that this URL is also what our Grafana setup uses to access this Prometheus instance as a data source.
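
The standard Prometheus HTTP API is also reachable through that proxy. A quick sketch, assuming the tools instance is served under the /tools path prefix:

$ curl -s 'https://tools-prometheus.wmflabs.org/tools/api/v1/query?query=up'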

Hiera

Some global hiera keys are needed for Prometheus to be able to query targets for metrics:

prometheus_nodes:
- tools-prometheus-03.tools.eqiad1.wikimedia.cloud
- tools-prometheus-04.tools.eqiad1.wikimedia.cloud

Basically, this list enables ferm rules and other ACL mechanisms that allow the Prometheus hosts to reach the scrape targets.

NOTE: given how the ferm setup is done, if a VM is recreated with the same FQDN, you need to restart ferm manually on the hosts holding the rules so they pick up the new IP address.
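
A minimal sketch of that manual step, run on each affected host:

$ sudo systemctl restart ferm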

Alerts

The Toolforge Prometheus hosts automatically provision their alert rules from the cloud/toolforge/alerts GitLab repository.
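
The rules in that repository use the standard Prometheus alerting rules format. A purely illustrative example of what such a rule looks like (name, threshold and labels are hypothetical):

groups:
  - name: example
    rules:
      - alert: ToolsInstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: "Instance {{ $labels.instance }} is down"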

Failover

Since all Prometheus VMs collect all the metrics, there is no need to do any specific sync before doing a failover: the data should already be there on the backup/standby server. Failing over is then a matter of pointing the tools-prometheus.wmflabs.org web proxy (see above) at the standby VM using Horizon.
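
Before repointing the proxy, it may be worth verifying that the standby actually holds recent data. A quick sketch, assuming Prometheus listens on its default port 9090 on the standby host (the actual --web.listen-address in this setup may differ):

$ curl -s 'http://localhost:9090/api/v1/query?query=up' | head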

Data migration

In case you want to migrate metrics data, for example from an old VM to a new one, there are a couple of caveats:

  • if you scp or rsync the data while Prometheus is collecting metrics, the resulting data copy on the destination VM will be inconsistent and Prometheus will refuse to start.
  • to avoid consistency problems, the recommendation is to shut down the prometheus@tools systemd service on both the source and destination VMs while doing the manual data synchronization (see the sketch after this list).
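
A minimal sketch of that procedure, using hostnames from the hiera list above as an example (details such as remote permissions are glossed over):

# on both the source and the destination VM
$ sudo systemctl stop prometheus@tools.service

# on the source VM, copy the on-disk TSDB to the new host
$ sudo rsync -a /srv/prometheus/tools/metrics/ tools-prometheus-04.tools.eqiad1.wikimedia.cloud:/srv/prometheus/tools/metrics/

# on both VMs, once the copy has finished
$ sudo systemctl start prometheus@tools.service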

See also