This page contains administrator information about the Toolforge Prometheus setup and how to manage it.
There should be a couple of VMs with the puppet role
The VM instances should be big enough to hold all the metrics we collect (usually more than 100GB of data).
All the configuration (and metrics data) is stored in /srv/prometheus/tools/.
Among other things, this sets up a systemd service, which is the main thing running on these servers:
aborrero@tools-prometheus-03:~$ sudo systemctl status prometheus@tools.service
● prometheus@tools.service - prometheus server (instance tools)
   Loaded: loaded (/[..]; static; vendor preset: enabled)
   Active: active (running) since Thu 2020-02-13 18:15:02 UTC; 15h ago
 Main PID: 1517 (prometheus)
    Tasks: 21 (limit: 4915)
   Memory: 9.4G
   CGroup: /[..]
           └─1517 /usr/bin/prometheus --storage.tsdb.path /srv/prometheus/tools/metrics --web.listen-addr [..]
Query targets are defined in the profile::toolforge::prometheus Puppet profile, as a very long inline YAML block.
NOTE: there is no relationship between these Prometheus servers and the cloudmetrics systems. They collect different metrics and use a different Grafana instance to display them.
The HA approach is active/cold-standby. Both VMs collect exactly the same metrics, so there is no need for any specific sync between the VMs for data redundancy.
There is a web proxy named tools-prometheus.wmflabs.org, created using Horizon, that points to the active VM. This proxy can be used to inspect the status of Prometheus. Useful links:
Note that our Grafana setup also uses this URL to reach this Prometheus instance as a data source.
Some global hiera keys are needed for Prometheus to be able to query for metrics:
prometheus_nodes:
  - tools-prometheus-03.tools.eqiad1.wikimedia.cloud
  - tools-prometheus-04.tools.eqiad1.wikimedia.cloud
This key is what ferm and other ACL mechanisms use to allow access from the Prometheus servers.
NOTE: because of how the ferm setup works, if a VM is recreated with the same FQDN you need to restart ferm manually so that it picks up the new IP address.
The Toolforge Prometheus hosts automatically provision their alert rules from the cloud/toolforge/alerts GitLab repository.
Since all Prometheus VMs collect all the metrics, there is no need to do any specific sync before a failover. The data should already be there on the backup/standby server.
- switch the proxy URL to point to the standby server, using Horizon.
- check that the proxy URL https://tools-prometheus.wmflabs.org/tools/targets works when pointing to the new VM.
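The check in the last step can also be scripted. The following is a minimal sketch, not part of the documented procedure: it assumes the Prometheus HTTP API is exposed under the same /tools prefix as the targets page, and the function names are made up for illustration. It counts active targets by health and lists the ones that are not up.

```python
import json
import urllib.request

# Assumption: the HTTP API lives under the same /tools prefix as the
# targets page referenced in the failover steps.
API_BASE = "https://tools-prometheus.wmflabs.org/tools"


def fetch_targets(api_base=API_BASE, timeout=10):
    """Fetch the active scrape targets from the Prometheus HTTP API."""
    url = f"{api_base}/api/v1/targets?state=active"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_targets(payload):
    """Count active targets by health and collect the ones that are not up."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    counts = {}
    unhealthy = []
    for target in payload["data"]["activeTargets"]:
        health = target.get("health", "unknown")
        counts[health] = counts.get(health, 0) + 1
        if health != "up":
            unhealthy.append(target.get("scrapeUrl", "<unknown>"))
    return counts, unhealthy
```

Calling summarize_targets(fetch_targets()) after switching the proxy gives a quick view of whether the new active VM is scraping its targets. A standby that has been collecting all along should show roughly the same picture as the previous active VM.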
In case you want to migrate metrics data, for example from an old VM to a new one, there are a couple of caveats:
- if you scp or rsync the data while Prometheus is still collecting it, the resulting copy on the destination VM will be inconsistent and Prometheus will refuse to start.
- to avoid consistency problems, the recommendation is to shut down the prometheus@tools systemd service on both the source and destination VMs while doing the manual data synchronization.
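The ordering implied by these caveats can be sketched as a small plan generator. This is an illustration under assumptions, not a tested runbook: it assumes the metrics live in /srv/prometheus/tools/metrics/ (the --storage.tsdb.path shown in the service status), that rsync runs on the source VM, and that the source can reach the destination over SSH with enough privileges to write there. The function name and the choice of rsync options are hypothetical.

```python
import shlex

UNIT = "prometheus@tools"                      # systemd service named above
DATA_DIR = "/srv/prometheus/tools/metrics/"    # assumed from --storage.tsdb.path


def migration_plan(source, destination, unit=UNIT, data_dir=DATA_DIR):
    """Return ordered (host, command) pairs for an offline metrics copy.

    Assumption: rsync runs on the source VM and can reach the destination
    over SSH as a user allowed to write into data_dir.
    """
    path = shlex.quote(data_dir)
    remote = f"{shlex.quote(destination)}:{path}"
    return [
        # 1. Stop Prometheus on both VMs so the on-disk data stops changing.
        (source, f"sudo systemctl stop {unit}"),
        (destination, f"sudo systemctl stop {unit}"),
        # 2. Copy the data; --delete makes the destination an exact mirror.
        (source, f"sudo rsync -a --delete {path} {remote}"),
        # 3. Start Prometheus again (skip the source if it is being retired).
        (destination, f"sudo systemctl start {unit}"),
        (source, f"sudo systemctl start {unit}"),
    ]


if __name__ == "__main__":
    plan = migration_plan(
        "tools-prometheus-03.tools.eqiad1.wikimedia.cloud",
        "tools-prometheus-04.tools.eqiad1.wikimedia.cloud",
    )
    for host, command in plan:
        print(f"[{host}] {command}")
```

The sketch only prints the commands so that they can be reviewed and run by hand. The important property is the order: both services are stopped before the copy starts, and neither is started again until it has finished.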
- Prometheus upstream docs on target configuration: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- Grafana instance to use: https://grafana.wmcloud.org