User:Filippo Giunchedi/Prometheus POC

This page provides an overview of the Prometheus POC that has been running in labs.

prometheus.wmflabs.org

As part of https://phabricator.wikimedia.org/T92813 there is a Prometheus proof of concept, running only on labs resources at https://prometheus.wmflabs.org, used to test Prometheus deployment scenarios.

A "global" server running at https://prometheus.wmflabs.org scrapes several targets for metrics, for example for mysql monitoring. Another use case is aggregating data from several per-project Prometheus servers, e.g. https://swift-prometheus.wmflabs.org.

machine-level monitoring

The machine-level metrics on each monitored labs instance are provided by prometheus-node-exporter, which takes care of exporting vital machine metrics, e.g. disk/mem/cpu/network. The current metrics can be fetched via HTTP in the same way the Prometheus server scrapes them, e.g. to get the 1min load of test-prometheus2:

test-prometheus3:~$ curl test-prometheus2:9100/metrics -s | grep ^node_load
node_load1 0.01

Once a metric has been collected from a target it can be plotted over time, for example the 1min load average for the monitoring project.
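
The same data can also be queried programmatically via Prometheus' HTTP API. A minimal sketch of an instant query, assuming the per-project server listens on the default port 9090 (the prometheus-server hostname and the returned value are illustrative):

test-prometheus3:~$ curl -s 'prometheus-server:9090/api/v1/query?query=node_load1'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"node_load1","instance":"test-prometheus2:9100"},"value":[1466121600,"0.01"]}]}}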

federation

Each per-project Prometheus server polls only its own project's instances for metrics, every 15s (scrape_interval in the config). The resulting metrics are aggregated under the project: prefix (according to rule_files in the config) and are in turn polled by the "global" Prometheus server every 60s (federate in the config). Also note that the list of project-local instances is auto-discovered: each per-project server queries the wikitech API and generates the respective config file, and Prometheus reloads the list of targets upon change. A sample dashboard aggregating all projects can be found at https://prometheus.wmflabs.org/grafana/dashboard/db/projects
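
The federation scrape itself boils down to an HTTP request against each per-project server's /federate endpoint. A minimal sketch, assuming the per-project server listens on the default port 9090; the match[] selector, aggregated metric name and sample value are illustrative:

prometheus-server:~$ curl -sG 'swift-prometheus:9090/federate' --data-urlencode 'match[]={__name__=~"project:.+"}'
project:node_load1:avg{project="swift"} 0.05 1466121600000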

mysqld monitoring

The global server also scrapes exporters other than node-exporter; for example, mysqld-exporter has been set up as part of https://phabricator.wikimedia.org/T126757 to read mysql metrics from labsdb machines, with a sample grafana dashboard.
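
As with node-exporter, the mysqld-exporter metrics can be fetched directly over HTTP. A quick check, assuming the exporter runs on its default port 9104 (the labsdb hostname is illustrative):

prometheus-server:~$ curl -s labsdb1001:9104/metrics | grep ^mysql_up
mysql_up 1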

blackbox monitoring

The third proof-of-concept exporter is blackbox-exporter, used to probe endpoints via TCP/HTTP(S)/ICMP from the machine it is running on. In this case there is only one exporter, running at test-prometheus2:9115, which can be asked for Prometheus metrics via HTTP on its /probe endpoint. For example, the HTTP request below is similar to what Prometheus does while probing the blackbox for en.wikipedia.org; in this case probe_success is determined by receiving an HTTP 200 (after redirects).

prometheus-server:~$ curl 'test-prometheus2:9115/probe?target=en.wikipedia.org:80&module=http_2xx'
probe_ssl_earliest_cert_expiry 1481409964.000000
probe_http_status_code 200
probe_http_content_length -1
probe_http_redirects 2
probe_http_ssl 1
probe_duration_seconds 0.063040
probe_success 1

Also note that the SSL certificate expiry is reported since this probe goes over HTTPS, so it is possible to enumerate which certificates are going to expire as seen by the blackbox. There's a sample dashboard at https://prometheus.wmflabs.org/grafana/dashboard/db/http-s-tcp-probes with an overview of HTTPS and TCP probes.
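
For example, a PromQL expression can enumerate probes whose certificate expires within 30 days, by subtracting the evaluation time from the expiry timestamp (the 30-day threshold and the localhost:9090 address are illustrative):

prometheus-server:~$ curl -sG localhost:9090/api/v1/query --data-urlencode 'query=(probe_ssl_earliest_cert_expiry - time()) / 86400 < 30'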

In a bigger deployment, blackbox exporters can be located on any machine in a given network failure domain (e.g. per-site, per-vlan) and probe external or internal endpoints.
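
A TCP probe against an internal endpoint would look much like the HTTP one above; the tcp_connect module name follows the blackbox exporter's example configuration and is an assumption here, as is the target:

prometheus-server:~$ curl 'test-prometheus2:9115/probe?target=test-prometheus3:22&module=tcp_connect'
probe_duration_seconds 0.002144
probe_success 1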