Nova Resource:Metricsinfra/Documentation

The metricsinfra Cloud VPS project contains Prometheus-based monitoring tooling that can be used on any VPS project.

User guide

See Portal:Cloud VPS/Admin/Monitoring#Monitoring for Cloud VPS

Deployment

The OpenStack resources are deployed via Terraform, and the instances are configured via Puppet. The Terraform repository is on GitLab at repos/cloud/metricsinfra/tofu-provisioning.

Please DO NOT change the OpenStack configuration (for example VMs or web proxies) manually! Manual changes leave Terraform out of sync with the deployed resources, which is annoying to fix.

Components

List of all instances and other project details

Prometheus

Prometheus scrapes metrics and stores them on local disk. As of February 2023, storing data for all of Cloud VPS with a retention period of 720 hours (30 days) consumes about 80G of disk space. Each scrape target (instance, service, etc.) is scraped by two Prometheus nodes to keep short-term on-disk data redundant. In the future, long-term metrics would ideally be stored in Swift or other object storage and retrieved using Thanos.
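The figures above allow a rough back-of-the-envelope estimate of disk use for other retention periods. The sketch below assumes that disk use scales linearly with retention, which is a simplification; the helper name is made up for illustration.

```python
# Figures quoted above (February 2023).
MEASURED_DISK_GB = 80.0
MEASURED_RETENTION_HOURS = 720  # 30 days


def estimate_disk_gb(retention_hours: float) -> float:
    """Estimate disk use for a retention period, assuming linear scaling.

    Hypothetical helper: real usage also depends on series churn and
    compaction, so treat the result as a rough guide only.
    """
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    return MEASURED_DISK_GB * retention_hours / MEASURED_RETENTION_HOURS


if __name__ == "__main__":
    for days in (30, 90, 365):
        print(f"{days} days: ~{estimate_disk_gb(days * 24):.0f} GB")
```

Under that assumption, tripling the retention to 90 days would need roughly three times the disk space.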

Alert manager

Alerts are sent out with Prometheus Alertmanager, which currently supports email and IRC alerts. At the time of writing, only one Alertmanager instance is running (metricsinfra-alertmanager-1), but the puppetization makes adding a second, redundant one fairly easy. Some components (notably the IRC relay) will need manual failover, but the dashboard and most alert sending should fail over automatically if one node fails. There is a Karma dashboard at prometheus-alerts.wmcloud.org, which generally lets project members silence alerts for their project.
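For illustration, a silence scoped to a single project could look roughly like the payload built below, which follows the silence shape of Alertmanager's v2 HTTP API. The `project` label name, the alert name, and the helper itself are assumptions made for this sketch; the labels actually used in metricsinfra may differ, and nothing here is sent over the network.

```python
from datetime import datetime, timedelta, timezone


def build_silence(project, alertname, hours, created_by, comment, now=None):
    """Build a silence payload in the shape of Alertmanager's v2 API.

    Hypothetical helper: the "project" label is an assumed convention.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "project", "value": project, "isRegex": False, "isEqual": True},
            {"name": "alertname", "value": alertname, "isRegex": False, "isEqual": True},
        ],
        # Timezone-aware isoformat() output is RFC 3339 compatible.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    silence = build_silence(
        "exampleproject", "InstanceDown", 2, "example-user", "planned maintenance"
    )
    print(silence)
```

Matching on a project label is what would keep a silence limited to one project's alerts.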

prometheus-configurator

prometheus-configurator (client) and prometheus-manager (backend) dynamically generate configuration files for software in the metricsinfra stack. They are backed by a Trove database. A user interface is planned but has not been created yet.

Thanos

Thanos is used to provide High Availability between the two Prometheus instances.
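The idea behind querying two redundant Prometheus nodes can be sketched as merging two copies of the same series, so that a gap on one replica is filled from the other. This is a deliberately simplified illustration and not Thanos's actual deduplication algorithm; the function and the sample data are made up.

```python
def merge_replicas(primary, secondary):
    """Merge two {timestamp: value} series, preferring the primary replica.

    Samples missing from the primary are filled in from the secondary.
    """
    merged = dict(secondary)
    merged.update(primary)  # primary wins where both have a sample
    return sorted(merged.items())


if __name__ == "__main__":
    replica_a = {0: 1.0, 30: 2.0}             # missed the scrape at t=60
    replica_b = {30: 2.0, 60: 3.0, 90: 4.0}   # missed the scrape at t=0
    print(merge_replicas(replica_a, replica_b))
```

As long as both nodes are not down at the same moment, the merged series has no gaps.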

Work to do

Goals

Tracked in Phabricator
Task T266050

Taavi's long term end goals:
- scrape and store basic metrics from all Cloud VPS instances in all projects (done) and have sensible default alert rules
- allow any Cloud VPS project administrator to set arbitrary prometheus scrape targets and alerting rules for their project in a self-service fashion
Metrics- and monitoring-related work that we likely want to pursue in the future, but that is out of metricsinfra scope for now:
- Monitoring and alerting for individual Toolforge tools
- Log aggregation and search for anyone

Prometheus configuration tooling

Tracked in Phabricator
Task T284993

Hopefully in the future Cloud VPS project administrators can self-manage Prometheus targets and alert rules for their project. As of August 2021 the configuration is created using two Python apps, prometheus-configurator (client) and prometheus-manager (backend), that use data stored in a Trove database to generate configuration files for Prometheus and Alertmanager.
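The general approach of turning database rows into Prometheus configuration can be sketched as follows. This is not the actual prometheus-configurator code: the row layout, project names, and host names are hypothetical, and only the `scrape_configs`, `job_name`, `static_configs`, `targets`, and `labels` keys come from Prometheus's configuration format. The output is printed as JSON to keep the sketch free of third-party dependencies.

```python
import json

# Hypothetical rows, standing in for whatever the Trove database holds.
TARGET_ROWS = [
    {"project": "exampleproject", "job": "node", "host": "web-2.example.invalid", "port": 9100},
    {"project": "exampleproject", "job": "node", "host": "web-1.example.invalid", "port": 9100},
    {"project": "otherproject", "job": "node", "host": "db-1.example.invalid", "port": 9100},
]


def build_scrape_configs(rows):
    """Group rows by (project, job) into Prometheus scrape_configs entries."""
    grouped = {}
    for row in rows:
        key = (row["project"], row["job"])
        grouped.setdefault(key, []).append(f'{row["host"]}:{row["port"]}')
    return [
        {
            "job_name": f"{project}_{job}",
            "static_configs": [
                {"targets": sorted(targets), "labels": {"project": project}}
            ],
        }
        for (project, job), targets in sorted(grouped.items())
    ]


def render_config(rows):
    """Serialise the generated configuration as indented JSON."""
    return json.dumps({"scrape_configs": build_scrape_configs(rows)}, indent=2)


if __name__ == "__main__":
    print(render_config(TARGET_ROWS))
```

Sorting the groups and targets keeps the generated file stable between runs, so it only changes when the underlying data does.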

TODO: API and UI to manage config
TODO (long-term): Allow managing config via puppet manifests on target instances

Data gathering

TODO (long-term): set up Prometheus push gateway for individual Cloud VPS tenants to use

Alerting

TODO: custom webhooks

Scaling up

We monitor the basic metrics from all VMs.

TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)