Nova Resource:Metricsinfra/Documentation
The metricsinfra Cloud VPS project contains Prometheus-based monitoring tooling that can be used on any VPS project.
User guide
See Portal:Cloud VPS/Admin/Monitoring#Monitoring for Cloud VPS
Deployment
The OpenStack resources are deployed via Terraform, and the instances are configured via Puppet. The Terraform repository is on GitLab as repos/cloud/metricsinfra/tofu-provisioning.
Components
Prometheus
Prometheus scrapes metrics and stores them on local disk. As of February 2023, storing data for all of Cloud VPS with a retention period of 720 hours (30 days) consumes about 80 GB of disk space. Each scrape target (instance, service, etc.) is scraped by two Prometheus nodes to keep short-term on-disk data redundant. In the future, long-term metrics would ideally be stored in Swift or other object storage and retrieved using Thanos.
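The figures above allow a rough capacity estimate. The following is a minimal sketch, assuming disk usage scales linearly with the retention period (that is, a constant number of series and a constant scrape interval); the function name is hypothetical and the result is only a back-of-the-envelope number.

```python
def estimate_disk_gb(observed_gb: float, observed_hours: float, target_hours: float) -> float:
    """Scale an observed disk usage linearly to a different retention period."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    return observed_gb * target_hours / observed_hours


# Figures from the paragraph above: about 80 GB for 720 hours of retention.
per_day_gb = estimate_disk_gb(80, 720, 24)
one_year_gb = estimate_disk_gb(80, 720, 365 * 24)

print(f"about {per_day_gb:.1f} GB per day of retention")
print(f"about {one_year_gb:.0f} GB for one year of retention")
```

Real usage will drift from this estimate as projects and instances are added or removed, so it is best treated as a lower bound for planning.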
Alertmanager
Alerts are sent out with Prometheus Alertmanager. At the time of writing, email and IRC alerts are supported, and only one Alertmanager instance is running (metricsinfra-alertmanager-1), but the Puppetization makes adding a second, redundant one fairly easy. Some components (notably the IRC relay) would need manual failover, but the dashboard and most alert sending should fail over automatically if one node fails. There is a Karma dashboard at prometheus-alerts.wmcloud.org, which generally lets project members silence alerts for their own project.
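To illustrate what a project-scoped silence looks like, here is a minimal sketch that builds a silence body in the shape used by the Alertmanager v2 API (matchers, startsAt, endsAt, createdBy, comment). It is an assumption that alerts carry a label named project; the project name, alert name, user, and the build_silence helper are all hypothetical, and the sketch only constructs the payload without sending it anywhere.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(project, alertname, hours, created_by, comment, now=None):
    """Build a silence body that matches one alert name within one project."""
    now = now or datetime.now(timezone.utc)
    ends = now + timedelta(hours=hours)
    return {
        "matchers": [
            # Assumption: alerts are labelled with the Cloud VPS project name.
            {"name": "project", "value": project, "isRegex": False},
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": ends.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical values, with a fixed start time so the output is reproducible.
start = datetime(2023, 2, 1, 12, 0, tzinfo=timezone.utc)
silence = build_silence(
    "exampleproject", "InstanceDown", 2, "example-user", "planned maintenance", now=start
)
print(json.dumps(silence, indent=2))
```

In practice, silences are created through the Karma dashboard, so a payload like this is mainly useful for understanding what a silence matches.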
prometheus-configurator
prometheus-configurator (client) and prometheus-manager (backend) dynamically generate configuration files for the software in the metricsinfra stack. They are backed by a Trove database. A user interface is planned but has not been created yet.
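The general idea of turning database rows into configuration can be sketched as follows. This is not the actual prometheus-configurator code: the row layout, host names, and the render_file_sd helper are hypothetical, and it is an assumption that output in the Prometheus file-based service discovery format (a list of target groups with labels) would be a suitable target.

```python
import json
from collections import defaultdict


def render_file_sd(rows):
    """Group (project, instance, port) rows into file_sd target groups."""
    by_project = defaultdict(list)
    for project, instance, port in rows:
        by_project[project].append(f"{instance}:{port}")
    # One target group per project, sorted so the output is stable between runs.
    return [
        {"targets": sorted(targets), "labels": {"project": project}}
        for project, targets in sorted(by_project.items())
    ]


# Hypothetical rows, standing in for the result of a database query.
rows = [
    ("exampleproject", "web-2.exampleproject.example", 9100),
    ("exampleproject", "web-1.exampleproject.example", 9100),
    ("otherproject", "db-1.otherproject.example", 9100),
]
groups = render_file_sd(rows)
print(json.dumps(groups, indent=2))
```

Sorting the groups and targets keeps the generated file identical when the underlying data has not changed, which avoids needless reloads.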
Thanos
Thanos is used to provide High Availability between the two Prometheus instances.
Work to do
Goals
- Taavi's long term end goals:
- scrape and store basic metrics from all Cloud VPS instances in all projects (done) and have sensible default alert rules
- allow any Cloud VPS project administrator to set arbitrary prometheus scrape targets and alerting rules for their project in a self-service fashion
- Metrics- and monitoring-related work that we likely want to pursue in the future, but that is out of scope for metricsinfra for now:
- Monitoring and alerting for individual Toolforge tools
- Log aggregation and search for anyone
Prometheus configuration tooling
Hopefully, Cloud VPS project administrators will in the future be able to self-manage Prometheus targets and alert rules for their project. As of August 2021, the configuration is created using two Python apps, prometheus-configurator (client) and prometheus-manager (backend), which use data stored in a Trove database to generate configuration files for Prometheus and Alertmanager.
- TODO: API and UI to manage config
- TODO (long-term): Allow managing config via Puppet manifests on target instances
Data gathering
- TODO (long-term): set up a Prometheus Pushgateway for individual Cloud VPS tenants to use
Alerting
- TODO: custom webhooks
Scaling up
We monitor the basic metrics from all VMs.
- TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)