The OpenStack resources are deployed via Terraform, and the instances are configured via Puppet.
Prometheus scrapes metrics and stores them on local disk. As of February 2023, storing data for all of Cloud VPS with a retention period of 720 hours (30 days) consumes about 80G of disk space. Each scrape target (instance, service, etc.) is scraped by two Prometheus nodes to keep short-term on-disk data redundant. In the future, long-term metrics would ideally be stored in Swift or other object storage and retrieved using Thanos.
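As a rough illustration of the figures above, the following Python sketch scales the observed disk usage to other retention periods. It assumes that usage grows linearly with retention and that the ingestion rate stays constant, neither of which is guaranteed in practice; the function name and its parameters are hypothetical and not part of any metricsinfra tooling.

```python
def estimate_disk_usage(retention_hours, observed_usage=80.0, observed_retention_hours=720.0):
    """Scale an observed disk usage figure to a different retention period.

    Assumption: usage is proportional to retention (constant ingestion rate).
    The default values are the February 2023 figures quoted above
    (about 80G for 720 hours); the unit is whatever the observed figure uses.
    """
    if retention_hours < 0:
        raise ValueError("retention_hours must not be negative")
    return observed_usage * retention_hours / observed_retention_hours


if __name__ == "__main__":
    # Approximate usage per day of retained data, and for a 90-day retention.
    print(f"per day: {estimate_disk_usage(24):.2f}G")
    print(f"90 days: {estimate_disk_usage(90 * 24):.0f}G")
```

Such an estimate is only a starting point for capacity planning, since adding projects or scrape targets changes the ingestion rate.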
Alerts are sent out with Prometheus Alertmanager. As of writing, it supports email and IRC alerts, and only one Alertmanager instance is running (metricsinfra-alertmanager-1), but the Puppetization makes adding a second, redundant one fairly easy. Some components (notably the IRC relay) will need manual failover, but the dashboard and most alert sending should fail over automatically if one node fails for whatever reason. A Karma dashboard at prometheus-alerts.wmcloud.org generally lets project members silence alerts for their own project.
prometheus-configurator (client) and prometheus-manager (backend) deal with dynamically generating configuration files for software in the metricsinfra stack. They are backed by a Trove database. A user interface is planned but has not been created yet.
Thanos is used to provide High Availability between the two Prometheus instances.
Work to do
- Taavi's long term end goals:
- scrape and store basic metrics from all Cloud VPS instances in all projects (done) and have sensible default alert rules
- allow any Cloud VPS project administrator to set arbitrary prometheus scrape targets and alerting rules for their project in a self-service fashion
- Metrics- and monitoring-related work that we likely want to pursue in the future but that is out of metricsinfra scope for now:
- Monitoring and alerting for individual Toolforge tools
- Log aggregation and search for anyone
Prometheus configuration tooling
Hopefully, in the future, Cloud VPS project administrators will be able to self-manage Prometheus targets and alert rules for their project. As of August 2021, the configuration is created using two Python apps, prometheus-configurator (client) and prometheus-manager (backend), that use data stored in a Trove database to generate configuration files for Prometheus and Alertmanager.
- TODO: API and UI to manage config
- TODO (long-term): Allow managing config via puppet manifests on target instances
- TODO (long-term): Set up a Prometheus Pushgateway for individual Cloud VPS tenants to use
- TODO: custom webhooks
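To illustrate the general idea of generating Prometheus configuration from database rows, here is a minimal Python sketch. It is not the actual prometheus-configurator or prometheus-manager code: the row layout, host names, and function names are all hypothetical. It turns rows into target groups in the JSON format that Prometheus accepts for file-based service discovery (a list of objects with "targets" and "labels").

```python
import json
from collections import defaultdict

# Hypothetical rows standing in for the result of a database query.
# The real schema used by prometheus-manager may look quite different.
ROWS = [
    {"project": "tools", "instance": "tools-example-2.example.org", "port": 9100},
    {"project": "metricsinfra", "instance": "metricsinfra-example-1.example.org", "port": 9100},
    {"project": "tools", "instance": "tools-example-1.example.org", "port": 9100},
]


def build_target_groups(rows):
    """Group scrape targets by project into file_sd-style target groups."""
    by_project = defaultdict(list)
    for row in rows:
        by_project[row["project"]].append(f'{row["instance"]}:{row["port"]}')
    # Sort projects and targets so the generated file is stable between runs.
    return [
        {"targets": sorted(targets), "labels": {"project": project}}
        for project, targets in sorted(by_project.items())
    ]


def render_file_sd(rows):
    """Serialize the target groups as JSON, ready to be written to a file."""
    return json.dumps(build_target_groups(rows), indent=2)


if __name__ == "__main__":
    print(render_file_sd(ROWS))
```

Keeping the output sorted means an unchanged database produces a byte-identical file, which makes it easy to skip needless reloads. Whether the real tooling works this way is an assumption.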
We monitor the basic metrics from all VMs.
- TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
Server admin log
- 18:53 taavi: no longer send quarry alerts to cloud services team
- 14:09 taavi: reboot metricsinfra-alertmanager-1 to see if it stops flapping a puppet alert
- 08:24 wm-bot2: dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
- 08:17 wm-bot2: dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
- 08:17 wm-bot2: dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
- 08:16 wm-bot2: dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm...