This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.

Monitoring for Cloud VPS metal infrastructure

We have our own instance in the wikiprod Prometheus setup. At the time of writing (October 2023) it runs only in eqiad, but that might change. It is configured via the profile::prometheus::cloud Puppet profile.

To query it, use https://thanos.wikimedia.org or https://prometheus-eqiad.wikimedia.org/cloud/. To craft dashboards, use the Grafana instance at https://grafana.wikimedia.org.
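For scripted queries, the following is a minimal sketch of how a query URL could be built. It assumes the standard Prometheus HTTP API path (api/v1/query) is exposed under the /cloud/ prefix mentioned above, which this page does not confirm; the function name is hypothetical.

```python
from urllib.parse import urlencode, urljoin

# Base URL taken from the text above. The api/v1/query path is the standard
# Prometheus HTTP API endpoint and is ASSUMED to be reachable under this prefix.
CLOUD_PROMETHEUS = "https://prometheus-eqiad.wikimedia.org/cloud/"


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus instant query against base_url."""
    # base_url must end with "/" so that urljoin keeps the /cloud/ prefix.
    endpoint = urljoin(base_url, "api/v1/query")
    query_string = urlencode({"query": promql})
    return endpoint + "?" + query_string


if __name__ == "__main__":
    # `up` is recorded by every Prometheus server for each of its targets.
    print(instant_query_url(CLOUD_PROMETHEUS, "up"))
```

The resulting URL can be fetched with any HTTP client; the response is JSON with the matching series under data.result.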

metricsinfra: Monitoring services for Cloud VPS

The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS. Technical documentation for the setup is at Nova Resource:Metricsinfra/Documentation.

Metricsinfra Prometheus

The metricsinfra Prometheus server scrapes base instance-level metrics from ALL Puppetized Cloud VPS instances.

Metricsinfra Prometheus CAN be used for:

  • Base instance-level metrics
    • example: node-exporter
  • Small project-specific services that have a low metric count and cardinality. Ask Taavi if unsure.

Metricsinfra Prometheus MUST NOT be used for:

  • Project-specific services that require complex configuration, or have a large metric count or cardinality that requires a large amount of storage or compute resources to process
    • Deploy a project-specific Prometheus instance instead, and hook it up to the Metricsinfra Alertmanager and Grafana services.
  • Metrics that contain private information

Managing scrape targets

Managing alert rules

The monitoring configuration is mostly kept in a Trove database. There is no user-friendly management interface yet, so for now you can ssh to metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and use sudo -i mariadb to edit the database by hand. (Or ask Taavi to do it for you!)

Rules are defined in the alerts table. You can add a new alert with a query like the following one:

MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (NULL, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', '1m', 'warning', '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');

The new alert should appear at https://prometheus.wmcloud.org/alerts after a few minutes.
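Hand-writing the JSON annotations inside a SQL string is easy to get wrong. As a hedged sketch, the row could instead be assembled in Python and passed to a database driver as parameters. The argument names below are guesses at what each column means (only the order of the seven values is taken from the example query), and the %s placeholder style is the one used by drivers such as PyMySQL.

```python
import json

# Seven positional placeholders, mirroring the seven positional values in the
# example INSERT above. Adjust the placeholder style for your driver.
INSERT_ALERT_SQL = "INSERT INTO alerts VALUES (%s, %s, %s, %s, %s, %s, %s)"


def alert_row(project_id, name, expr, duration, severity, summary):
    """Build the parameter tuple for one alert rule.

    The argument names are ASSUMPTIONS about the column meanings; the table
    schema is not documented on this page.
    """
    # json.dumps takes care of quoting and escaping inside the annotations.
    annotations = json.dumps({"summary": summary})
    # None is sent as SQL NULL, like the first value in the example query.
    return (None, project_id, name, expr, duration, severity, annotations)


if __name__ == "__main__":
    row = alert_row(
        12,
        "ToolsDBReplicationLagIsTooHigh",
        'mysql_slave_status_seconds_behind_master{project="tools"} > 3600',
        "1m",
        "warning",
        "ToolsDB replication on {{ $labels.instance }} is lagging behind "
        "the primary, the current lag is {{ $value }}",
    )
    print(INSERT_ALERT_SQL)
    print(row)
```

With a DB-API driver, the statement would then be run as cursor.execute(INSERT_ALERT_SQL, row) followed by a commit.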

Note that these alerts cannot query metrics that are not stored in the metricsinfra Prometheus instance, which most notably excludes various Toolforge components. Other Prometheus instances may, however, have separate mechanisms for configuring alert rules.

Metricsinfra Alertmanager

The metricsinfra project has an Alertmanager instance that will send out alerts via IRC, email or VictorOps. In addition to the metricsinfra Prometheus instance, other Prometheus instances in WMCS-managed projects can use this instance to send out alerts.

Silencing alerts

Project viewers and members can use prometheus-alerts.wmcloud.org to create and edit silences for the projects they are in. (Toolforge is an exception to this general rule: creating and editing silences for the tools project is restricted to maintainers of the "admin" tool.) In addition, members of the "admin" and "metricsinfra" projects can manage silences for any project.

Alternatively, to silence existing or expected (downtime) notifications, you can use the `amtool` command on any metricsinfra Alertmanager server (for example, currently metricsinfra-alertmanager-1.metricsinfra.eqiad1.wikimedia.cloud). For example, to silence all Toolsbeta alerts you could use:

metricsinfra-alertmanager-1:~$ amtool silence add project=toolsbeta -c "per T123456" -d 30d
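For automation, the same silence can be expressed as a request body for the Alertmanager v2 HTTP API (POST /api/v2/silences). The sketch below only builds that body. Whether and where the API is reachable in this setup is not stated on this page, and the function name and the "example-user" author are placeholders.

```python
from datetime import datetime, timedelta, timezone


def silence_payload(project, comment, author, days, now=None):
    """Build a JSON-serialisable silence body for the Alertmanager v2 API."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(days=days)
    return {
        # One equality matcher, equivalent to `project=toolsbeta` in amtool.
        "matchers": [
            {"name": "project", "value": project, "isRegex": False, "isEqual": True}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


if __name__ == "__main__":
    # "example-user" is a placeholder author, not a real account.
    print(silence_payload("toolsbeta", "per T123456", "example-user", 30))
```

The 30-day duration mirrors the -d 30d flag in the amtool example.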

Managing notification groups

Managing ACLs

Managing access for project-specific Prometheus instances

I repeat: only WMCS-managed projects can use this method. WMCS-managed means projects that are considered part of Cloud VPS infrastructure, Toolforge or Data Services (the three Cloud Services product categories), and where everyone with access is required to comply with Help:Access policies. The reason for this is that any project with this level of metricsinfra access has the ability to send pages to the WMCS team.

Change the profile::wmcs::metricsinfra::alertmanager::project_proxy::trusted_hosts Hiera key (managed via Horizon on the metricsinfra project) to include the per-project Prometheus servers that should be allowed. Unfortunately, this is currently host-level authentication only, with no secrets involved.

Then, in the Prometheus server config, use something like this:

  - openstack_sd_configs:
    - role: instance
      region: eqiad1-r
      identity_endpoint: https://openstack.eqiad1.wikimediacloud.org:25000/v3
      username: novaobserver
      domain_name: default
      project_name: metricsinfra
      all_tenants: false
      refresh_interval: 5m
      port: 8643
    relabel_configs:
    - source_labels:
      - __meta_openstack_instance_name
      action: keep
      regex: metricsinfra-alertmanager-\d+
    - source_labels:
      - __meta_openstack_instance_name
      target_label: instance
    - source_labels:
      - __meta_openstack_instance_status
      action: keep
      regex: ACTIVE
  - target_label: source
    replacement: prometheus
    action: replace
  - target_label: project
    action: replace
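The two keep rules in the configuration above decide which discovered instances are used as Alertmanager targets. Prometheus anchors relabel regexes at both ends, so the name pattern must match the whole instance name. The sketch below mimics that filtering in Python for illustration only; the sample instance names in the usage section are made up.

```python
import re

# Same pattern as the `regex` of the first keep rule.
KEEP_NAME = re.compile(r"metricsinfra-alertmanager-\d+")


def kept(instance_name: str, status: str) -> bool:
    """Return True if an instance would survive both `keep` rules."""
    # fullmatch() reproduces the implicit ^...$ anchoring of relabel regexes.
    name_ok = KEEP_NAME.fullmatch(instance_name) is not None
    return name_ok and status == "ACTIVE"


if __name__ == "__main__":
    # Hypothetical discovery results: (instance name, instance status).
    samples = [
        ("metricsinfra-alertmanager-1", "ACTIVE"),
        ("metricsinfra-alertmanager-1-old", "ACTIVE"),
        ("metricsinfra-alertmanager-2", "SHUTOFF"),
        ("metricsinfra-controller-2", "ACTIVE"),
    ]
    for name, status in samples:
        print(name, status, kept(name, status))
```

Only the first sample passes both rules: the second fails the anchored name match, the third is not ACTIVE, and the fourth is a different kind of instance.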

Metricsinfra Grafana

The Metricsinfra Grafana instance is used to draw dashboards from Prometheus data. Like the metricsinfra Alertmanager instance, it can be used with per-project Prometheus servers in addition to the metricsinfra Prometheus server.

Managing data sources

Data sources are managed via modules/profile/files/wmcs/metricsinfra/grafana/datasources.yaml in the Puppet repository.

Monitoring for Toolforge

In addition to the Metricsinfra setup, Toolforge has its own Prometheus server for Kubernetes metrics. It can be queried via https://prometheus.svc.toolforge.org/tools/, and uses the metricsinfra Grafana and Alertmanager instances. Alerts are configured via https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts. The toolsbeta equivalent can be queried via https://prometheus.svc.beta.toolforge.org/tools/.

Dashboards and handy links

If you want an overview of what is going on in the Cloud VPS infrastructure, open these links:

Datacenter | What | Mechanism | Comments | Link
eqiad | NFS servers | icinga | labstore1xxx servers | [1]
eqiad | NFS Server Statistics | grafana | labstore and cloudstore NFS operations, connections and various details | [2]
eqiad | Cloud VPS main services | icinga | service servers, non virts | [3]
codfw | Cloud VPS labtest servers | icinga | all physical servers | [4]
eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [5]
eqiad | ToolsDB (Toolforge R/W MariaDB) | grafana | Database metrics for ToolsDB servers | [6]
eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [7]
any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [8]
eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [9]
eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [10]
eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [11]
eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [12]
eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [13]
eqiad | Cloud VPS memcache | grafana | cloudservices servers | [14]
eqiad | openstack database backend (per host) | grafana | mariadb/galera on cloudcontrols | [15]
eqiad | openstack database backend (aggregated) | grafana | mariadb/galera on cloudcontrols | [16]
eqiad | Toolforge | grafana | Arturo's metrics | [17]
eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [18]
eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [19]
eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [20]
eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [21]
eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [22]
eqiad | Toolforge email server | prometheus/grafana | dashboard showing data about Toolforge exim email server | [23]

See also