Portal:Cloud VPS/Admin/Monitoring
This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.
Deployment
There are 2 physical servers:
- cloudmetrics1003.eqiad.wmnet --- currently master
- cloudmetrics1004.eqiad.wmnet --- currently cold standby
Both servers have the puppet role role::wmcs::monitoring (modules/role/manifests/wmcs/monitoring.pp) applied, which prepares them to collect metrics using a software stack composed of carbon, graphite, Prometheus and friends.
Ideally both servers would collect and serve metrics at the same time as a cluster, but right now only the master actually does so. The cold standby fetches metrics from the master using rsync (/srv/carbon/whisper/), so after a failover we could rebuild the service without losing many metrics.
These bits are located at modules/profile/manifests/wmcs/monitoring.pp.
Grafana
There is a Grafana server at https://grafana.wmcloud.org which queries data from metricsinfra and other Prometheus instances in Cloud VPS. It is hosted in the metricsinfra project too.
Accessing "labs" prometheus
Our monitoring for physical servers is a mix of production Prometheus/Thanos and the Prometheus setup on the cloudmetrics100x servers. These appear in https://grafana.wikimedia.org as "eqiad prometheus/labs". To access the servers directly, in order to troubleshoot what the scrapes are returning and to construct queries more quickly, you can set up an ssh proxy (ssh tunnel) like so:
user@laptop:~$ ssh -NL 9900:127.0.0.1:9900 cloudmetrics1003.eqiad.wmnet
or:
user@laptop:~$ ssh -NL 9900:127.0.0.1:9900 cloudmetrics1004.eqiad.wmnet
Which one to use depends on which server is up and active.
Then point your web browser to http://localhost:9900/labs to bring up the Prometheus web interface. You can then construct and execute PromQL queries as needed, per the upstream docs. Note that a query copied from Grafana will sometimes not work because it contains a Grafana variable. Watch out for anything in a "$name" format, since that is not PromQL.
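As a rough illustration of the Grafana variable problem, the following Python sketch lists the variables in a copied query and replaces them with literal values, so the result can be pasted into the Prometheus web interface. The function names and the example query are hypothetical and not part of any WMCS tooling. It only handles the `$name`, `${name}` and `${name:format}` forms, not the legacy `[[name]]` form.

```python
import re

# Matches Grafana variable references such as $name, ${name} and ${name:format}.
# Illustrative only: the legacy [[name]] syntax is not handled.
GRAFANA_VAR = re.compile(r"\$(?:\{(\w+)(?::\w+)?\}|(\w+))")


def find_grafana_vars(query: str) -> list[str]:
    """Return the distinct variable names referenced in a copied Grafana query."""
    names: list[str] = []
    for braced, bare in GRAFANA_VAR.findall(query):
        name = braced or bare
        if name not in names:
            names.append(name)
    return names


def substitute_grafana_vars(query: str, values: dict[str, str]) -> str:
    """Replace every variable with a literal value; fail loudly if one is missing."""
    def repl(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        if name not in values:
            raise KeyError(f"no value given for Grafana variable {name!r}")
        return values[name]

    return GRAFANA_VAR.sub(repl, query)


if __name__ == "__main__":
    # Hypothetical query as it might be copied from a dashboard panel.
    copied = 'rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[$__rate_interval])'
    print(find_grafana_vars(copied))
    print(substitute_grafana_vars(copied, {"instance": "cloudmetrics.*", "__rate_interval": "5m"}))
```

Running `find_grafana_vars` first is a quick way to see why a pasted query fails to parse. An empty list means the query contains no Grafana variables, so the problem lies elsewhere.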
Metrics Retention
Our metrics retention policy is 90 days. There are two cronjobs for the _graphite user that are running on labmon1001 for this task:
archive-deleted-instances
: Moves data from deleted instances to /srv/carbon/whisper/archived_metrics
delete-old-instance-archives
: Deletes archived data that is older than 90 days
This prevents the /srv partition from becoming full.
The archive-instances script logs operations to /var/log/graphite/instance-archiver.log.
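To make the 90-day rule concrete, here is a minimal Python sketch of the selection step such a cleanup job might perform. It is not the actual cronjob script. The function name is hypothetical, and judging age by modification time of the top-level entries is an assumption. The sketch only lists candidates and deletes nothing.

```python
import time
from pathlib import Path
from typing import Optional

RETENTION_DAYS = 90  # retention policy stated on this page


def expired_archives(archive_root: Path,
                     now: Optional[float] = None,
                     retention_days: int = RETENTION_DAYS) -> list[Path]:
    """List top-level entries whose mtime is older than the retention window.

    Dry run only: nothing is removed. Using mtime as the age criterion is an
    assumption of this sketch, not a description of the real script.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    return sorted(p for p in archive_root.iterdir() if p.stat().st_mtime < cutoff)


if __name__ == "__main__":
    # Path taken from the text above; adjust when trying this elsewhere.
    root = Path("/srv/carbon/whisper/archived_metrics")
    if root.is_dir():
        for path in expired_archives(root):
            print(path)
```

Listing before deleting makes it easy to check what a retention change would affect before anything is removed from /srv.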
Monitoring for Cloud VPS
The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS. Technical documentation for the setup is at Nova Resource:Metricsinfra/Documentation.
Managing alerts
The monitoring configuration is mostly kept in a Trove database. There is no user-friendly management interface yet; for now you can ssh to metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and use sudo -i mariadb to edit the database by hand. (Or ask Taavi to do it for you!)
Rules are defined in the alerts table. You can add a new alert with a query like the following one:
MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (NULL, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', '1m', 'warn', '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');
The new alert should appear at https://prometheus.wmcloud.org/alerts after a few minutes.
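Hand-writing the JSON annotations inside an SQL string is easy to get wrong, so the following Python sketch builds the same values as the example INSERT and leaves quoting to `json.dumps` and to a parameterized statement. Everything here is illustrative. The column order is inferred only from the example above. The name `project_ref` for the second value (12 in the example) is a guess at its meaning. The `%s` placeholder style assumes a client library such as PyMySQL.

```python
import json

# Positional statement mirroring the example above. The %s placeholder style is
# what drivers such as PyMySQL use; which client is available is an assumption.
INSERT_ALERT = "INSERT INTO alerts VALUES (NULL, %s, %s, %s, %s, %s, %s)"


def alert_row(project_ref, name: str, expr: str, duration: str,
              severity: str, summary: str) -> tuple:
    """Build the parameter tuple for INSERT_ALERT.

    The order follows the example row; project_ref is a hypothetical name for
    the second value, whose meaning the page does not state.
    """
    # json.dumps takes care of escaping quotes inside the summary text.
    annotations = json.dumps({"summary": summary})
    return (project_ref, name, expr, duration, severity, annotations)


if __name__ == "__main__":
    row = alert_row(
        12,
        "ToolsDBReplicationLagIsTooHigh",
        'mysql_slave_status_seconds_behind_master{project="tools"} > 3600',
        "1m",
        "warn",
        "ToolsDB replication on {{ $labels.instance }} is lagging behind the "
        "primary, the current lag is {{ $value }}",
    )
    print(INSERT_ALERT)
    print(row)
```

With a real driver, the tuple would be passed as the second argument to `cursor.execute(INSERT_ALERT, row)`, so that no value is spliced into the SQL text by hand.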
Note that these alerts cannot query metrics that are not stored in the metricsinfra Prometheus instance, which most notably includes various Toolforge components. Other Prometheus instances may, however, have separate mechanisms for configuring alert rules.
Managing notifications
Project viewers and members can use prometheus-alerts.wmcloud.org to create and edit silences for the projects they are in. (Toolforge is an exception to this general rule: access to creating and editing silences for the tools project is restricted to maintainers of the "admin" tool.) In addition, members of the "admin" and "metricsinfra" projects can manage silences for any project.
Alternatively, to silence existing or expected (downtime) notifications, you can use the `amtool` command on any metricsinfra alertmanager server (currently, for example, metricsinfra-alertmanager-1.metricsinfra.eqiad1.wikimedia.cloud). For example, to silence all Toolsbeta alerts you could use:
metricsinfra-alertmanager-1:~$ amtool silence add project=toolsbeta -c "per T123456" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
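When silences are created from a script, for example around planned downtime, the command line can be assembled programmatically. The Python sketch below builds the same `amtool silence add` invocation as the example above, using only the `-c` and `-d` flags shown there. The function name is hypothetical, and the sketch only constructs the command; running it on an alertmanager host is left to the caller.

```python
import shlex


def amtool_silence_add(matchers: dict[str, str], comment: str, duration: str) -> list[str]:
    """Build an argv list for `amtool silence add` with label=value matchers."""
    argv = ["amtool", "silence", "add"]
    argv += [f"{label}={value}" for label, value in matchers.items()]
    # -c (comment) and -d (duration) are the flags used in the example above.
    argv += ["-c", comment, "-d", duration]
    return argv


if __name__ == "__main__":
    argv = amtool_silence_add({"project": "toolsbeta"}, "per T123456", "30d")
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(argv))
```

Keeping the command as a list rather than a single string avoids quoting mistakes when the comment contains spaces, and the list can be passed directly to `subprocess.run` on the alertmanager host.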
Links
- Prometheus query interface: https://prometheus.wmcloud.org
- Prometheus active alerts: https://prometheus-alerts.wmcloud.org
- Grafana project overview: https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1
Monitoring for Toolforge
There are metrics for every node in the Toolforge cluster.
Dashboards and handy links
If you want an overview of what's going on in the Cloud VPS infra, open these links:
Datacenter | What | Mechanism | Comments | Link |
---|---|---|---|---|
eqiad | NFS servers | icinga | labstore1xxx servers | [1] |
eqiad | NFS Server Statistics | grafana | labstore and cloudstore NFS operations, connections and various details | [2] |
eqiad | Cloud VPS main services | icinga | service servers, non virts | [3] |
codfw | Cloud VPS labtest servers | icinga | all physical servers | [4] |
eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [5] |
eqiad | ToolsDB (Toolforge R/W MariaDB) | grafana | Database metrics for ToolsDB servers | [6] |
eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [7] |
any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [8] |
eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [9] |
eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [10] |
eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [11] |
eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [12] |
eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [13] |
eqiad | Cloud VPS memcache | grafana | cloudservices servers | [14] |
eqiad | openstack database backend (per host) | grafana | mariadb/galera on cloudcontrols | [15] |
eqiad | openstack database backend (aggregated) | grafana | mariadb/galera on cloudcontrols | [16] |
eqiad | Toolforge | grafana | Arturo's metrics | [17] |
eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [18] |
eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [19] |
eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [20] |
eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [21] |
eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [22] |
eqiad | Toolforge email server | prometheus/grafana | dashboard showing data about Toolforge exim email server | [23] |