Portal:Cloud VPS/Admin/Runbooks/CephClusterInUnknown

The procedures in this runbook require admin permissions to complete.

We are not getting metrics from the Ceph cluster, so its state is unknown: it might or might not be running.

Debugging

Check the health details of the cluster:

root@cloudcephmon1001:~# ceph health detail

If the cluster is unhealthy, see Portal:Cloud VPS/Admin/Runbooks/CephClusterInError for more debugging guidelines and details.
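As a rough aid for the check above, the following sketch reduces the output of ceph health detail to its overall status token. The function name health_status is hypothetical and not part of any existing tooling; it assumes only that the first output line starts with HEALTH_OK, HEALTH_WARN or HEALTH_ERR.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# only the overall status token. Prints UNKNOWN when the first line does not
# start with a recognised token (for example, when there is no output).
health_status() {
    local first=""
    IFS= read -r first || true
    case "$first" in
        HEALTH_OK*)   echo HEALTH_OK ;;
        HEALTH_WARN*) echo HEALTH_WARN ;;
        HEALTH_ERR*)  echo HEALTH_ERR ;;
        *)            echo UNKNOWN ;;
    esac
}

# Possible usage on a mon host (not run here). The coreutils `timeout` wrapper
# is an assumption, added so the check returns even if the ceph CLI hangs:
#   timeout 30 ceph health detail | health_status
```

An UNKNOWN result here would suggest the command itself produced nothing usable, which is a different problem from an unhealthy cluster.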

Next, check whether the Prometheus node can reach the Ceph mgr daemon. To do so, log in to the Prometheus host and fetch the metrics endpoint:

dcaro@urcuchillay$ wm-ssh prometheus1005.eqiad.wmnet
dcaro@prometheus1005:~$ sudo -i
root@prometheus1005:~# curl -v http://cloudcephmon1001:9283/metrics

If you don't get any response, it might be a network issue or the mgr daemon might be misbehaving. In the latter case, you can try restarting the mgr daemon.
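To help tell a network problem from a misbehaving mgr, the sketch below maps the exit status of curl to a likely cause. The exit codes are the ones documented in curl(1); the function name explain_curl_exit and the suggested causes are assumptions for illustration, not a definitive diagnosis.

```shell
# Hypothetical helper: translate a curl exit status into a hint.
# Codes 6, 7, 22 and 28 are documented in curl(1); the causes in
# parentheses are guesses to guide the next step, nothing more.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok: endpoint answered" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: connection failed (mgr not listening, or traffic blocked)" ;;
        22) echo "http: server returned an error status (requires curl --fail)" ;;
        28) echo "timeout: no answer in time (network issue or hung mgr)" ;;
        *)  echo "other: see the EXIT CODES section of curl(1)" ;;
    esac
}

# Possible usage on the Prometheus host (not run here):
#   curl --silent --fail --max-time 10 --output /dev/null http://cloudcephmon1001:9283/metrics
#   explain_curl_exit $?
```

Before restarting anything, running ceph mgr services and ceph mgr module ls on a mon host may also help: the first lists the URL the active mgr publishes for the prometheus module, the second shows whether that module is enabled.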

If it works, continue digging on the Prometheus side. To find the running Prometheus instance for cloud, run:

root@prometheus1005:~# ps aux | grep prometheus | grep cloud
prometh+    1087  3.4  0.1 30494908 678072 ?     Ssl  Jun06 3462:03 /usr/bin/prometheus --storage.tsdb.path /srv/prometheus/cloud/metrics --web.listen-address 127.0.0.1:9904 --web.external-url https://prometheus-eqiad.wikimedia.org/cloud --storage.tsdb.retention 4032h --config.file /srv/prometheus/cloud/prometheus.yml --storage.tsdb.max-block-duration=24h --storage.tsdb.min-block-duration=2h --query.max-samples=10000000
prometh+    1115  5.4  0.0 923936 73660 ?        Ssl  Jun06 5502:21 /usr/bin/thanos sidecar --http-address 0.0.0.0:19904 --grpc-address 0.0.0.0:29904 --tsdb.path /srv/prometheus/cloud/metrics --prometheus.url http://localhost:9904/cloud --objstore.config-file /etc/thanos-sidecar@cloud/objstore.yaml --min-time=-15d --shipper.ignore-unequal-block-size

Then go from there, checking that instance's logs, configuration, and so on.
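As one possible starting point, the sketch below pulls the listen address and config file out of a Prometheus command line like the one in the ps output above, so they can be used to query the instance directly. The function name flag_value is hypothetical, and it handles only the "--flag value" form, not "--flag=value".

```shell
# Hypothetical helper: read a command line on stdin and print the word that
# follows the given flag. Prints nothing if the flag is absent or is the
# last word on the line.
flag_value() {
    awk -v flag="$1" '{
        for (i = 1; i < NF; i++) {
            if ($i == flag) { print $(i + 1); exit }
        }
    }'
}

# Possible usage on the Prometheus host (not run here). The /cloud path
# prefix is taken from the --web.external-url and --prometheus.url values
# shown in the ps output; /api/v1/targets is the standard Prometheus HTTP API:
#   cmd=$(ps -o args= -C prometheus | grep /cloud/)
#   addr=$(printf '%s\n' "$cmd" | flag_value --web.listen-address)
#   curl -s "http://${addr}/cloud/api/v1/targets"
```

The targets response lists each scrape target with its health and last error, which should show whether Prometheus itself considers the Ceph mgr endpoint reachable.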

Support contacts

Usually anyone in the WMCS team should be able to help debug the issue. The subject matter experts (SMEs) are Andrew Bogott and David Caro.

Related information

Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
Internal documentation: Portal:Cloud_VPS/Admin/Ceph
Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/

Example tasks

https://phabricator.wikimedia.org/T372528 - [ceph] Metrics started not responding during the drain