Portal:Cloud VPS/Admin/Runbooks/CephClusterInUnknown
We are not getting metrics from the Ceph cluster, so its state is unknown: it may or may not be running.
Debugging
Check the health details of the cluster:
root@cloudcephmon1001:~# ceph health detail
If the cluster is unhealthy, see Portal:Cloud VPS/Admin/Runbooks/CephClusterInError for more debugging guidelines and details.
Next, check whether the Prometheus node can reach the Ceph mgr daemon. To do so, log in to the Prometheus instance and request the metrics endpoint:
dcaro@urcuchillay$ wm-ssh prometheus1005.eqiad.wmnet
dcaro@prometheus1005:~# sudo -i
root@prometheus1005:~# curl -v http://cloudcephmon1001:9283/metrics
If you don't get any response, it might be a network issue, or the mgr daemon might be misbehaving. In the latter case, you can try restarting it.
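To help tell a network problem from a misbehaving mgr, the exit status of the curl check above can be mapped to a likely cause. The following is a minimal sketch, not part of the existing tooling: the function name explain_curl_exit is hypothetical, and the suggested causes are rough hints based on the exit codes documented in curl(1).

```shell
# Hypothetical helper: map a curl exit status to a likely cause for a
# failed metrics request. Exit codes are the ones listed in curl(1).
explain_curl_exit() {
    case "$1" in
        0)  echo "ok: endpoint answered" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: host unreachable or nothing listening on the port" ;;
        22) echo "http: endpoint returned an HTTP error (needs curl --fail)" ;;
        28) echo "timeout: no answer in time (network or hung mgr)" ;;
        *)  echo "other: curl exit status $1, see EXIT CODES in curl(1)" ;;
    esac
}

# Example, using the URL from the step above (left commented out so that
# sourcing this snippet does not make a network request):
# curl --silent --fail --max-time 10 --output /dev/null http://cloudcephmon1001:9283/metrics
# explain_curl_exit $?
```

A timeout or a failed connection while the host is otherwise reachable points more towards the mgr than the network. In that case, running ceph mgr services on a mon node should list the URL that the prometheus module is serving, which can be compared with the one being requested.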
If it works, continue digging on the Prometheus side. To find the Prometheus instance that we are running for cloud, you can run:
root@prometheus1005:~# ps aux | grep prometheus | grep cloud
prometh+ 1087 3.4 0.1 30494908 678072 ? Ssl Jun06 3462:03 /usr/bin/prometheus --storage.tsdb.path /srv/prometheus/cloud/metrics --web.listen-address 127.0.0.1:9904 --web.external-url https://prometheus-eqiad.wikimedia.org/cloud --storage.tsdb.retention 4032h --config.file /srv/prometheus/cloud/prometheus.yml --storage.tsdb.max-block-duration=24h --storage.tsdb.min-block-duration=2h --query.max-samples=10000000
prometh+ 1115 5.4 0.0 923936 73660 ? Ssl Jun06 5502:21 /usr/bin/thanos sidecar --http-address 0.0.0.0:19904 --grpc-address 0.0.0.0:29904 --tsdb.path /srv/prometheus/cloud/metrics --prometheus.url http://localhost:9904/cloud --objstore.config-file /etc/thanos-sidecar@cloud/objstore.yaml --min-time=-15d --shipper.ignore-unequal-block-size
And go from there (logs, config, etc.).
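The long command line above already says where the config and data for this instance live. As a convenience, a small helper can pull a single flag value out of it. This is a sketch only: the function name flag_value is hypothetical, and it assumes that flag values contain no spaces, as is the case in the output shown above.

```shell
# Hypothetical helper: print the value of a command-line flag found in a
# process command line read from stdin. Handles both "--flag value" and
# "--flag=value". Assumes values contain no spaces.
flag_value() {
    flag="$1"
    # Put each argument on its own line, then look for the flag.
    tr -s ' ' '\n' | awk -v f="$flag" '
        $0 == f               { if ((getline v) > 0) print v; exit }
        index($0, f "=") == 1 { print substr($0, length(f) + 2); exit }
    '
}

# Example (left commented out; run it on the Prometheus host):
# ps -o args= -C prometheus | grep cloud | flag_value --config.file
```

With the process list shown above, the example should print the path given to --config.file, which is the file to read next to see how the Ceph scrape targets are defined. The same helper works for --storage.tsdb.path or --web.listen-address.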
Support contacts
Usually anyone in the WMCS team should be able to help debug the issue. The subject matter experts (SMEs) are Andrew Bogott and David Caro.
Related information
- Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
- Internal documentation: Portal:Cloud_VPS/Admin/Ceph
- Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/
Example tasks
- https://phabricator.wikimedia.org/T372528 - [ceph] Metrics started not responding during the drain