Portal:Cloud VPS/Admin/Runbooks/CephClusterInWarning

The procedures in this runbook require admin permissions to complete.

The Ceph cluster is in warning status. This means that it is no longer highly available, or that something might be affecting its performance, but the cluster is still up and running.

Debugging

Check the health details of the cluster:

root@cloudcephmon1001:~# ceph health detail

For more debugging guidelines and details, see Portal:Cloud VPS/Admin/Runbooks/CephClusterInError.
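When several warnings are active at once, it can help to reduce the health output to the check codes only. The following is a minimal sketch, assuming the plain-text format shown in this runbook (lines such as `[WRN] MON_DISK_BIG: ...`); `list_health_checks` is a hypothetical helper name, not part of Ceph.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# one line per active health check as "<severity> <code>".
# Assumes check lines look like "[WRN] MON_DISK_BIG: ...".
list_health_checks() {
    sed -n 's/^\[\([A-Z]*\)\] \([A-Z0-9_]*\):.*/\1 \2/p'
}

# Usage on a mon host (not run here):
#   ceph health detail | list_health_checks
```

Each code it prints can then be matched against the sections of this runbook.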

Mons using a lot of disk space

We saw this once. The health details showed:

root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space
[WRN] MON_DISK_BIG: mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space
   mon.cloudcephmon1003 is 15 GiB >= mon_data_size_warn (15 GiB)
   mon.cloudcephmon1002 is 15 GiB >= mon_data_size_warn (15 GiB)
   mon.cloudcephmon1001 is 15 GiB >= mon_data_size_warn (15 GiB)

This usually means that the cluster is waiting for some backfilling or repair to finish, but it is taking too long for some reason, so the trimming of the mon datastores is not running (a regular working size right now is ~200MB).

The last time this happened, that was not the case, and we did not find the root cause.
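To compare the on-disk size of a mon datastore against the normal working size, you can measure the store directory directly on the mon host. This is a sketch under assumptions: the default path below follows Ceph's default mon data location (`/var/lib/ceph/mon/<cluster>-<id>`) with the cluster name `ceph` and the short hostname as mon id, which may not match this deployment, and `mon_store_size_kib` is a hypothetical helper name.

```shell
# Hypothetical helper: print the size of a mon datastore in KiB.
# The default path is an assumption (Ceph's default mon data location,
# cluster name "ceph", mon id equal to the short hostname); pass the real
# path as the first argument if the deployment differs.
mon_store_size_kib() {
    local store_dir="${1:-/var/lib/ceph/mon/ceph-$(hostname -s)/store.db}"
    du -sk "$store_dir" | awk '{print $1}'
}

# Usage on a mon host (not run here):
#   mon_store_size_kib
#   ceph config get mon mon_data_size_warn   # warning threshold, in bytes
```

Running it before and after a mon restart gives a rough indication of whether the trim actually happened.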

To fix it, you can force the mons to trigger a trim by restarting all of them, one at a time to keep the cluster alive:

Note: you can use the cookbook wmcs.ceph.roll_reboot_mon_daemons once this is merged.

root@cloudcephmon1001:~# systemctl restart ceph-mon@cloudcephmon1001
...
root@cloudcephmon1001:~# ceph status  # check that the mon is back online and in the cluster
 # repeat on the others
root@cloudcephmon1002:~# systemctl restart ceph-mon@cloudcephmon1002
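The manual sequence above could be sketched as a loop that restarts one mon, waits for it to rejoin the quorum, and only then moves on. This is an illustration only, not the implementation of the cookbook mentioned above. It assumes that the mon hosts are reachable over ssh as root by their short names, that `ceph quorum_status --format json` reports the current members under `quorum_names`, and that the wait times are reasonable for this cluster; `mon_in_quorum` and `roll_restart_mons` are hypothetical helper names.

```shell
# Hypothetical helper: succeed if mon "$1" is listed in "quorum_names" of
# the `ceph quorum_status --format json` output read from stdin.
mon_in_quorum() {
    local mon="$1"
    tr -d ' \n' \
        | sed -n 's/.*"quorum_names":\[\([^]]*\)\].*/\1/p' \
        | tr ',' '\n' \
        | grep -qx "\"$mon\""
}

# Hypothetical helper: restart each given mon in turn, waiting for it to
# rejoin the quorum before touching the next one.
# Assumes ssh access as root using the mon name as hostname.
roll_restart_mons() {
    local mon tries
    for mon in "$@"; do
        ssh "root@${mon}" systemctl restart "ceph-mon@${mon}" || return 1
        sleep 10  # give the quorum time to notice the restart
        tries=0
        until ceph quorum_status --format json | mon_in_quorum "$mon"; do
            tries=$((tries + 1))
            if [ "$tries" -ge 60 ]; then
                echo "mon ${mon} did not rejoin the quorum, stopping" >&2
                return 1
            fi
            sleep 5
        done
    done
}

# Usage (not run here):
#   roll_restart_mons cloudcephmon1001 cloudcephmon1002 cloudcephmon1003
```

Stopping at the first mon that fails to rejoin keeps the remaining mons untouched, so the cluster does not lose quorum while you investigate.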

Slow ops

See Portal:Cloud VPS/Admin/Runbooks/CephSlowOps.

Support contacts

Usually anyone in the WMCS team should be able to help debug the issue. The subject matter experts (SMEs) are Andrew Bogott and David Caro.

Related information

Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
Internal documentation: Portal:Cloud_VPS/Admin/Ceph
Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/

Example tasks

https://phabricator.wikimedia.org/T286649 - OSD daemon crash