Error / Incident
The Ceph cluster is in warning status. This means that it is no longer highly available, or that something might be affecting its performance, but the cluster is still up and running.
Check the health details of the cluster:
root@cloudcephmon1001:~# ceph health detail
For more debugging guidelines and details, see Portal:Cloud VPS/Admin/Runbooks/CephClusterInError.
Mons using a lot of disk space
We have seen this once. The health details showed:
root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space
[WRN] MON_DISK_BIG: mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space
    mon.cloudcephmon1003 is 15 GiB >= mon_data_size_warn (15 GiB)
    mon.cloudcephmon1002 is 15 GiB >= mon_data_size_warn (15 GiB)
    mon.cloudcephmon1001 is 15 GiB >= mon_data_size_warn (15 GiB)
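To pull just the affected mons out of that output, a small filter can help. This is a minimal sketch: `list_big_mons` is a hypothetical helper name, and it assumes the per-mon lines keep the `mon.<name> is <size> >= mon_data_size_warn` shape shown above.

```shell
#!/bin/bash
# Hypothetical helper: list the mons flagged by MON_DISK_BIG.
# Reads `ceph health detail` output on stdin and prints one
# "<mon name> <size> <unit>" line per affected mon.
list_big_mons() {
    awk '/^ *mon\.[^ ]+ is .* >= mon_data_size_warn/ {
        sub(/^mon\./, "", $1)   # drop the "mon." prefix from the name
        print $1, $3, $4
    }'
}

# Usage on a mon host:
#   ceph health detail | list_big_mons
```

Reading from stdin keeps the filter usable on saved output from an earlier incident as well as on a live cluster.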
This usually means that the cluster is waiting for some backfilling or repair to finish, but it is taking too long for some reason, and the trimming actions on the mons' data stores are not running (a normal working size right now is ~200MB).
In the last instance, that was not the case, but we did not find the root cause.
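To see how far a mon's data store is from the warning threshold, its on-disk size can be compared against `mon_data_size_warn`. The following is a sketch only: `check_mon_store` is a hypothetical helper, the store path in the usage comment assumes the default Ceph layout (`/var/lib/ceph/mon/ceph-<hostname>`), and it assumes `ceph config get` reports the threshold in bytes.

```shell
#!/bin/bash
# Hypothetical helper: compare a mon data directory against a size threshold.
#   $1: path to the mon data directory (assumed default layout, adapt as needed)
#   $2: warning threshold in bytes (e.g. the value of mon_data_size_warn)
# Prints "OK ..." and returns 0 when below the threshold,
# prints "BIG ..." and returns 1 when at or above it.
check_mon_store() {
    local store_dir="$1" warn_bytes="$2" used_kib used_bytes
    [ -d "$store_dir" ] || { echo "no such directory: $store_dir" >&2; return 2; }
    used_kib="$(du -sk "$store_dir" | awk '{print $1}')"
    used_bytes=$((used_kib * 1024))
    if [ "$used_bytes" -ge "$warn_bytes" ]; then
        echo "BIG ${used_bytes} >= ${warn_bytes}"
        return 1
    fi
    echo "OK ${used_bytes} < ${warn_bytes}"
}

# Usage on a mon host (path and config call are assumptions, see above):
#   check_mon_store "/var/lib/ceph/mon/ceph-$(hostname -s)" \
#       "$(ceph config get mon mon_data_size_warn)"
```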
To fix it, you can force the mons to trigger a trim by restarting all of them, one at a time to keep the cluster alive:
Note: you can use the cookbook wmcs.ceph.roll_reboot_mon_daemons once this is merged.
root@cloudcephmon1001:~# systemctl restart ceph-mon@cloudcephmon1001
...
root@cloudcephmon1001:~# ceph status  # check that the mon is back online and in the cluster
# repeat on the others
root@cloudcephmon1002:~# systemctl restart ceph-mon@cloudcephmon1002
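The one-by-one restart can also be sketched as a loop that waits for each mon to rejoin the quorum before touching the next one. This is an illustrative sketch, not the cookbook's implementation: `mon_in_quorum`, `roll_restart_mons`, `RUN_REMOTE`, `MAX_TRIES` and `WAIT_SECONDS` are hypothetical names, and it assumes SSH access as root to each mon host and that `ceph quorum_status` returns JSON with a `quorum_names` array.

```shell
#!/bin/bash
# Command used to reach a mon host; override it for a dry run.
RUN_REMOTE="${RUN_REMOTE:-ssh}"

# Succeeds if the given mon name appears in the quorum_names array
# of the `ceph quorum_status` JSON output.
mon_in_quorum() {
    ceph quorum_status --format json \
        | grep -o '"quorum_names": *\[[^]]*]' \
        | grep -q "\"$1\""
}

# Restart each mon passed as an argument, in order. Stops at the first
# mon that fails to restart or does not rejoin the quorum in time, so
# that at most one mon is down at any moment.
roll_restart_mons() {
    local mon tries
    for mon in "$@"; do
        echo "restarting ceph-mon@${mon}"
        $RUN_REMOTE "$mon" systemctl restart "ceph-mon@${mon}" || return 1
        tries=0
        until mon_in_quorum "$mon"; do
            tries=$((tries + 1))
            if [ "$tries" -ge "${MAX_TRIES:-30}" ]; then
                echo "mon ${mon} did not rejoin the quorum, stopping" >&2
                return 1
            fi
            sleep "${WAIT_SECONDS:-10}"
        done
        echo "mon ${mon} is back in quorum"
    done
}

# Usage, run from a host with the admin keyring:
#   roll_restart_mons cloudcephmon1001 cloudcephmon1002 cloudcephmon1003
```

Stopping on the first failure is deliberate: continuing after a mon fails to come back could take a second mon down and cost the cluster its quorum.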
Usually anyone in the WMCS team should be able to help debug the issue. The subject matter experts (SMEs) are Andrew Bogott and David Caro.
- Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
- Internal documentation: Portal:Cloud_VPS/Admin/Ceph
- Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/
- https://phabricator.wikimedia.org/T286649 - OSD daemon crash