Portal:Cloud VPS/Admin/Runbooks/CephClusterInError
Error / Incident
Ceph reports health problems; many different issues can cause this, so some guidelines follow. There are three health levels for Ceph (a quick way to check the current one is shown right after this list):
- Healthy (HEALTH_OK) -> everything is OK
- Warning (HEALTH_WARN) -> something is wrong, but the cluster is up and running
- Critical (HEALTH_ERR) -> something is wrong, and it is affecting the cluster
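For a quick check of the current level you can run ceph health (or ceph -s for a fuller status report) from any cluster member; the first word of the output is the health level. Illustrative output, matching the warning case used in the examples below:
dcaro@cloudcephosd1019:~$ sudo ceph health
HEALTH_WARN 1 daemons have recently crashed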
Debugging
Check cluster health details
SSH to any Ceph cluster member and run:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z
Daemon crash debugging
If the issue is a daemon crashing, you can get more information about the crash. First run:
dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *
to get the ID of the crash, then check the details with:
dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
    "backtrace": [
        "(()+0x12730) [0x7f2c99ba3730]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
        ...
        "(clone()+0x3f) [0x7f2c997464cf]"
    ],
    "ceph_version": "15.2.11",
    ...
}
That should give you some hints about what happened.
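If the backtrace is not enough, the daemon's own logs usually have more context. A minimal sketch, assuming the standard ceph-osd@<id> systemd unit naming and run on the host of the crashed daemon (cloudcephosd1007 in the example above):
dcaro@cloudcephosd1007:~$ sudo journalctl -u ceph-osd@6 --since "2021-07-14 13:00"  # unit name assumes the standard ceph-osd@<id> naming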
Clearing the crash
If you found and solved the issue and/or think it will not happen again, you can clear the crash report with:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
Or to archive all the new crashes:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive-all
Note that the crash report will not be removed, just no longer tagged as new (you can still see it with ceph crash ls or ceph crash info <ID>).
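For example, after archiving, the crash stops showing up in ceph crash ls-new but is still listed by ceph crash ls, now without the NEW marker (illustrative output):
dcaro@cloudcephosd1019:~$ sudo ceph crash ls
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6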
Damaged PG or Inconsistent PG
If the health issue looks like:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 6.c0 is active+clean+inconsistent, acting [9,89,16]
You can try to recover it by forcing a repair:
dcaro@cloudcephosd1019:~$ sudo ceph pg repair 6.c0
instructing pg 6.c0 on osd.9 to repair
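To see which objects of the PG are inconsistent (before or after the repair) and to confirm the repair worked, something like the following should help; note that rados list-inconsistent-obj only has data from the latest deep scrub of that PG:
dcaro@cloudcephosd1019:~$ sudo rados list-inconsistent-obj 6.c0 --format=json-pretty
dcaro@cloudcephosd1019:~$ sudo ceph health detail  # re-check once the repair scrub has finished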
Slow operations
We sometimes have issues with slow ops; the task tracking the current investigation is T334240.
To gather some extra information when slow operations happen, first find out which OSDs are having issues:
root@cloudcephmon1001:~# ceph health detail       # shows the osds
root@cloudcephmon1001:~# ceph osd find osd.218    # for each osd having issues
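The ceph osd find output tells you which host runs that OSD (illustrative and heavily trimmed; the exact fields vary between Ceph versions):
root@cloudcephmon1001:~# ceph osd find osd.218
{
    "osd": 218,
    ...
    "crush_location": {
        "host": "<osd host>",
        ...
    }
}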
Then, on the OSD host that owns the problematic OSD(s):
root@<osd host>:~# ceph daemon osd.218 dump_ops_in_flight
That gives you the operations currently in flight on that OSD, how long each has been stuck, and the last recorded status for each.
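Other admin socket commands on the same OSD host can add context, for example the recently completed operations (with their durations) and any currently blocked ones; a sketch using the same example OSD:
root@<osd host>:~# ceph daemon osd.218 dump_historic_ops   # recently completed ops and their durations
root@<osd host>:~# ceph daemon osd.218 dump_blocked_ops    # ops currently blocked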
If you suspect it might be related to the task above, please create a paste with the output and add it to the task.
Support contacts
Usually anyone in the WMCS team should be able to help debug the issue; the subject matter experts (SMEs) are Andrew Bogott and David Caro.
Related information
- Icinga status check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Ceph+Cluster+Health
- Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
- Internal documentation: Portal:Cloud_VPS/Admin/Ceph
- Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/
Example tasks
- https://phabricator.wikimedia.org/T286649 - OSD daemon crash