Portal:Cloud VPS/Admin/Runbooks/CephClusterInError
Error / Incident
Ceph reports health problems. Many different issues can cause this; some guidelines follow. There are three health levels for Ceph:
- Healthy -> everything is OK
- Warning -> something is wrong, but the cluster is up and running
- Critical -> something is wrong, and it is affecting the cluster
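A quick way to see which level the cluster is at is to run ceph status (or its shorthand ceph -s) on any cluster member; the health line reports HEALTH_OK, HEALTH_WARN or HEALTH_ERR, matching the three levels above. The output below is trimmed and illustrative, reusing the crash example from this page:
dcaro@cloudcephosd1019:~$ sudo ceph status
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
  ...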
Debugging
Check cluster health details
SSH to any Ceph cluster member and run:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z
Daemon crash debugging
If the issue is a daemon crashing, you can see more information about the crash by running:
dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *
That gives you the ID of the crash; you can then get more details with:
dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
    "backtrace": [
        "(()+0x12730) [0x7f2c99ba3730]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
        ...
        "(clone()+0x3f) [0x7f2c997464cf]"
    ],
    "ceph_version": "15.2.11",
    ...
}
That should give you some hints to find out what happened.
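If the backtrace alone is not enough, it usually helps to look at the daemon's own logs on the affected host. A minimal sketch, assuming the standard ceph-osd@<id> systemd units (here osd.6 on cloudcephosd1007, taken from the crash report above) and a hypothetical time window around the crash:
dcaro@cloudcephosd1007:~$ sudo journalctl -u ceph-osd@6 --since '2021-07-14 13:00:00'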
Clearing the crash
Once you have found and solved the issue, or you think it will not happen again, you can clear the crash report with:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
Or to archive all the new crashes:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive-all
Note that the crash report will not be removed, just no longer tagged as new (you can still see it with ceph crash ls or ceph crash info <ID>).
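For example, an archived crash still shows up in the full listing, just without the * in the NEW column (output sketched from the example above):
dcaro@cloudcephosd1019:~$ sudo ceph crash ls
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6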
Damaged PG or Inconsistent PG
If the health issue looks like:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 6.c0 is active+clean+inconsistent, acting [9,89,16]
You can try to recover it by forcing a repair:
dcaro@cloudcephosd1019:~$ sudo ceph pg repair 6.c0
instructing pg 6.c0 on osd.9 to repair
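Before forcing the repair, you can also check which objects are actually inconsistent with rados list-inconsistent-obj (this only returns data while the scrub error information is still available):
dcaro@cloudcephosd1019:~$ sudo rados list-inconsistent-obj 6.c0 --format=json-pretty
The repair itself runs asynchronously; re-run ceph health detail (or watch ceph -w) until the PG goes back to active+clean.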
Slow operations
See Portal:Cloud VPS/Admin/Runbooks/CephSlowOps.
Support contacts
Usually anyone on the WMCS team should be able to help debug the issue; the subject matter experts (SMEs) are Andrew Bogott and David Caro.
Related information
- Icinga status check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Ceph+Cluster+Health
- Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
- Internal documentation: Portal:Cloud_VPS/Admin/Ceph
- Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/
Example tasks
- https://phabricator.wikimedia.org/T286649 - OSD daemon crash