Portal:Cloud VPS/Admin/Runbooks/CephClusterInError
Error / Incident
Ceph reports health problems; many different issues can cause this, so some guidelines follow. There are three health levels for Ceph (a quick way to check the current one is shown right after this list):
- Healthy (HEALTH_OK) -> everything is OK
- Warning (HEALTH_WARN) -> something is wrong, but the cluster is up and running
- Critical (HEALTH_ERR) -> something is wrong, and it is affecting the cluster
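For a quick check of the current level you can run ceph health (or ceph -s for a fuller status report) from any cluster member; the first word of the output is the health level. Illustrative output, matching the warning case used in the examples below:
dcaro@cloudcephosd1019:~$ sudo ceph health
HEALTH_WARN 1 daemons have recently crashed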
Debugging
Check cluster health details
SSH to any Ceph cluster member and run:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z
Daemon crash debugging
If the issue is a daemon crashing, you can get more information about the crash. First run:
dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *
to get the ID of the crash, then check the details with:
dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
    "backtrace": [
        "(()+0x12730) [0x7f2c99ba3730]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
        ...
        "(clone()+0x3f) [0x7f2c997464cf]"
    ],
    "ceph_version": "15.2.11",
    ...
}
That should give you some hints about what happened.
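If the backtrace is not enough, the daemon's own logs usually have more context. A minimal sketch, assuming the standard ceph-osd@<id> systemd unit naming and run on the host of the crashed daemon (cloudcephosd1007 in the example above):
dcaro@cloudcephosd1007:~$ sudo journalctl -u ceph-osd@6 --since "2021-07-14 13:00"  # unit name assumes the standard ceph-osd@<id> naming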
Clearing the crash
If you found and solved the issue and/or think it will not happen again, you can clear the crash report with:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
Or to archive all the new crashes:
dcaro@cloudcephosd1019:~$ sudo ceph crash archive-all
Note that the crash report will not be removed, just no longer tagged as new (you can still see it with ceph crash ls or ceph crash info <ID>).
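For example, after archiving, the crash stops showing up in ceph crash ls-new but is still listed by ceph crash ls, now without the NEW marker (illustrative output):
dcaro@cloudcephosd1019:~$ sudo ceph crash ls
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6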
Damaged PG or Inconsistent PG
If the health issue looks like:
dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 6.c0 is active+clean+inconsistent, acting [9,89,16]
You can try to recover it by forcing a repair:
dcaro@cloudcephosd1019:~$ sudo ceph pg repair 6.c0
instructing pg 6.c0 on osd.9 to repair
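To see which objects of the PG are inconsistent (before or after the repair) and to confirm the repair worked, something like the following should help; note that rados list-inconsistent-obj only has data from the latest deep scrub of that PG:
dcaro@cloudcephosd1019:~$ sudo rados list-inconsistent-obj 6.c0 --format=json-pretty
dcaro@cloudcephosd1019:~$ sudo ceph health detail  # re-check once the repair scrub has finished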
Slow operations
We sometimes have issues with slow ops; the task tracking the current investigation is T334240.
To gather some extra information when slow operations happen, first find out which OSDs are having issues:
root@cloudcephmon1001:~# ceph health detail       # shows the osds
root@cloudcephmon1001:~# ceph osd find osd.218    # for each osd having issues
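The ceph osd find output tells you which host runs that OSD (illustrative and heavily trimmed; the exact fields vary between Ceph versions):
root@cloudcephmon1001:~# ceph osd find osd.218
{
    "osd": 218,
    ...
    "crush_location": {
        "host": "<osd host>",
        ...
    }
}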
Then, on the OSD host that owns the problematic OSD(s):
root@<osd host>:~# ceph daemon osd.218 dump_ops_in_flight
That gives you the operations currently in flight on that OSD, how long each has been stuck, and the last recorded status for each.
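Other admin socket commands on the same OSD host can add context, for example the recently completed operations (with their durations) and any currently blocked ones; a sketch using the same example OSD:
root@<osd host>:~# ceph daemon osd.218 dump_historic_ops   # recently completed ops and their durations
root@<osd host>:~# ceph daemon osd.218 dump_blocked_ops    # ops currently blocked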
If you suspect it might be related to the task above, please create a paste with the output and add it to the task.
Support contacts
Usually anyone in the WMCS team should be able to help debug the issue; the subject matter experts (SMEs) are Andrew Bogott and David Caro.
Related information
- Icinga status check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Ceph+Cluster+Health
- Grafana dashboard: https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
- Internal documentation: Portal:Cloud_VPS/Admin/Ceph
- Upstream documentation: https://docs.ceph.com/docs/master/rados/operations/monitoring/
Example tasks
- https://phabricator.wikimedia.org/T286649 - OSD daemon crash