Portal:Cloud VPS/Admin/Runbooks/CephClusterInError

The procedures in this runbook require admin permissions to complete.

Error / Incident

Ceph reports health problems. There can be many different issues causing this; some guidelines follow. There are three health levels for Ceph:

  • Healthy (HEALTH_OK) -> everything is OK
  • Warning (HEALTH_WARN) -> something is wrong, but the cluster is up and running
  • Critical (HEALTH_ERR) -> something is wrong, and it's affecting the cluster
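
You can see which level the cluster is currently at with ceph status (the HEALTH_OK output below is illustrative):

dcaro@cloudcephosd1019:~$ sudo ceph status
  cluster:
    id:     ...
    health: HEALTH_OK
...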

Debugging

Check cluster health details

SSH to any Ceph cluster member and run:

dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
   osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z
Daemon crash debugging

If the issue is a daemon crashing, you can see more information about the crash by running:

dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *

That gives you the ID of the crash; check more info about it with:

dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
   "backtrace": [
       "(()+0x12730) [0x7f2c99ba3730]",
       "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
...
       "(clone()+0x3f) [0x7f2c997464cf]"
   ],
   "ceph_version": "15.2.11",
...
}

That should give you some hints for finding out what happened.
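
The daemon's logs on the affected host can add more context. A minimal sketch, assuming systemd-managed daemons and the osd.6 crash on cloudcephosd1007 from the example above:

root@cloudcephosd1007:~# journalctl -u ceph-osd@6 --since "2021-07-14 13:00" --until "2021-07-14 14:00"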

Clearing the crash

If you have found and solved the issue, and/or think it will not happen again, you can clear the crash report with:

dcaro@cloudcephosd1019:~$ sudo ceph crash archive 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f

Or to archive all the new crashes:

dcaro@cloudcephosd1019:~$ sudo ceph crash archive-all

Note that the crash report will not be removed, it will just no longer be tagged as new (you can still see it with ceph crash <ls|info ID>).
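
To double-check the result, the list of new crashes should now be empty while the full list still contains the archived report:

dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new   # should show no entries anymore
dcaro@cloudcephosd1019:~$ sudo ceph crash ls       # archived reports are still listed here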

Damaged PG or Inconsistent PG

If the health issue looks like:

dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.c0 is active+clean+inconsistent, acting [9,89,16]

You can try to recover it by forcing a repair:

dcaro@cloudcephosd1019:~$ sudo ceph pg repair 6.c0
instructing pg 6.c0 on osd.9 to repair
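
The repair runs asynchronously and can take a while. You can re-check the cluster health until the PG goes back to active+clean, and inspect the PG state in detail (pg 6.c0 taken from the example above):

dcaro@cloudcephosd1019:~$ sudo ceph health detail   # repeat until the inconsistency clears
dcaro@cloudcephosd1019:~$ sudo ceph pg 6.c0 query   # detailed state of the pg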

Slow operations

We sometimes have issues with slow ops; the task tracking the current investigation is T334240.

To gather some extra information when slow operations happen, first find out which OSDs are having issues:

root@cloudcephmon1001:~# ceph health detail  # shows the affected osds
root@cloudcephmon1001:~# ceph osd find osd.218  # for each osd having issues, shows the host it runs on

Then, on the OSD host that owns the problematic OSD(s), as reported by ceph osd find:

root@cloudcephosd1019:~# ceph daemon osd.218 dump_ops_in_flight

That gives you the operations that are in flight on that OSD, how long each has been stuck, and the last recorded state of each.
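
Slow operations that have already completed can be inspected through the same admin socket interface; a sketch assuming the same OSD host and osd.218 from above:

root@cloudcephosd1019:~# ceph daemon osd.218 dump_historic_ops   # recently completed ops, including their durations
root@cloudcephosd1019:~# ceph daemon osd.218 dump_blocked_ops    # ops currently blocked, if any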

If you suspect it might be related to the task above, please create a paste with the gathered output and link it in the task.

Support contacts

Usually anyone on the WMCS team should be able to help debug the issue; the subject matter experts (SMEs) are Andrew Bogott and David Caro.

Related information

Example tasks