Data Platform/Systems/Ceph/Troubleshooting

From Wikitech

Detecting Disk Failures

We can use the following Cumin command to check for evidence of failing disks across the cluster.

sudo cumin A:cephosd 'dmesg -T |egrep -i "medium|i\/o error|sector|Prefailure"'

Output that indicates a failing disk might look like the following example.

===== NODE GROUP =====                                                                                                                                                                                             
(1) cephosd1001.eqiad.wmnet                                                                                                                                                                                        
----- OUTPUT of 'dmesg -T |egrep ...ctor|Prefailure"' ----- 
[Sun May 18 21:16:29 2025] sd 0:0:14:0: [sdk] tag#9028 Sense Key : Medium Error [current]                                                                                                                          
[Sun May 18 21:16:29 2025] critical medium error, dev sdk, sector 14029512 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 2

At this point, we can identify that the host with the problematic disk is cephosd1001 and the device affected is /dev/sdk.

If these errors are widespread or frequent, then we may wish to look at replacing the disk before it fails.
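To judge how widespread the errors are, it can help to count the error lines per device. The following is a minimal sketch: it reads a sample of dmesg output from a here-doc so the filter can be demonstrated; on a real host you would pipe in `sudo dmesg -T` instead.

```shell
# Sample dmesg output, as shown above; replace with `sudo dmesg -T` on a real host.
dmesg_sample=$(cat <<'EOF'
[Sun May 18 21:16:29 2025] sd 0:0:14:0: [sdk] tag#9028 Sense Key : Medium Error [current]
[Sun May 18 21:16:29 2025] critical medium error, dev sdk, sector 14029512 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 2
[Sun May 18 21:20:02 2025] critical medium error, dev sdk, sector 14029520 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 2
EOF
)
# Extract the device name from "dev sdX" occurrences and count errors per device.
echo "$dmesg_sample" \
  | grep -Eio 'dev sd[a-z]+' \
  | awk '{count[$2]++} END {for (d in count) print d, count[d]}'
```

For the sample above this prints `sdk 2`, showing two error lines for /dev/sdk.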

Identifying OSD associated with disk failures

We can identify which OSD is associated with this device by using the command:

sudo ceph device ls-by-host ${HOSTNAME}|grep ${DEVICE}

With the example above, this would be:

btullis@cephosd1001:~$ sudo ceph device ls-by-host cephosd1001|grep sdk
SEAGATE_ST18000NM006J_ZR5BNZ5V                 sdk      osd.3

This shows that OSD 3 is associated with the affected drive. You can get more detailed information with the ceph-volume lvm list command, optionally limiting it to the OSD in which you are interested, e.g.

btullis@cephosd1001:~$ sudo ceph-volume lvm list 3

====== osd.3 =======

  [block]       /dev/ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4/osd-block-8513dc17-cdf8-4de4-9d0a-366bf4554487

      block device              /dev/ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4/osd-block-8513dc17-cdf8-4de4-9d0a-366bf4554487
      block uuid                1Ntrqx-ygSO-G6vV-0iku-SW2M-H1jz-r37OwU
      cephx lockbox secret      
      cluster fsid              6d4278e1-ea45-4d29-86fe-85b44c150813
      cluster name              ceph
      crush device class        hdd
      db device                 /dev/nvme0n1p4
      db uuid                   90d25982-ab90-4682-8710-56a1339cb0e2
      encrypted                 0
      osd fsid                  8513dc17-cdf8-4de4-9d0a-366bf4554487
      osd id                    3
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/sdk

  [db]          /dev/nvme0n1p4

      PARTUUID                  90d25982-ab90-4682-8710-56a1339cb0e2
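If you need this mapping in a script, ceph-volume can also emit JSON (via `--format json`), which is easier to filter with jq. The sketch below uses a trimmed, hand-written sample of that JSON (only the fields we need) so the jq filter can be demonstrated; on a real OSD host you would feed it `sudo ceph-volume lvm list 3 --format json` instead.

```shell
# Trimmed sample of `ceph-volume lvm list 3 --format json` output
# (illustrative; only the fields used by the filter are included).
lvm_json=$(cat <<'EOF'
{
  "3": [
    {"type": "block", "devices": ["/dev/sdk"]},
    {"type": "db",    "devices": ["/dev/nvme0n1"]}
  ]
}
EOF
)
# Select the block LV for osd.3 and print its underlying physical device(s).
echo "$lvm_json" | jq -r '."3"[] | select(.type == "block") | .devices[]'
```

For the sample above this prints `/dev/sdk`.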

Identifying the physical slot associated with an OSD or device

Identifying which drive bay is associated with an OSD or a specific device can be difficult. Ceph has no native support for this, but when arranging for the replacement of a disk we need to know the controller, enclosure, and slot numbers. Two tools can help us here.

  • udevadm can obtain the serial number associated with a device
  • perccli64 will also show us the serial numbers of the drives that are loaded into each slot.

We have a puppet fact that is based on the JSON output of the `perccli /call show all J` command.

We can obtain the serial number of our device:

btullis@cephosd1001:~$ sudo udevadm info --query=property --property=ID_SCSI_SERIAL --value /dev/sdk 
ZR5BNZ5V

We can then obtain the identifier for the controller, enclosure, and slot number for a disk with a given serial number:

btullis@cephosd1001:~$ sudo facter -p ceph_disks -j | jq '.ceph_disks[].disks|with_entries(select(.value.serial=="ZR5BP8E6"))'
{
  "c0/e23/s8": {
    "controller": "0",
    "enclosure": "23",
    "interface": "SAS",
    "medium": "HDD",
    "serial": "ZR5BP8E6",
    "slot": "8",
    "wwn": "5000C500D9BB3FA4"
  }
}

In this case, the enclosure number is 23 and the slot number is 8. The device to be absented from puppet is: c0e23s8
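The serial-to-slot lookup can be parameterised so it is easy to reuse. The sketch below runs the same jq filter over a trimmed, hand-written sample of the ceph_disks fact (the "controller_0" key is illustrative); on a real host you would feed it `sudo facter -p ceph_disks -j` instead.

```shell
# Trimmed sample of the ceph_disks fact JSON (illustrative; a real host
# would produce this via `sudo facter -p ceph_disks -j`).
facts_json=$(cat <<'EOF'
{
  "ceph_disks": {
    "controller_0": {
      "disks": {
        "c0/e23/s8": {"serial": "ZR5BP8E6", "enclosure": "23", "slot": "8"}
      }
    }
  }
}
EOF
)
serial="ZR5BP8E6"
# Print the controller/enclosure/slot key whose serial matches.
echo "$facts_json" \
  | jq -r --arg serial "$serial" \
      '.ceph_disks[].disks | to_entries[] | select(.value.serial == $serial) | .key'
```

For the sample above this prints `c0/e23/s8`.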

Replacing Failed Disks

Once we have identified which disk is failing and which physical slot it occupies, we can remove it from the cluster and arrange for a replacement.

Its identifier can be added to the profile::ceph::osd::absent_osds list in hiera, which will cause the OSD to be removed from the cluster, ready for replacement.
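As a sketch, the hiera change might look like the following; the hieradata file path is an assumption, so check the host's actual hiera hierarchy before applying it.

```yaml
# hieradata/hosts/cephosd1001.yaml (path is illustrative)
profile::ceph::osd::absent_osds:
  - c0e23s8
```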

Common Alerts

CephOSDClusterInWarning

Inconsistent Placement Groups

This ticket contains comprehensive troubleshooting for this alert. We'll move the content here soon.

Ceph provides the persistent storage for DPE applications running in the dse-k8s-cluster. As such, problems with Ceph can lead to instability with k8s applications.

For example, this ticket covers a recent alert for FIRING: AirflowDeploymentUnavailable: airflow-scheduler.airflow-search is unavailable on dse-k8s-eqiad, which was directly caused by Ceph.

You may be able to test this by exec'ing into the pod, navigating to its volume mounts, and attempting simple operations such as ls or touch hello.txt.
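A minimal sketch of such a check is below. The pod name, namespace, and mount path in the comments are assumptions to be substituted with your own; the probe function itself only needs a directory path, so it can be run against the volume mount after exec'ing into the pod.

```shell
# Simple read/write probe for a mounted volume. Run inside the pod, e.g.:
#   kubectl exec -it <airflow-scheduler-pod> -n <namespace> -- /bin/sh
# then call check_mount with the volume mount path (path is an assumption).
check_mount() {
  local dir="$1"
  ls "$dir" > /dev/null || { echo "FAIL: cannot list $dir"; return 1; }
  touch "$dir/hello.txt"  || { echo "FAIL: cannot write to $dir"; return 1; }
  rm -f "$dir/hello.txt"
  echo "OK: $dir is readable and writable"
}
```

If the volume is healthy the function prints an OK line; a hang or a FAIL line suggests the underlying Ceph storage is the problem.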