Data Platform/Systems/Ceph/Troubleshooting
Detecting Disk Failures
We can use the following Cumin command to check for evidence of failing disks across the cluster.
sudo cumin A:cephosd 'dmesg -T |egrep -i "medium|i\/o error|sector|Prefailure"'
Output that indicates a failing disk might look like the following example.
===== NODE GROUP =====
(1) cephosd1001.eqiad.wmnet
----- OUTPUT of 'dmesg -T |egrep ...ctor|Prefailure"' -----
[Sun May 18 21:16:29 2025] sd 0:0:14:0: [sdk] tag#9028 Sense Key : Medium Error [current]
[Sun May 18 21:16:29 2025] critical medium error, dev sdk, sector 14029512 op 0x0:(READ) flags 0x0 phys_seg 15 prio class 2
At this point, we can identify that the host with the problematic disk is cephosd1001 and the device affected is /dev/sdk.
If these errors are widespread or frequent, then we may wish to look at replacing the disk before it fails.
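As an illustrative sketch (not part of our tooling), the filtering performed by the Cumin command above can be reproduced in Python to collect the set of affected devices from dmesg output. The `dev sdX` pattern and the sample lines are assumptions based on the example output above.

```python
import re

# Match the device name in kernel lines such as:
#   "critical medium error, dev sdk, sector 14029512 ..."
DEV_RE = re.compile(r"\bdev (sd[a-z]+)\b")
# Same error keywords as the egrep filter in the Cumin command
ERROR_RE = re.compile(r"medium|i/o error|sector|prefailure", re.IGNORECASE)

def failing_devices(dmesg_lines):
    """Return the set of device names mentioned in disk-error lines."""
    devices = set()
    for line in dmesg_lines:
        if ERROR_RE.search(line):
            match = DEV_RE.search(line)
            if match:
                devices.add(match.group(1))
    return devices

sample = [
    "[Sun May 18 21:16:29 2025] sd 0:0:14:0: [sdk] tag#9028 Sense Key : Medium Error [current]",
    "[Sun May 18 21:16:29 2025] critical medium error, dev sdk, sector 14029512 op 0x0:(READ)",
]
print(failing_devices(sample))  # → {'sdk'}
```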
Identifying OSD associated with disk failures
We can identify which OSD is associated with this device by using the command:
sudo ceph device ls-by-host ${HOSTNAME}|grep ${DEVICE}
With the example above, this would be:
btullis@cephosd1001:~$ sudo ceph device ls-by-host cephosd1001|grep sdk
SEAGATE_ST18000NM006J_ZR5BNZ5V sdk osd.3
This shows that OSD 3 is associated with the affected drive. You can get more detailed information with the ceph-volume lvm list command, optionally limiting it to the OSD you are interested in, e.g.
btullis@cephosd1001:~$ sudo ceph-volume lvm list 3
====== osd.3 =======
[block] /dev/ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4/osd-block-8513dc17-cdf8-4de4-9d0a-366bf4554487
block device /dev/ceph-16404e38-f89c-4245-9e7e-4ebdea6399b4/osd-block-8513dc17-cdf8-4de4-9d0a-366bf4554487
block uuid 1Ntrqx-ygSO-G6vV-0iku-SW2M-H1jz-r37OwU
cephx lockbox secret
cluster fsid 6d4278e1-ea45-4d29-86fe-85b44c150813
cluster name ceph
crush device class hdd
db device /dev/nvme0n1p4
db uuid 90d25982-ab90-4682-8710-56a1339cb0e2
encrypted 0
osd fsid 8513dc17-cdf8-4de4-9d0a-366bf4554487
osd id 3
osdspec affinity
type block
vdo 0
devices /dev/sdk
[db] /dev/nvme0n1p4
PARTUUID 90d25982-ab90-4682-8710-56a1339cb0e2
Identifying the physical slot associated with an OSD or device
Identifying which drive bay is associated with an OSD or a specific device is sometimes difficult. Ceph doesn't have any native support for this, but when arranging for the replacement of a disk we need to know the controller, enclosure and slot numbers. We have two things that can help us here.
udevadm can obtain the serial number associated with a device.
perccli64 will also show us the serial numbers of the drives that are loaded into each slot.
We have a puppet fact that is based on the JSON output of the `perccli /call show all J` command.
We can obtain the serial number of our device:
btullis@cephosd1001:~$ sudo udevadm info --query=property --property=ID_SCSI_SERIAL --value /dev/sdk
ZR5BNZ5V
We can then obtain the identifier for the controller, enclosure, and slot number for the disk with this serial number:
btullis@cephosd1001:~$ sudo facter -p ceph_disks -j | jq '.ceph_disks[].disks|with_entries(select(.value.serial=="ZR5BNZ5V"))'
{
  "c0/e23/s8": {
    "controller": "0",
    "enclosure": "23",
    "interface": "SAS",
    "medium": "HDD",
    "serial": "ZR5BNZ5V",
    "slot": "8",
    "wwn": "5000C500D9BB3FA4"
  }
}
In this case, the controller number is 0, the enclosure number is 23, and the slot number is 8. The identifier of the device to be absented from puppet is therefore: c0e23s8
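The lookup that jq performs above can also be sketched in Python, which makes the mapping from serial number to puppet identifier explicit. The JSON structure mirrors the facter output shown above, trimmed to the relevant fields; the helper name is illustrative.

```python
import json

# Example data based on the `facter -p ceph_disks -j` output above,
# trimmed to a single controller entry.
FACTER_OUTPUT = json.loads("""
{
  "ceph_disks": [
    {
      "disks": {
        "c0/e23/s8": {
          "controller": "0",
          "enclosure": "23",
          "serial": "ZR5BNZ5V",
          "slot": "8"
        }
      }
    }
  ]
}
""")

def slot_for_serial(facter_data, serial):
    """Return the puppet identifier (e.g. 'c0e23s8') for a drive serial."""
    for controller in facter_data["ceph_disks"]:
        for slot_id, disk in controller["disks"].items():
            if disk["serial"] == serial:
                # 'c0/e23/s8' becomes 'c0e23s8' for the absent_osds list
                return slot_id.replace("/", "")
    return None

print(slot_for_serial(FACTER_OUTPUT, "ZR5BNZ5V"))  # → c0e23s8
```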
Replacing Failed Disks
Once we have identified which disk is failing and which physical slot it occupies, we can remove it from the cluster and arrange for a replacement.
Its identifier can be added to the profile::ceph::osd::absent_osds list in hiera, which will cause the OSD to be removed from the cluster, ready for replacement.
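The hiera entry might look like the following sketch. The key name comes from the paragraph above; the list format and the exact file in which it lives are assumptions.

```yaml
# Hypothetical hiera snippet; the identifier is the controller/enclosure/slot
# value derived earlier (e.g. c0e23s8).
profile::ceph::osd::absent_osds:
  - c0e23s8
```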
Common Alerts
CephOSDClusterInWarning
Inconsistent Placement Groups
This ticket contains comprehensive troubleshooting for this alert. We'll move the content here soon.
Related alerts
Ceph provides the persistent storage for DPE applications running in the dse-k8s-cluster. As such, problems with Ceph can lead to instability with k8s applications.
For example, this ticket covers a recent alert, FIRING: AirflowDeploymentUnavailable: airflow-scheduler.airflow-search is unavailable on dse-k8s-eqiad, which was directly caused by Ceph.
You may be able to test this by exec'ing into the affected pod, navigating to its volume mounts, and attempting simple operations such as ls or touch hello.txt.