SRE/Dc-operations/Sw raid rebuild directions
When a defective disk in a software RAID (sw raid) array is swapped out, the array is not rebuilt automatically. The new disk has to be added back with the following procedure (the wdqs2007 disk replacement from this task is used as an example):
check to see if the new disk is detected along with the existing disks:
sudo lshw -class disk
copy the partition table of sda to sdh (sdh is the disk that was replaced); note that the target disk is the argument to -R and the source disk comes last
sudo sgdisk -R /dev/sdh /dev/sda
randomize the disk and partition GUIDs on sdh so they do not duplicate those copied from sda
sudo sgdisk -G /dev/sdh
print the partition table of both disks and check that they now match
sudo sgdisk -p /dev/sda
sudo sgdisk -p /dev/sdh
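Comparing the two listings by eye is error-prone on disks with several partitions. A minimal sketch of an automated comparison follows. The helper name `partition_rows` is hypothetical, and the sketch assumes the usual `sgdisk -p` layout in which each partition row starts with the partition number followed by the start and end sectors.

```shell
#!/bin/bash
# Hypothetical helper: reduce `sgdisk -p` output (read from stdin) to
# "number start end" for each partition row. The disk header is skipped
# because it legitimately differs between disks (device name, disk GUID).
partition_rows() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { print $1, $2, $3 }'
}

# Usage on a live host (needs root and real devices, so left commented out):
#   sudo sgdisk -p /dev/sda | partition_rows > /tmp/sda.rows
#   sudo sgdisk -p /dev/sdh | partition_rows > /tmp/sdh.rows
#   diff /tmp/sda.rows /tmp/sdh.rows && echo "partition layouts match"
```

An empty `diff` means the partition numbers and sector ranges are identical; the GUIDs are expected to differ after `sgdisk -G`.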
add the new SSD back into the array
sudo mdadm --manage /dev/md0 --add /dev/sdh2
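The steps so far can be collected into a dry-run sketch that only prints the commands for a given pair of disks, so the argument order can be reviewed before anything is run. The function name `print_rebuild_commands` and its argument order are hypothetical, and the sketch assumes sdX-style device names where the RAID member is simply the disk name followed by the partition number.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print (never execute) the commands for
# re-adding a replaced disk.
# Arguments: healthy disk, replaced disk, md array, RAID partition number.
# Assumes sdX-style names (<disk><number>); NVMe devices use a "p"
# separator before the partition number and would need adjusting.
print_rebuild_commands() {
    local good="$1" new="$2" array="$3" part="$4"
    echo "sudo sgdisk -R ${new} ${good}"   # copy table: target first, source last
    echo "sudo sgdisk -G ${new}"           # randomize GUIDs on the new disk
    echo "sudo sgdisk -p ${good}"          # print both tables for comparison
    echo "sudo sgdisk -p ${new}"
    echo "sudo mdadm --manage ${array} --add ${new}${part}"
}

# Reproduces the wdqs2007 example from this page.
print_rebuild_commands /dev/sda /dev/sdh /dev/md0 2
```

Printing instead of executing is a deliberate choice here: `sgdisk -R` overwrites the partition table of its target, so swapping the two disks by mistake would damage the healthy one.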
In some cases, mdraid may automatically create a separate sw raid and add the new disk to it. This shows up as the mdadm --add command above returning:
mdadm: Cannot open /dev/sdh2: Device or resource busy
and /proc/mdstat showing an inactive raid that isn't md0, for instance:
cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : inactive sda2[7](S)
      937267200 blocks super 1.2
md0 : active raid1 sdb2[1]
      937267200 blocks super 1.2 [2/1] [_U]
      bitmap: 6/7 pages [24KB], 65536KB chunk
In that case, note the name of the inactive raid (md127 in this example) and stop it:
sudo mdadm --manage /dev/md127 --stop
then retry adding the device to /dev/md0
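Spotting the stray array can be sketched as a small filter over /proc/mdstat. The helper name `inactive_arrays` is hypothetical, and the sketch assumes the `name : state ...` line format shown in the example output on this page.

```shell
#!/bin/bash
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# the name of every array whose state field is "inactive".
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Typical use on a live host:
#   inactive_arrays < /proc/mdstat
# Each name printed (md127 in the example on this page) is a candidate for:
#   sudo mdadm --manage /dev/<name> --stop
```

The sketch only lists candidates; confirm that the inactive array holds nothing but the replaced disk's partition before stopping it.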
check the status
robh@wdqs2007:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid10 sdh2[8] sda2[0] sde2[4] sdf2[5] sdg2[6] sdc2[2] sdd2[3] sdb2[1]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/7] [UUUUUUU_]
      [>....................]  recovery =  0.0% (132736/937267200) finish=470.6min speed=33184K/sec
      bitmap: 28/28 pages [112KB], 65536KB chunk

unused devices: <none>
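With a rebuild estimated at several hours, it can help to pull just the progress figure out of /proc/mdstat. A minimal sketch follows; the helper name `recovery_progress` is hypothetical, and it only looks for `recovery =` lines as in the output above, so other operations such as a resync are not reported.

```shell
#!/bin/bash
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "<array> <percent>" for each array that is currently recovering.
recovery_progress() {
    awk '
        $2 == ":" { array = $1 }        # remember the most recent array name
        /recovery *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) print array, $i
        }
    '
}

# Typical use on a live host:
#   recovery_progress < /proc/mdstat
```

For the wdqs2007 output above this prints `md0 0.0%`. No output means no array is in recovery, which is the expected state once the rebuild has finished and the status shows [8/8] [UUUUUUUU].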