User:Razzi/T280132 disk swap


https://phabricator.wikimedia.org/T280132

First let me see how the host (an-worker1100) is doing

razzi@an-worker1100:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
=== RaidStatus completed

Sweet

https://phabricator.wikimedia.org/T280132#7007970

> I did the following:
> - commented the disk in /etc/fstab
> - umounted it manually - sudo umount /var/lib/hadoop/data/k
> - ran puppet to regenerate the list of datadir for yarn and hdfs
> - the yarn nodemanager was down due to this problem, but puppet brought it up again after 3)
Then I uncommented the disk in /etc/fstab and ran:
sudo mount -a
mount: /var/lib/hadoop/data/k: can't find UUID=7bcd4c25-a157-4023-a346-924d4ccee5a0.

Ok, so I guess the disk has a new uuid

ls -l /dev/disk/by-uuid/

Hmm that shows only /dev/sdX, but I don't know which disk it is.

There's also this link: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk

# From the previous commands you should be able to fill in the variables 
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller (Adapter) number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ

So now I'm on step 6, where I want to do something like:

> Add the single disk RAID0 array (use the details from the steps above):
sudo megacli -CfgLdAdd -r0 [32:0] -a0

Given I have:

Adapter #0
...
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up

I will run

sudo megacli -CfgLdAdd -r0 [32:11] -a0

Ok I ran this but got:

Exit Code: 0x1a
razzi@an-worker1100:~$ echo $?
26

Ok I see on this webpage: https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages

0x1a Maximum LDs are already configured
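Side note on matching the two numbers: MegaCLI prints the exit code in hex, while the shell's $? is decimal. A minimal sketch of the conversion (hex_to_dec is just a helper name I made up):

```shell
# Convert a hexadecimal exit code, as MegaCLI prints it, to the decimal
# value that shows up in $?.
# hex_to_dec is a hypothetical helper name, not part of MegaCLI.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x1a   # prints 26, matching the echo $? above
```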

So maybe it's already configured. Let me try to proceed.

Well, I want to figure out which /dev/sd? device it is. I don't know a direct way to tell, but the new disk's uuid should be the one that doesn't show up in /etc/fstab.

for u in $( ls /dev/disk/by-uuid/); do echo $u; cat /etc/fstab | grep $u; done

bash loop did the trick... e97258d2-5661-469a-9d34-56bd84a80714 is the one.

But wait, there's also 91c728b2-0dc9-4755-841c-ecdab46d38ae...

a7ab9126-4ef4-4824-a41c-69b4f8630edb

Hmm these are all dm-0, 1 2... not what I want. Maybe the disk isn't showing up yet
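The loop above prints every uuid and leaves it to me to spot the ones with no fstab line under them. A tidier sketch that prints only the unreferenced uuids, together with the device each one points at, so the dm-N entries are obvious right away. The function name and its two optional arguments are my own invention; the arguments exist only so it can be pointed at copies instead of the live paths.

```shell
# List UUIDs present in a by-uuid directory that no line of an fstab file mentions.
# unreferenced_uuids is a hypothetical helper; both arguments default to the
# real system paths.
unreferenced_uuids() {
    local uuid_dir="${1:-/dev/disk/by-uuid}"
    local fstab="${2:-/etc/fstab}"
    local path uuid
    for path in "$uuid_dir"/*; do
        # Skip the literal pattern if the directory is empty.
        [ -e "$path" ] || [ -L "$path" ] || continue
        uuid="${path##*/}"
        # -F: treat the UUID as a fixed string, -q: no output, only exit status.
        if ! grep -qF -- "$uuid" "$fstab"; then
            # Print the device the symlink resolves to, e.g. /dev/dm-0.
            printf '%s -> %s\n' "$uuid" "$(readlink -f -- "$path")"
        fi
    done
}
```

Running unreferenced_uuids with no arguments checks /dev/disk/by-uuid against /etc/fstab. Like the original loop, it counts a uuid as referenced even if its fstab line is commented out.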

ls /dev/sd? | wc -l gives 23.

Yeah I think the disk isn't showing up. I'll comment on the task

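Ok, I copied the wrong part of the megacli output earlier: slot 11 is a disk that's already online. This is the one I should have looked at: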

Enclosure Device ID: 32
Slot Number: 10
Firmware state: Unconfigured(good), Spun Up

That's the right disk.

razzi@an-worker1100:~$ sudo megacli -CfgLdAdd -r0 [32:10] -a0

Adapter 0: Created VD 10

Adapter 0: Configured the Adapter!!

Now it shows sdl as unused in lsblk:

sdk                     8:160  0  1.8T  0 disk
└─sdk1                  8:161  0  1.8T  0 part  /var/lib/hadoop/data/m
sdl                     8:176  0  1.8T  0 disk
sdm                     8:192  0  1.8T  0 disk
└─sdm1                  8:193  0  1.8T  0 part  /var/lib/hadoop/data/q
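With this many disks, picking the bare one out of the lsblk tree by eye is error-prone. A rough sketch that filters lsblk's raw output for disks with no partition under them. The function name is made up, and it assumes the NAME,TYPE,PKNAME column order shown in the usage comment.

```shell
# Read "NAME TYPE PKNAME" lines on stdin and print disks that have no
# partition listed under them.
# disks_without_partitions is a hypothetical helper name.
disks_without_partitions() {
    awk '
        $2 == "disk" { disks[$1] = 1 }      # remember every whole disk
        $2 == "part" { has_part[$3] = 1 }   # remember parents of partitions
        END {
            for (d in disks)
                if (!(d in has_part)) print d
        }
    ' | sort
}

# Usage on a live host (raw output, no header line):
#   lsblk -rn -o NAME,TYPE,PKNAME | disks_without_partitions
```

A disk that is used whole, without a partition table, would also be listed, so this is only a hint about which device is new.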

Now I want the disk uuid, but it's not showing in blkid or /dev/disk/by-uuid/...

Oh right, it doesn't have a partition yet, and the uuid belongs to the filesystem on the partition.

sudo parted /dev/sdl --script mklabel gpt
sudo parted /dev/sdl --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -L hadoop-k /dev/sdl1
sudo tune2fs -m 0 /dev/sdl1

Now lsblk shows its uuid. cb58c727-dec9-4abf-8b21-3d70a6443b6d

But lsblk isn't showing any space usage or mount point for it...

sdl
└─sdl1             ext4              hadoop-k        cb58c727-dec9-4abf-8b21-3d70a6443b6d
sdm
└─sdm1             ext4              hadoop-q        766882b0-078f-4bc3-b118-3f8456446b52    380.2G    79% /var/lib/hadoop/data/q

It looks like it does mount, but doesn't stay.

Perhaps https://unix.stackexchange.com/questions/474743/mount-command-finishes-successully-but-disk-is-not-mounted

Yep

Apr 29 15:38:09 an-worker1100 kernel: [1810702.143355] EXT4-fs (sdl1): mounted filesystem with ordered data mode. Opts: (null)
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-7bcd4c25\x2da157\x2d4023\x2da346\x2d924d4ccee5a0.device. Stopping, too.
Apr 29 15:38:09 an-worker1100 systemd[1]: Unmounting /var/lib/hadoop/data/k...
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Succeeded.

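So systemd still has the mount unit bound to the device unit for the old uuid (7bcd4c25-...), which no longer exists, and it unmounts the disk right after it gets mounted.

systemctl daemon-reexec might fix it. It did!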