User:Razzi/an-master reimaging

https://phabricator.wikimedia.org/T278423

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#High_Availability

razzi@an-master1002:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
active

looks good

For what to do during the reimage, I'll refer to cookbooks/sre/hadoop/roll-restart-masters.py:

       logger.info("Restarting Yarn Resourcemanager on Master.")
       hadoop_master.run_sync('systemctl restart hadoop-yarn-resourcemanager')

OK, so we can `systemctl stop hadoop-yarn-resourcemanager` on the standby.
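Before and after stopping it, we can check which ResourceManager is active. A quick sketch; I'm assuming the Yarn RM HA ids mirror the hdfs ones (an-master100x-eqiad-wmnet), which yarn-site.xml would confirm:

# assumption: RM ids match the hdfs haadmin ids; check yarn-site.xml if not
sudo -u yarn /usr/bin/yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u yarn /usr/bin/yarn rmadmin -getServiceState an-master1002-eqiad-wmnet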

       logger.info("Restart HDFS Namenode on the master.")
       hadoop_master.run_async(
           'systemctl restart hadoop-hdfs-zkfc',
           'systemctl restart hadoop-hdfs-namenode')

And correspondingly, to stop:

systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-hdfs-namenode

It's similar to the comment here: https://phabricator.wikimedia.org/T265126#7008232
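Either way, after stopping the daemons on the active, it's worth confirming that the other node actually took over before touching anything else (same check as at the top of this page):

# 1002 should now report "active"
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet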

One more service:

logger.info("Restart MapReduce historyserver on the master.")
hadoop_master.run_sync('systemctl restart hadoop-mapreduce-historyserver')

So it's a good idea to `systemctl stop hadoop-mapreduce-historyserver` on the active as well.
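To confirm everything is down on a node before reimaging it, something like:

# each unit should report "inactive" once stopped
systemctl is-active hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-yarn-resourcemanager hadoop-mapreduce-historyserver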

summary

# prep
Back up /srv/hadoop/name; a copy in my home directory on a stat box would do (sketch below)
Confirm with Luca / Andrew that the plan looks good, and let them know we're ready to begin
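A minimal sketch of the backup step; stat1007 as the destination is an assumption, and the transfer method (transfer.py? plain rsync?) is still to be decided:

# on an-master1001: snapshot the namenode metadata dir (it's small)
sudo tar -czf /home/razzi/hadoop-name-$(date +%Y%m%d).tar.gz /srv/hadoop/name
# then copy the tarball off the host, e.g. to my home on stat1007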

# pre-check
Check for any hadoop-related alarms

# failover and back
Check that 1001 is active and 1002 is standby
Trigger failover on an-master1001 by stopping its active daemons:
 - systemctl stop hadoop-hdfs-namenode
 - systemctl stop hadoop-yarn-resourcemanager

Check that 1002 became active: sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

Check metrics: https://grafana.wikimedia.org/d/000000585/hadoop
 - HDFS Namenode
 - Yarn Resource Manager

Then, once the daemons on an-master1001 have been started again, on an-master1002:
 - systemctl stop hadoop-hdfs-namenode
 - systemctl stop hadoop-yarn-resourcemanager

# Start the reimage
disable puppet on an-master1001 and an-master1002
merge puppet patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682785/
run puppet on install1003 to ensure this change is picked up
Failover hdfs and yarn to an-master1002 by stopping all hadoop daemons on an-master1001:
 - systemctl stop hadoop-hdfs-namenode
 - systemctl stop hadoop-yarn-resourcemanager
 - systemctl stop hadoop-hdfs-zkfc
 - systemctl stop hadoop-mapreduce-historyserver
Check that an-master1002 is active as expected, wait a moment, and check with the team to make sure everything looks healthy
Start reimage on cumin1001: sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
Since this uses reuse-parts-test, we'll have to connect to the console and confirm that the partitions look good (potentially destructive step; check with Luca before proceeding)
Once the machine comes up, confirm that the proper OS version is installed, hadoop services are running, the /srv partition has data, and the node is in standby state. Since the machine was down, the hadoop namenode will need to catch up; this should show up in hdfs under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=41&orgId=1
Once everything looks good, manually fail over 1002 -> 1001 (do this without stopping hdfs, so that things can switch back if necessary): sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
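A sketch of the post-reimage checks, reusing commands from above (nothing new beyond the hostnames in this plan):

# on the freshly reimaged an-master1001
cat /etc/debian_version
systemctl is-active hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-yarn-resourcemanager
df -h /srv
# after the manual failover, both sides should report the expected state
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet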

# Repeat reimage on an-master1002
Stop hadoop daemons on 1002, reimage, confirm that 1002 comes back as a healthy standby, done!

Risk: an-master1002 does not work as active

Mitigation: do a test failover to an-master1002 and ensure everything is working before reimaging an-master1001, so that we can switch back if necessary

Risk: active fails while standby is down

Mitigation: back up /srv/hadoop/name? Since hdfs is constantly written to, the backup would get out of date quickly, but it would be better than losing all data. We could set up another standby, an-master1003, perhaps temporarily as a virtual machine. Realistically this is a low-probability scenario, but it's worth considering since it's the worst case and could lead to data loss
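For a point-in-time copy that doesn't involve touching the namenode's data directory, hdfs can also download the latest fsimage checkpoint; a sketch, with an arbitrary destination path (note it won't include edits since the last checkpoint):

# downloads the most recent fsimage from the namenode
mkdir -p /home/razzi/namenode-backup
sudo -u hdfs /usr/bin/hdfs dfsadmin -fetchImage /home/razzi/namenode-backup/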

Risk: hadoop doesn't work on the latest Debian 10

Mitigation: an-test-master is already running Debian 10, so we have some confidence this won't happen; if it does, we can go over the steps to reimage back to Debian 9.13

Current an-master disk configuration:

razzi@an-master1001:~$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk
├─sda1                         8:1    0  46.6G  0 part
│ └─md0                        9:0    0  46.5G  0 raid1 /
├─sda2                         8:2    0   954M  0 part
│ └─md1                        9:1    0 953.4M  0 raid1 [SWAP]
└─sda3                         8:3    0 176.1G  0 part
  └─md2                        9:2    0   176G  0 raid1
    └─an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk
├─sdb1                         8:17   0  46.6G  0 part
│ └─md0                        9:0    0  46.5G  0 raid1 /
├─sdb2                         8:18   0   954M  0 part
│ └─md1                        9:1    0 953.4M  0 raid1 [SWAP]
└─sdb3                         8:19   0 176.1G  0 part
  └─md2                        9:2    0   176G  0 raid1
    └─an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv

versus modules/install_server/files/autoinstall/partman/reuse-raid1-2dev.cfg:

# this workarounds LP #1012629 / Debian #666974
# it makes grub-installer to jump to step 2, where it uses bootdev
d-i    grub-installer/only_debian      boolean         false
d-i    grub-installer/bootdev  string  /dev/sda /dev/sdb

d-i    partman/reuse_partitions_recipe         string \
                /dev/sda|1 biosboot ignore none|2 raid ignore none, \
                /dev/sdb|1 biosboot ignore none|2 raid ignore none, \
                /dev/mapper/*-root|1 ext4 format /, \
                /dev/mapper/*-srv|1 ext4 keep /srv

So we'll want to set the host to reuse-parts-test.cfg first, so we can confirm the recipe from the console before anything is written

High level plan:

- check that everything is healthy: nodes on grafana, ensure active / standby is what we expect

- merge patch to set an-master1002 to reuse-parts-test.cfg with a custom partman/custom/reuse-analytics-hadoop-master.cfg. linux-host-entries.ttyS1-115200 already does not have a pxeboot entry, so it will use Buster upon reimaging

^- Do we want to add logical volumes for swap and root?

- stop hadoop daemons on an-master1002, downtime the node (a possible downtime invocation is sketched after this list), disable puppet

- run reimage, wait for node to come online, ensure things are healthy

- failover to newly-reimaged node, ensure things are still working

- repeat the steps to stop, update, and reimage an-master1001
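For the downtime step above, a possible invocation from cumin1001; I haven't double-checked the cookbook's current flags, so treat this as a sketch:

# flags assumed from memory; confirm with: sudo cookbook sre.hosts.downtime --help
sudo cookbook sre.hosts.downtime --hours 4 -r "an-master reimage T278423" 'an-master1002.eqiad.wmnet'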

OK, solved the mystery of sda1 / sdb1 on an-test-master1001: they are BIOS boot partitions.

razzi@an-test-master1001:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 447.1G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 371.3G  0 lvm   /srv
sdb              8:16   0 447.1G  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 371.3G  0 lvm   /srv
razzi@an-test-master1001:~$ sudo fdisk -l
Disk /dev/sdb: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Disk model: MZ7LH480HAHQ0D3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C5C638C5-0FFB-4A5C-A4C5-53A22860E315

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048    585727    583680   285M BIOS boot
/dev/sdb2  585728 937701375 937115648 446.9G Linux RAID

Writing partman/custom/reuse-analytics-hadoop-master.cfg

Currently we have

modules/install_server/files/autoinstall/partman/custom/reuse-dbprov.cfg

d-i	partman/reuse_partitions_recipe	string \
		 /dev/sda|1 ext4 format /|2 linux-swap ignore none|3 unknown ignore none, \
		 /dev/mapper/*|1 xfs keep /srv, \
		 /dev/sdb|1 xfs keep /srv/backups/dumps/ongoing

which shows how to refer to / as ext4 without a `root` logical volume (though we might still want one).
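Based on that syntax and the an-master lsblk output above, here is a first, untested guess at reuse-analytics-hadoop-master.cfg; the /dev/md* lines are an assumption about how the recipe addresses raw md arrays, which is exactly what the reuse-parts-test.cfg run should confirm:

# DRAFT, unverified: modeled on reuse-raid1-2dev.cfg and reuse-dbprov.cfg above
d-i	grub-installer/only_debian	boolean	false
d-i	grub-installer/bootdev	string	/dev/sda /dev/sdb

d-i	partman/reuse_partitions_recipe	string \
		/dev/sda|1 raid ignore none|2 raid ignore none|3 raid ignore none, \
		/dev/sdb|1 raid ignore none|2 raid ignore none|3 raid ignore none, \
		/dev/md0|1 ext4 format /, \
		/dev/md1|1 linux-swap ignore none, \
		/dev/mapper/*-srv|1 ext4 keep /srv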