razzi@an-master1002:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
active
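A minimal sketch of how the active / standby check above could be scripted for both masters. The function names and the injectable `runner` are hypothetical (they are not part of any cookbook); the only real command used is the `haadmin -getServiceState` call shown above.

```python
import subprocess
from typing import Callable, Dict, List

# Service IDs as they appear in the haadmin command above.
NAMENODES = ["an-master1001-eqiad-wmnet", "an-master1002-eqiad-wmnet"]


def run_command(argv: List[str]) -> str:
    """Run a command and return its stdout with surrounding whitespace removed."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def get_service_state(service_id: str,
                      runner: Callable[[List[str]], str] = run_command) -> str:
    """Ask haadmin for the state of one namenode ('active' or 'standby')."""
    return runner(["sudo", "-u", "hdfs", "/usr/bin/hdfs", "haadmin",
                   "-getServiceState", service_id])


def ha_states(runner: Callable[[List[str]], str] = run_command) -> Dict[str, str]:
    """Map each namenode service ID to its reported state."""
    return {sid: get_service_state(sid, runner) for sid in NAMENODES}


def is_healthy_pair(states: Dict[str, str]) -> bool:
    """True when the two namenodes are exactly one active and one standby."""
    return sorted(states.values()) == ["active", "standby"]
```

Passing a fake `runner` makes it possible to try the logic on a laptop without touching the cluster.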
In terms of what to do when reimaging, I will refer to cookbooks/sre/hadoop/roll-restart-masters.py
logger.info("Restarting Yarn Resourcemanager on Master.")
hadoop_master.run_sync('systemctl restart hadoop-yarn-resourcemanager')
Ok, so we can `systemctl stop hadoop-yarn-resourcemanager` on the standby.
logger.info("Restart HDFS Namenode on the master.")
hadoop_master.run_async(
    'systemctl restart hadoop-hdfs-zkfc',
    'systemctl restart hadoop-hdfs-namenode')
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-hdfs-namenode
It's similar to the comment here: https://phabricator.wikimedia.org/T265126#7008232
One more service:
logger.info("Restart MapReduce historyserver on the master.")
hadoop_master.run_sync('systemctl restart hadoop-mapreduce-historyserver')
So it's a good idea to `systemctl stop hadoop-mapreduce-historyserver` on the active.
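To keep the list of units in one place, here is a small sketch that only prints the stop commands collected from the cookbook excerpts above; it does not run anything. The function name and the `include_historyserver` flag are hypothetical, and the order simply mirrors the order in these notes rather than a verified shutdown sequence.

```python
from typing import List


def stop_commands(include_historyserver: bool = True) -> List[str]:
    """Return the systemctl stop commands for the Hadoop master daemons.

    The unit names come from roll-restart-masters.py as quoted in these notes.
    The historyserver is optional because the notes only mention stopping it
    on the active master.
    """
    units = [
        "hadoop-yarn-resourcemanager",
        "hadoop-hdfs-zkfc",
        "hadoop-hdfs-namenode",
    ]
    if include_historyserver:
        units.append("hadoop-mapreduce-historyserver")
    return [f"systemctl stop {unit}" for unit in units]


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before use.
    for command in stop_commands():
        print(command)
```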
# prep
backup /srv/hadoop/name, could be my home directory on a statbox
confirm with luca / andrew that the plan looks good, and let them know we're ready to begin

# pre-check
check for any hadoop-related alarms

# failover and back
Check that 1001 is active and 1002 is standby
Do failover on an-master1001:
- systemctl stop hadoop-hdfs-namenode
- systemctl stop hadoop-yarn-resourcemanager
Check that 1002 became active:
  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
Check metrics: https://grafana.wikimedia.org/d/000000585/hadoop
  HDFS Namenode
  Yarn Resource Manager
on an-master1002:
- systemctl stop hadoop-hdfs-namenode
- systemctl stop hadoop-yarn-resourcemanager

# Start the reimage
disable puppet on an-master1001 and an-master1002
merge puppet patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682785/
run puppet on install1003 to ensure this change is picked up
Failover hdfs and yarn to an-master1002:
- systemctl stop hadoop-hdfs-namenode
- systemctl stop hadoop-yarn-resourcemanager
- systemctl stop hadoop-hdfs-zkfc
- systemctl stop hadoop-mapreduce-historyserver
Check that an-master1002 is active as expected, wait a moment, check with team to make sure everything looks healthy
Start reimage on cumin1001:
  sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
Since this has reuse-partitions-test, will have to connect to console and confirm that partitions look good (potentially destructive step, check with Luca before proceeding)
Once machine comes up, confirm that proper os version is installed, hadoop services are running, /srv partition has data, node is in standby state.
Since machine was down, hadoop namenode service will need to catch up. This should show in hdfs under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=41&orgId=1
Once everything looks good, manually failover 1002 -> 1001 (do this without stopping hdfs, so that if necessary things can switch back):
  sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

# Repeat reimage on an-master1002
Stop hadoop daemons on 1002, reimage, confirm that 1002 comes back as healthy standby, done!
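The manual failover step could be guarded so that the command is only built when the states are what we expect. This is a sketch under assumptions: `failover_command` is a hypothetical helper, and the `states` dict is assumed to come from running `haadmin -getServiceState` against each service ID beforehand.

```python
from typing import Dict, List


def failover_command(states: Dict[str, str], source: str, target: str) -> List[str]:
    """Build the haadmin -failover argv, refusing if states are unexpected.

    `states` maps a namenode service ID to 'active' or 'standby'.
    """
    if states.get(source) != "active":
        raise ValueError(f"{source} is not active (state: {states.get(source)!r})")
    if states.get(target) != "standby":
        raise ValueError(f"{target} is not standby (state: {states.get(target)!r})")
    return ["sudo", "-u", "hdfs", "/usr/bin/hdfs", "haadmin",
            "-failover", source, target]


if __name__ == "__main__":
    # Example states after an-master1001 has been reimaged and rejoined as standby.
    example = {
        "an-master1002-eqiad-wmnet": "active",
        "an-master1001-eqiad-wmnet": "standby",
    }
    print(" ".join(failover_command(example,
                                    "an-master1002-eqiad-wmnet",
                                    "an-master1001-eqiad-wmnet")))
```

Raising instead of silently continuing keeps the operator in the loop if the reimaged node did not come back as standby.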
Risk: an-master1002 does not work as active
Mitigation: do a test failover to an-master1002 and ensure everything is working before reimaging an-master1001, so that we can switch back if necessary
Risk: active fails while standby is down
Mitigation: Back up /srv/hadoop/name? Since hdfs is constantly written to, the backup would get out of date, but it would be better than losing all data. We could also set up another standby, an-master1003, perhaps temporarily as a virtual machine. Realistically this is a low-risk scenario, but it is worth considering since it would be the worst case and could lead to data loss
Risk: hadoop doesn't work on latest debian 10
Mitigation: an-test-master is already running on debian 10, so we have some confidence this will not happen; we can go over steps to reimage back to debian 9.13
Current an-master disk configuration:
razzi@an-master1001:~$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk
├─sda1                         8:1    0  46.6G  0 part
│ └─md0                        9:0    0  46.5G  0 raid1 /
├─sda2                         8:2    0   954M  0 part
│ └─md1                        9:1    0 953.4M  0 raid1 [SWAP]
└─sda3                         8:3    0 176.1G  0 part
  └─md2                        9:2    0   176G  0 raid1
    └─an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk
├─sdb1                         8:17   0  46.6G  0 part
│ └─md0                        9:0    0  46.5G  0 raid1 /
├─sdb2                         8:18   0   954M  0 part
│ └─md1                        9:1    0 953.4M  0 raid1 [SWAP]
└─sdb3                         8:19   0 176.1G  0 part
  └─md2                        9:2    0   176G  0 raid1
    └─an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv
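To make the layering explicit (which mount point sits on which raid / lvm device), here is a sketch that walks `lsblk --json` style output. The sample JSON is hand-abbreviated from the listing above, not captured from the host, and `stack_for_mount` is a hypothetical helper. Key names are an assumption: older util-linux reports a `mountpoint` string, while newer versions may report a `mountpoints` list instead.

```python
import json
from typing import List, Optional, Tuple

# Hand-abbreviated from the lsblk listing above (sda only, swap omitted).
SAMPLE = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "mountpoint": null, "children": [
    {"name": "sda1", "type": "part", "mountpoint": null, "children": [
      {"name": "md0", "type": "raid1", "mountpoint": "/"}]},
    {"name": "sda3", "type": "part", "mountpoint": null, "children": [
      {"name": "md2", "type": "raid1", "mountpoint": null, "children": [
        {"name": "an--master1001--vg-srv", "type": "lvm", "mountpoint": "/srv"}]}]}
  ]}
]}
"""

Layer = Tuple[str, str]  # (device name, device type)


def stack_for_mount(devices: list, mountpoint: str,
                    path: Tuple[Layer, ...] = ()) -> Optional[List[Layer]]:
    """Return the chain of devices from the disk down to `mountpoint`, or None."""
    for dev in devices:
        here = path + ((dev["name"], dev["type"]),)
        if dev.get("mountpoint") == mountpoint:
            return list(here)
        found = stack_for_mount(dev.get("children", []), mountpoint, here)
        if found:
            return found
    return None


if __name__ == "__main__":
    devices = json.loads(SAMPLE)["blockdevices"]
    for mount in ("/", "/srv"):
        print(mount, "->", stack_for_mount(devices, mount))
```

On this layout `/` sits directly on a raid1 device while `/srv` is a logical volume on top of raid1, which is the difference the partman recipe has to account for.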
# this workarounds LP #1012629 / Debian #666974
# it makes grub-installer to jump to step 2, where it uses bootdev
d-i grub-installer/only_debian boolean false
d-i grub-installer/bootdev string /dev/sda /dev/sdb

d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 biosboot ignore none|2 raid ignore none, \
    /dev/sdb|1 biosboot ignore none|2 raid ignore none, \
    /dev/mapper/*-root|1 ext4 format /, \
    /dev/mapper/*-srv|1 ext4 keep /srv
So we'll want to set it to reuse-parts-test.cfg to confirm
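Since the reimage is potentially destructive, it may help to double-check which mount points a recipe would format and which it would keep. The sketch below parses a `reuse_partitions_recipe` string; the field layout (`device|number filesystem action mountpoint`, entries separated by commas) is inferred from the recipes quoted in these notes and is not taken from the partman code itself. `PartitionRule`, `parse_recipe` and `mountpoints_by_action` are hypothetical names.

```python
from typing import List, NamedTuple

# The recipe quoted above, with the line continuations joined.
RECIPE = (
    "/dev/sda|1 biosboot ignore none|2 raid ignore none, "
    "/dev/sdb|1 biosboot ignore none|2 raid ignore none, "
    "/dev/mapper/*-root|1 ext4 format /, "
    "/dev/mapper/*-srv|1 ext4 keep /srv"
)


class PartitionRule(NamedTuple):
    device: str
    number: int
    filesystem: str
    action: str       # 'format', 'keep' or 'ignore' in the examples seen here
    mountpoint: str


def parse_recipe(recipe: str) -> List[PartitionRule]:
    """Split a recipe into one rule per partition (assumed format, see above)."""
    rules = []
    for device_spec in recipe.split(","):
        device, *parts = [field.strip() for field in device_spec.strip().split("|")]
        for part in parts:
            number, filesystem, action, mountpoint = part.split()
            rules.append(PartitionRule(device, int(number), filesystem, action, mountpoint))
    return rules


def mountpoints_by_action(rules: List[PartitionRule], action: str) -> List[str]:
    """List the mount points whose rule has the given action."""
    return [rule.mountpoint for rule in rules if rule.action == action]


if __name__ == "__main__":
    rules = parse_recipe(RECIPE)
    print("format:", mountpoints_by_action(rules, "format"))
    print("keep:  ", mountpoints_by_action(rules, "keep"))
```

For the recipe above this reports `/` as formatted and `/srv` as kept, which is what we want before confirming on the console.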
High level plan
- check that everything is healthy: nodes on grafana, ensure active / standby is what we expect
- merge patch to set an-master1002 to reuse-parts-test.cfg with a custom partman/custom/reuse-analytics-hadoop-master.cfg. linux-host-entries.ttyS1-115200 already does not have a pxeboot entry so it will use buster upon reimaging
^- Do we want to add logical volumes for swap and root?
- stop hadoop daemons on an-master1002, downtime node, disable puppet
- run reimage, wait for node to come online, ensure things are healthy
- failover to newly-reimaged node, ensure things are still working
- repeat the steps to stop, update, and reimage an-master1001
Ok solved the mystery of sda1 / sdb1 on an-test-master1001: they are a bios boot partition.
razzi@an-test-master1001:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 447.1G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 371.3G  0 lvm   /srv
sdb              8:16   0 447.1G  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 371.3G  0 lvm   /srv
razzi@an-test-master1001:~$ sudo fdisk -l
Disk /dev/sdb: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Disk model: MZ7LH480HAHQ0D3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C5C638C5-0FFB-4A5C-A4C5-53A22860E315

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048    585727    583680   285M BIOS boot
/dev/sdb2  585728 937701375 937115648 446.9G Linux RAID
Currently we have
d-i partman/reuse_partitions_recipe string \
    /dev/sda|1 ext4 format /|2 linux-swap ignore none|3 unknown ignore none, \
    /dev/mapper/*|1 xfs keep /srv, \
    /dev/sdb|1 xfs keep /srv/backups/dumps/ongoing
which shows how to refer to / as ext4 without a logical volume `root` (though we might still want to do this).