User:Razzi/new plan for reimaging an-masters

update the haadmin commands to use kerberos-run-command

respond: purpose of failover: just to test that both nodes are healthy

gotta add timers to systemctl commands. testing this out now

> sudo systemctl list-units 'camus-*.timer'

ok looks good
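
A rough sketch of how the timer step could look in the plan: list the matching timers, then stop them with the same glob. The `run` and `stop_timers` helpers and the `DRY_RUN` switch are made up for these notes, not existing tooling. As far as I know, `systemctl stop` with a glob only matches units that are currently loaded, so the `list-units` output is worth checking first.

```shell
# Hypothetical helpers for these notes. DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List, then stop, every loaded timer unit matching a glob pattern.
stop_timers() {
    pattern="$1"
    run sudo systemctl list-units "$pattern"
    run sudo systemctl stop "$pattern"
}
```

Usage would be something like `stop_timers 'camus-*.timer'`, with `DRY_RUN=0` once the printed commands look right.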

need to plan to contact search team to ask to pause

remove part of plan for stopping oozie coordinators

need to make sense of

This command is a little bit brutal, what we could do is something like:
- check `profile::analytics::cluster::hadoop::yarn_capacity_scheduler` and add something like `'yarn.scheduler.capacity.root.default.state' => 'STOPPED'`
- send puppet patch and merge it (but at this point we are with puppet disabled, so you either add it manually or you merge it beforehand).
- execute `sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues`

The above will instruct the Yarn RMs to not accept any new job.
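
To make sense of the last step, here is a small sketch that refreshes the queues and then asks Yarn for the queue status, so the STOPPED state can be confirmed. It assumes the capacity scheduler change is already in place on the RMs. The `run` and `refresh_and_check_queue` helpers and the `DRY_RUN` switch are hypothetical, and the `yarn queue -status` check is my addition, not part of the suggestion above.

```shell
# Hypothetical helpers for these notes. DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reload the scheduler config on the RMs, then show the queue status.
# Assumes the queue state was already set to STOPPED in the scheduler config.
refresh_and_check_queue() {
    queue="${1:-default}"
    run sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
    run sudo -u yarn kerberos-run-command yarn yarn queue -status "$queue"
}
```

Calling `refresh_and_check_queue` with no argument targets the `default` queue named in the setting above.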

! need to learn about transfer.py and incorporate it into plan

Add to plan step to stop hadoop-namenode timers

Add to plan step to check for hdfs / yarn processes before changing uids/gids
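
A possible shape for that check, as a sketch only: look for any process still owned by the given users and refuse to continue if one is found. The function names are made up for these notes; `pgrep -u` matches on effective user ID.

```shell
# Print the PIDs owned by a user (empty output if there are none).
list_pids_for_user() {
    pgrep -u "$1" || true
}

# Return 0 only if none of the given users own a running process.
check_no_processes() {
    status=0
    for user in "$@"; do
        pids="$(list_pids_for_user "$user")"
        if [ -n "$pids" ]; then
            # $pids is left unquoted on purpose so the PIDs print on one line.
            echo "$user still has running processes:" $pids
            status=1
        fi
    done
    return "$status"
}
```

In the plan this could be `check_no_processes hdfs yarn` right before the uid/gid change, continuing only when it returns 0.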

need to make sense of

> Reimage an-master1002
> - `sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet`
> - Will have to confirm that the partitions since we're using reuse-parts-test

At this point I would probably think about checking that all daemons are ok, logs are fine, metrics, etc. It is ok to do the maintenance in two separate days, to leave some time for any unexpected issue to come up. We could also test a failover from 1001 (still not reimaged) to 1002 and leave it running for a few hours monitoring metrics. I know that on paper we don't expect any issue from previous tests, but this is production and there may be some corner cases that we were not able to test before.

So to summarize - at this point I'd check the status of all the services and just re-enable timers/jobs/etc. Then after a bit I'd failover to 1002 and test for a few hours if everything works as expected (heap pressure, logs, etc.)
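
A sketch of the check and failover steps using `hdfs haadmin` through `kerberos-run-command`. The `sudo -u hdfs kerberos-run-command hdfs hdfs ...` form is assumed by analogy with the yarn command earlier in these notes. The helper names and the `DRY_RUN` switch are hypothetical, and the NameNode service IDs are left as arguments because I still need to look up the real ones.

```shell
# Hypothetical helpers for these notes. DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run an hdfs subcommand as the hdfs user with Kerberos credentials (assumed form).
hdfs_cmd() {
    run sudo -u hdfs kerberos-run-command hdfs hdfs "$@"
}

# Print the HA state (active/standby) of each NameNode service ID given.
check_namenodes() {
    for service_id in "$@"; do
        hdfs_cmd haadmin -getServiceState "$service_id"
    done
}

# Fail over from the first service ID to the second.
failover_namenode() {
    hdfs_cmd haadmin -failover "$1" "$2"
}
```

The idea would be to run `check_namenodes` on both service IDs before and after `failover_namenode`, then watch heap pressure and logs for a few hours as suggested above.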

Yes, 1001 should be active here

Update plan to restart puppet

Update plan to use stat100x for backup