User:Bstorm/plans/labstore-upgrades

This is not actually how this all went down. This was brainstorming.

Primary cluster

The primary cluster is currently on 1G Ethernet and doesn't seem to have a definition in install_server.

We may want to add IPv6, and we should probably put the cluster on 10G Ethernet. To that end, labstore1004 is already in a 10G rack, but labstore1005 is not and would need to move. I initially doubted that they had 10G Ethernet cards installed, but they do, so this should be done, which implies we should re-image the servers.

There are three hardware RAID volumes on the devices. One, around a TB, is for root and swap. The other two are data volumes for NFS/DRBD. I think we should leave the name change to cloudstore out of this round, since it would add a variable that complicates the change for clients.

That seems like it might be relatively simple to re-image safely. So we should:

  • Set up a partman recipe to reimage from
  • Stop backups
  • Stop puppet on labstore1004
  • Re-image labstore1005 to stretch, merging a config enabling IPv6 for firewall and DNS
  • Fail over to labstore1005
  • Re-image labstore1004 and enable puppet
  • Fail back to labstore1004
  • Re-enable backups once checks are complete

Dumps cluster

I confirmed that the nfsd-ldap custom package doesn't apply to this cluster because this cluster has all_squash set. That allows us to upgrade to buster instead of stretch in this case. If upgrading in place, we'd best upgrade from jessie to stretch and then from stretch to buster.

If puppet was already installed but cannot finish its postinst script, because that script triggers a puppet run which in turn tries to run the same postinst script, and so on, run "rm /var/lib/dpkg/info/puppet.postinst" to clear that up at least.
  • Fail over all services to labstore1006
  • downtime labstore1007
  • disable puppet on labstore1007
  • apt-get upgrade and dist-upgrade...check that all is ok
  • switch the main sources.list to stretch
  • upgrade/dist-upgrade
  • enable puppet and run until things work right
  • fix whatever is broken
  • On facter 3.11, this will include running sudo rm /opt/puppetlabs/facter/cache/cached_facts/operating\ system
  • validate the config, reboot, etc.
  • upgrade again to buster
  • Fail all services over to labstore1007
  • Upgrade labstore1006 (wash, rinse, repeat)
  • Return to the usual combined services spread
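
The in-place steps above for a single host could be sketched roughly as follows. This is only a sketch under assumptions: the run and upgrade_hop helpers and the DRY_RUN switch are hypothetical names made up for illustration, the release codename is assumed to appear literally in /etc/apt/sources.list, and plain "puppet agent" is used where local wrapper scripts might be preferred. With DRY_RUN left at its default of 1, the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch of one in-place release hop (e.g. jessie -> stretch) on one host.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: hypothetical helper that prints the command in dry-run mode
# and executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upgrade_hop FROM TO: hypothetical helper. Brings the current release
# fully up to date, repoints sources.list at the next release, then
# upgrades again.
upgrade_hop() {
    from="$1"
    to="$2"
    run apt-get update
    run apt-get -y upgrade
    run apt-get -y dist-upgrade
    # Assumption: the codename appears literally in the main sources.list.
    run sed -i "s/${from}/${to}/g" /etc/apt/sources.list
    run apt-get update
    run apt-get -y upgrade
    run apt-get -y dist-upgrade
}

run puppet agent --disable "in-place upgrade"
upgrade_hop jessie stretch
run puppet agent --enable
```

The second hop would then be upgrade_hop stretch buster, after puppet has been run until it is clean and the host has been validated and rebooted. Setting DRY_RUN=0 makes the script actually execute the commands, so that should only be done on the host that is currently failed out of service.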