Portal:Data Services/Admin/Runbooks/Failover an NFS cluster

This page covers the bare-metal NFS servers providing wiki dumps. Other NFS servers have been moved off of dedicated hardware; the failover process for those is documented at Portal:Data Services/Admin/Runbooks/Create an NFS server.
The procedures in this runbook require admin permissions to complete.

Overview

There are two read/write NFS clusters as of this writing (15 July 2021):

  • primary (tools and all home/project dirs other than maps) - labstore1004 and labstore1005
  • secondary (maps home/project dirs and scratch) - cloudstore1008 and cloudstore1009

The main difference between the two is that the primary cluster is backed up and the secondary is not (see also Portal:Data_Services/Admin/Shared_storage).

Error / Incident

Any time the active NFS server must be taken offline for maintenance or errors, a failover should be considered. To decide whether or not to fail over, consider two things:

  • A failover can take several minutes to complete (at least at this time), mostly spent waiting for the unmount on the active host and for NFS startup on the replica. If your change will only interrupt NFS access for a couple of minutes, it may not be worth the trouble of failing over.
  • Is it safe to keep the data live on this host? DRBD replication is active-passive. When you fail over, the active host will unmount the volume and become a replica. A replica host can even suffer a full data loss and be completely resynced from scratch later, if need be (see the DRBD operations doc that will be written soon). The active host cannot tolerate such an event, because the damaged data could be replicated. If a data-loss event began on a DRBD-primary/active host, you'd want to follow another playbook that is being written soon.

Process

Disable alerts

Downtime both servers in the cluster before proceeding. You will set off alerts otherwise.
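
For example, one way to set the downtime is the sre.hosts.downtime cookbook from a cumin host; the hostname, duration, flags, and host pattern below are assumptions to adapt, and the Icinga web UI works just as well:

[myname@cumin1001]:~ $ sudo cookbook sre.hosts.downtime --hours 4 -r "NFS cluster failover" 'labstore100[4-5].eqiad.wmnet'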

Stop puppet

sudo -i puppet agent --disable "<myname>: failing over cluster for maintenance"

Puppet may revert changes or do other odd things if you leave it running.

If this is the primary cluster, disable backups

cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet run backups against labstore1005 only. Backups should only run while labstore1005 is the standby server, so we disable them during failovers. Check whether a backup is currently running with systemctl status against the services mentioned below.

On cloudbackup2001.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-tools-project.service

On cloudbackup2002.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-misc-project.service

If a backup is currently running, make a call whether to stop it with systemctl (it runs only weekly) or let it finish before proceeding; an example status check is sketched below.
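
For example, a quick check on each backup host (service names as listed in the steps above):

[myname@cloudbackup2001]:~ $ sudo systemctl status block_sync-tools-project.service
[myname@cloudbackup2002]:~ $ sudo systemctl status block_sync-misc-project.service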

Check DRBD

Back on the NFS server, you'll want to be sure that DRBD is in a good, replicated state before you screw with it. Run:

  • sudo drbd-overview

A good result looks like:

[bstorm@labstore1004]:~ $ sudo drbd-overview
 1:test/0   Connected Primary/Secondary UpToDate/UpToDate /srv/test  ext4 9.8G 535M 8.7G 6%
 3:misc/0   Connected Primary/Secondary UpToDate/UpToDate /srv/misc  ext4 5.0T 1.8T 3.0T 38%
 4:tools/0  Connected Primary/Secondary UpToDate/UpToDate /srv/tools ext4 8.0T 5.7T 2.0T 75%

If the volumes listed (other than test, which really never matters) are not "Connected", stop and sound the alarm. The next column (Primary/Secondary) tells you the role of this server (Primary = active server, Secondary = standby replica). When you start, labstore1004 or cloudstore1008 should be the "Primary" server for DRBD. Without a significant puppet reconfiguration, we only fail over temporarily.

If all volumes show UpToDate/UpToDate, you should be good to go.

Bring down the active server's processes

Active-active NFS is a myth. This is a STONITH process (shoot the other node in the head). Since we aren't using corosync/pacemaker or similar at this time, we use a manual script nfs-manage.

Run sudo /usr/local/sbin/nfs-manage down

This may not finish cleanly. I'm sorry. If it does not manage to cleanly unmount the volumes, you can actually just run it again and it likely will work. It doesn't do anything that will cause problems if run a second time.

Ensure that this finishes cleanly before proceeding, but know that it shuts down NFS, so the clock is ticking.

Go ahead and check sudo drbd-overview to ensure both sides are now "Secondary" before moving on (roughly as sketched below). Two Primaries == split-brain.
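
For reference, a healthy post-shutdown overview on the old active host should look roughly like this (volume names from the earlier example; the mount columns disappear because the volumes are now unmounted):

[myname@labstore1004]:~ $ sudo drbd-overview
 1:test/0   Connected Secondary/Secondary UpToDate/UpToDate
 3:misc/0   Connected Secondary/Secondary UpToDate/UpToDate
 4:tools/0  Connected Secondary/Secondary UpToDate/UpToDate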

Bring up NFS on the standby server

On the standby server (labstore1005 or cloudstore1009 depending on which cluster), you should simply have to run sudo /usr/local/sbin/nfs-manage up. It takes a distressing amount of time for nfs-server to start, but it normally will.

Check sudo drbd-overview to verify that all is well, and the standby server now thinks it is "Primary".
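
For example, a quick sanity check on the standby (now active) host might be:

[myname@labstore1005]:~ $ sudo drbd-overview                 # misc and tools should now read Primary/Secondary
[myname@labstore1005]:~ $ sudo systemctl status nfs-server   # should be active (running)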

At this point the normally-active server is ready for maintenance tasks

It is safe to reboot the server and do a lot of things to it. Be careful with the DRBD disks; they are still acting as replicas.

When you are done with everything, check sudo drbd-overview. If the volumes are not connected (e.g. showing a status like WFConnection), then on the server that is normally the active system but is currently "Secondary", try running sudo drbdadm connect all. If that doesn't change your fortunes, get some help unless you are comfortable troubleshooting DRBD.
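
A minimal reconnect attempt on that host might look like:

[myname@labstore1004]:~ $ sudo drbd-overview          # volumes show WFConnection or similar
[myname@labstore1004]:~ $ sudo drbdadm connect all
[myname@labstore1004]:~ $ sudo drbd-overview          # volumes should return to Connected and UpToDate/UpToDate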

When the status for DRBD looks good, and data is "UpToDate", you can start failing back.

Bring down the standby server's NFS processes

Run sudo /usr/local/sbin/nfs-manage down.

The same rules apply here if a mount won't unmount. It is reasonably safe to run this script again in a moment.

Bring up NFS on the active server, where it should be

Run sudo /usr/local/sbin/nfs-manage up.

Check DRBD again

Repeat #Check DRBD. It should look the same as when you started.

Enable puppet on all servers

Re-enabling puppet (and letting it run) should put back all the things you've done, such as disabling backups and stopping services with the nfs-manage script.
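
For example, on each of the hosts where you disabled puppet (both NFS servers and, for the primary cluster, both cloudbackup hosts):

[myname@labstore1004]:~ $ sudo -i puppet agent --enable
[myname@labstore1004]:~ $ sudo -i puppet agent --test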

End your downtime, if needed

You should be done!

Fix any VMs that have lost their minds

Depending on how long that took, the kernel versions in play, and the favor of the gods, you might need to reboot some VMs that just didn't re-mount things correctly. Other times you just need to wait a little bit, because there are some long timeouts.
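
For instance, on a misbehaving VM you might check whether the NFS mounts came back and respond; the hostname and mount point here are just illustrations and vary by project:

[myname@tools-sgebastion-08]:~ $ mount -t nfs4                  # list the active NFS mounts
[myname@tools-sgebastion-08]:~ $ timeout 10 ls /data/project    # hangs past the timeout or errors if the mount is stale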

Support contacts

At this time, your main escalation point is bstorm.

Related information

Portal:Data_Services/Admin/Shared_storage#Clusters