Portal:Data Services/Admin/Runbooks/Resync a drbd volume

The procedures in this runbook require admin permissions to complete.

Overview

As of 00:40, 15 July 2021 (UTC), there are two read/write NFS clusters:

  • primary (tools and all home/project dirs other than maps) - labstore1004 and labstore1005
  • secondary (maps home/project dirs and scratch) - cloudstore1008 and cloudstore1009

Both are defined as DRBD clusters in Puppet. If replication is badly interrupted, or the standby server is suspected of corruption (evidenced by LVM issues, disk problems, or similar symptoms), it can be necessary to force the pair back together by invalidating the standby server's copy of the data, so that it is overwritten in full from the active server.

Error / Incident

If you cannot get DRBD replication in sync (sudo drbd-overview never reports the two servers as in sync and replication shows errors), or if you are getting corrupted volumes after backups (which could also be caused by other errors on the backup side), you may want to try this operation. It should be rare: it has been done once in three years, for instance. If you find yourself doing it often, you probably need to fix the standby server or the DAC network connection between the active and standby servers.

Process

Disable alerts

Downtime DRBD status alerts. No need to upset everyone.

If this is the primary cluster, disable backups

cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet run backups against labstore1005 only. Backups should only run while labstore1005 is the standby server, so we disable them during failovers. Check whether a backup is running with systemctl status against the services mentioned below.

On cloudbackup2001.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-tools-project.service

On cloudbackup2002.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-misc-project.service

If the backups are currently running, make a call whether to stop the backup with systemctl (it only runs weekly) or let it finish before proceeding.
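The "is the backup running?" check can be scripted. The following is a minimal sketch, not part of the existing tooling: the function name backup_is_running is hypothetical, and it assumes the backup units may be oneshot-style services, which systemd reports as "activating" rather than "active" while their job is still running.

```shell
# Hypothetical helper: succeed if the given systemd unit is currently running.
# "systemctl is-active UNIT" prints the unit state (active, activating,
# inactive, failed, ...). A long-running oneshot job shows as "activating",
# so both states are treated as "running" here.
backup_is_running() {
    state=$(systemctl is-active "$1" 2>/dev/null || true)
    case "$state" in
        active|activating) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, on cloudbackup2001.codfw.wmnet:
#   backup_is_running block_sync-tools-project.service && echo "backup in progress"
# Example, on cloudbackup2002.codfw.wmnet:
#   backup_is_running block_sync-misc-project.service && echo "backup in progress"
```

If the check reports a backup in progress, the decision to stop it or wait is still a human one; the sketch only answers the yes/no question.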

Invalidate the DRBD Secondary

In general, for monitoring DRBD while you work, this command is nice:

  • sudo drbd-overview

A good result looks like:

[bstorm@labstore1004]:~ $ sudo drbd-overview
 1:test/0   Connected Primary/Secondary UpToDate/UpToDate /srv/test  ext4 9.8G 535M 8.7G 6%
 3:misc/0   Connected Primary/Secondary UpToDate/UpToDate /srv/misc  ext4 5.0T 1.8T 3.0T 38%
 4:tools/0  Connected Primary/Secondary UpToDate/UpToDate /srv/tools ext4 8.0T 5.7T 2.0T 75%

Invalidating the secondary is as simple as running drbdadm invalidate all ON THE STANDBY HOST. This declares the disks there to be in need of overwriting from scratch. Hopefully drbdadm will refuse to run this on the primary/active host, but do not rely on that: never run it on the current active host under any circumstances.
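Because running the invalidate on the wrong host would be destructive, a small guard can help. This is a sketch under assumptions, not an established procedure: the function name drbd_local_all_secondary is hypothetical, and it assumes drbd-overview prints the columns shown in the sample above, with the role column in "local/peer" order.

```shell
# Hypothetical guard: read drbd-overview output on stdin and succeed only if
# every resource reports this host's role as Secondary.
# Assumed columns: 1 = minor:resource/volume, 3 = role as "local/peer".
drbd_local_all_secondary() {
    awk '
        $1 ~ /^[0-9]+:/ {
            seen = 1
            split($3, role, "/")
            if (role[1] != "Secondary") {
                print $1 " local role is " role[1]
                bad = 1
            }
        }
        # Fail if any resource is not Secondary, or if no resource was seen.
        END { exit (bad || !seen) ? 1 : 0 }
    '
}

# Example, on the standby host only:
#   sudo drbd-overview | drbd_local_all_secondary && sudo drbdadm invalidate all
```

The guard fails on empty input as well, so a missing or broken drbd-overview does not get mistaken for a safe standby.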

Wait for things to sync back up (this will take a while and will slow down writes for a bit).

If all volumes show UpToDate/UpToDate, you should be good to go.
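Rather than re-reading the output by eye, the UpToDate/UpToDate check can be automated. The sketch below is illustrative only: the function name drbd_out_of_sync is hypothetical, and it assumes the drbd-overview column layout shown in the sample output earlier (disk state in the fourth column).

```shell
# Hypothetical helper: read drbd-overview output on stdin, print every
# resource whose disk state is not UpToDate/UpToDate, and succeed only if
# all resources are in sync.
# Assumed columns: 1 = minor:resource/volume, 4 = disk state as "local/peer".
drbd_out_of_sync() {
    awk '
        # Only resource lines start with "minor:"; other lines are skipped.
        $1 ~ /^[0-9]+:/ {
            seen = 1
            if ($4 != "UpToDate/UpToDate") { print $1, $4; bad = 1 }
        }
        # Fail if anything is out of sync, or if no resource was seen at all.
        END { exit (bad || !seen) ? 1 : 0 }
    '
}

# Example: poll once a minute until every volume is back in sync.
#   until sudo drbd-overview | drbd_out_of_sync; do sleep 60; done
```

While a resync is in progress, the function prints the lagging resources on each poll, which doubles as a crude progress log.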

Support contacts

Anyone doing this will almost certainly be on the Cloud Services team, so find a coworker!

Related information

  • Portal:Data_Services/Admin/Shared_storage#Clusters
  • https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-admin-manual