Portal:Data Services/Admin/Runbooks/Storage issue on drbd standby

From Wikitech
The procedures in this runbook require admin permissions to complete.

Overview

There are two read/write NFS clusters at this time (00:40, 15 July 2021 (UTC)):

  • primary (tools and all home/project dirs other than maps) - labstore1004 and labstore1005
  • secondary (maps home/project dirs and scratch) - cloudstore1008 and cloudstore1009

Both are defined as clusters via DRBD in puppet.

If the DRBD standby host in a cluster develops a problem and is expected to remain unreliable until it is repaired, it may be a good idea to shut down backups and disable DRBD replication.

Error / Incident

Any unreliability in the "DRBD Secondary" (standby) host can cause latency for NFS clients. A good example of a case where it was sensible to disconnect the DRBD standby is task T290318.

Process

Disable alerts

Downtime DRBD status alerts. No need to upset everyone.

If this is the primary cluster, disable backups

cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet run backups against labstore1005 only. Backups should only run while labstore1005 is the standby server, so we disable them during failovers. Check whether a backup is running with systemctl status against the services mentioned below.

On cloudbackup2001.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-tools-project.service

On cloudbackup2002.codfw.wmnet:

  • sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
  • sudo systemctl disable block_sync-misc-project.service

If a backup is currently running, decide whether to stop it with systemctl (backups run only weekly) or let it finish before proceeding.
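As a rough sketch of the check described above, the helper below reports whether a backup unit looks busy. The function name backup_state is hypothetical. Treating both "active" and "activating" as running is an assumption: a oneshot unit reports "activating" while its job executes, and this runbook does not say which type these units are.

```shell
#!/bin/bash
# backup_state UNIT: hypothetical helper, not part of the existing tooling.
# Prints whether the given systemd unit appears to be running right now.
backup_state() {
    local unit="$1"
    local state
    # "systemctl is-active" prints the unit state (active, activating,
    # inactive, failed, ...) on stdout.
    state=$(systemctl is-active "$unit")
    case "$state" in
        active|activating)
            echo "$unit: running ($state)"
            ;;
        *)
            echo "$unit: not running ($state)"
            ;;
    esac
}

# Example, on cloudbackup2001.codfw.wmnet:
#   backup_state block_sync-tools-project.service
# If you decide to stop a running backup:
#   sudo systemctl stop block_sync-tools-project.service
```

The same call works on cloudbackup2002.codfw.wmnet with block_sync-misc-project.service.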

Disconnect the DRBD secondary

In general, for monitoring DRBD while you work, this command is nice:

  • sudo drbd-overview

A good result looks like:

[bstorm@labstore1004]:~ $ sudo drbd-overview
 1:test/0   Connected Primary/Secondary UpToDate/UpToDate /srv/test  ext4 9.8G 535M 8.7G 6%
 3:misc/0   Connected Primary/Secondary UpToDate/UpToDate /srv/misc  ext4 5.0T 1.8T 3.0T 38%
 4:tools/0  Connected Primary/Secondary UpToDate/UpToDate /srv/tools ext4 8.0T 5.7T 2.0T 75%

Disconnect the volumes by running:

  • sudo drbdadm disconnect all on the active/primary host (e.g. labstore1004 or cloudstore1008)

This puts each volume on the active host into the "StandAlone" connection state until you reconnect things. drbd-overview should now look like:

[bstorm@labstore1004]:~ $ sudo drbd-overview
 1:test/0   StandAlone Primary/Unknown UpToDate/DUnknown /srv/test  ext4 9.8G 535M 8.7G 6%
 3:misc/0   StandAlone Primary/Unknown UpToDate/Outdated /srv/misc  ext4 5.0T 1.9T 2.9T 39%
 4:tools/0  StandAlone Primary/Unknown UpToDate/Outdated /srv/tools ext4 8.0T 6.5T 1.2T 86%
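To confirm the disconnect took effect on every volume without reading the table by eye, something like the following could be used. It is only a sketch: all_in_state is a hypothetical helper, and it assumes the connection state is the second whitespace-separated column of each resource line, as in the sample output above.

```shell
#!/bin/bash
# all_in_state STATE: hypothetical helper that reads drbd-overview output on
# stdin and succeeds only if every resource line has STATE in column 2.
all_in_state() {
    local want="$1"
    awk -v want="$want" '
        # Resource lines start with "<minor>:<name>/<volume>"; skip anything else.
        $1 ~ /^[0-9]+:/ {
            seen = 1
            if ($2 != want) {
                print "unexpected: " $1 " is " $2
                bad = 1
            }
        }
        # Fail if any resource was in another state or no resource was seen.
        END { if (bad || !seen) exit 1 }
    '
}

# Example, on the active host:
#   sudo drbd-overview | all_in_state StandAlone && echo "all volumes disconnected"
```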

After things are fixed

When the problem is fixed and the standby is stable again, you can run:

  • sudo systemctl start drbd on the standby, if it shows "Unconfigured" in drbd-overview
  • sudo drbdadm connect all on the active host, and drbd-overview should show a resync happening.
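A resync of volumes this size can take a while, so it may be convenient to poll until it finishes. The sketch below assumes the disk states are the fourth column of each resource line in drbd-overview, as in the sample output earlier on this page. The name resync_done and the 60-second interval are illustrative choices, not existing tooling.

```shell
#!/bin/bash
# resync_done: hypothetical helper that reads drbd-overview output on stdin and
# succeeds only when every resource line reports UpToDate/UpToDate disk states.
resync_done() {
    awk '
        # Only look at resource lines ("<minor>:<name>/<volume> ...");
        # any other lines in the output are ignored.
        $1 ~ /^[0-9]+:/ {
            seen = 1
            if ($4 != "UpToDate/UpToDate") bad = 1
        }
        END { if (bad || !seen) exit 1 }
    '
}

# Example, on the active host (poll once a minute until the resync is done):
#   until sudo drbd-overview | resync_done; do sleep 60; done
#   echo "resync complete"
```

Once the volumes are back in sync, the backup services and puppet that were disabled earlier can be re-enabled.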

Support contacts

Anyone doing this will almost certainly be on the Cloud Services team, so find a coworker!

Related information

  • Portal:Data_Services/Admin/Shared_storage#Clusters
  • https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-admin-manual