Obsolete:Portal:Data Services/Admin/Runbooks/Storage issue on drbd standby
Overview
There are two read/write NFS clusters at this time (as of 15 July 2021):
- primary (tools and all home/project dirs other than maps) - labstore1004 and labstore1005
- secondary (maps home/project dirs and scratch) - cloudstore1008 and cloudstore1009
Both are defined as clusters via DRBD in puppet.
If the DRBD standby host in a cluster has an issue and is expected to remain unreliable until repaired, it may be a good idea to shut down backups and disable DRBD replication.
Error / Incident
Any unreliability in the "DRBD Secondary" (standby) host can cause latency for NFS clients. Task T290318 is a good example of a case where it made sense to disconnect the DRBD standby.
Process
Disable alerts
Downtime DRBD status alerts. No need to upset everyone.
If this is the primary cluster, disable backups
cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet run backups against labstore1005 only. Backups should run only when labstore1005 is the standby server, which is why we disable them during failovers.
Check whether a backup is running with systemctl status commands against the services mentioned below.
On cloudbackup2001.codfw.wmnet:
sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
sudo systemctl disable block_sync-tools-project.service
On cloudbackup2002.codfw.wmnet:
sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
sudo systemctl disable block_sync-misc-project.service
If a backup is currently running, make a call whether to stop it (it runs only weekly) with systemctl or let it finish before proceeding.
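As a minimal sketch of that decision, the helper below maps a systemd unit state to the action this runbook suggests. The function name and the hard-coded state are illustrative; in production the state would come from something like `systemctl is-active block_sync-tools-project.service`.

```shell
# Sketch: decide what to do based on a backup unit's state before failover.
# On the backup host, the real state would come from:
#   state=$(systemctl is-active block_sync-tools-project.service)
backup_advice() {
  case "$1" in
    active)        echo "backup running: stop it with systemctl or let it finish" ;;
    inactive|dead) echo "backup idle: safe to proceed" ;;
    *)             echo "unexpected state '$1': investigate before proceeding" ;;
  esac
}
backup_advice inactive
```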
Disconnect the DRBD secondary
In general, for monitoring DRBD while you work, this command is nice:
sudo drbd-overview
A good result looks like:
[bstorm@labstore1004]:~ $ sudo drbd-overview
1:test/0 Connected Primary/Secondary UpToDate/UpToDate /srv/test ext4 9.8G 535M 8.7G 6%
3:misc/0 Connected Primary/Secondary UpToDate/UpToDate /srv/misc ext4 5.0T 1.8T 3.0T 38%
4:tools/0 Connected Primary/Secondary UpToDate/UpToDate /srv/tools ext4 8.0T 5.7T 2.0T 75%
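A quick way to spot trouble in that output is to filter out the healthy lines. This is a sketch using hard-coded sample output; on the host you would pipe `sudo drbd-overview` into the grep instead.

```shell
# Flag any DRBD resource that is not both Connected and UpToDate/UpToDate.
# Sample output is hard-coded for illustration; on the host, replace the
# echo with: sudo drbd-overview
drbd_sample='1:test/0 Connected Primary/Secondary UpToDate/UpToDate /srv/test ext4 9.8G 535M 8.7G 6%
3:misc/0 StandAlone Primary/Unknown UpToDate/Outdated /srv/misc ext4 5.0T 1.9T 2.9T 39%'
unhealthy=$(echo "$drbd_sample" | grep -vE 'Connected .*UpToDate/UpToDate' || true)
if [ -n "$unhealthy" ]; then
  echo "needs attention:"
  echo "$unhealthy"
else
  echo "all resources healthy"
fi
```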
Disconnect the volumes by running:
sudo drbdadm disconnect all
on the active/primary host (e.g. labstore1004 or cloudstore1008).
This declares the system to be a "Standalone" until you reconnect things.
drbd-overview
should now look like
[bstorm@labstore1004]:~ $ sudo drbd-overview
1:test/0 StandAlone Primary/Unknown UpToDate/DUnknown /srv/test ext4 9.8G 535M 8.7G 6%
3:misc/0 StandAlone Primary/Unknown UpToDate/Outdated /srv/misc ext4 5.0T 1.9T 2.9T 39%
4:tools/0 StandAlone Primary/Unknown UpToDate/Outdated /srv/tools ext4 8.0T 6.5T 1.2T 86%
After things are fixed
When the problem is fixed and the standby is stable again, you can run:
sudo systemctl start drbd
on the standby, if it shows "Unconfigured" in drbd-overview. Then run:
sudo drbdadm connect all
on the active host, and drbd-overview should show a resync happening.
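During a resync, drbd-overview lines show states like SyncSource/SyncTarget and Inconsistent on the side being rebuilt. The sketch below detects that with a hard-coded sample line; on the active host you would pipe `sudo drbd-overview` instead.

```shell
# Sketch: detect a resync in progress after reconnecting. The sample line
# is illustrative; on the active host, replace the echo with:
#   sudo drbd-overview
resync_sample='4:tools/0 SyncSource Primary/Secondary UpToDate/Inconsistent /srv/tools ext4 8.0T 6.5T 1.2T 86%'
if echo "$resync_sample" | grep -qE 'Sync(Source|Target)'; then
  echo "resync in progress"
else
  echo "no resync detected"
fi
```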
Support contacts
Anyone doing this will almost certainly be on the Cloud Services team, so find a coworker!
Related information
Portal:Data_Services/Admin/Shared_storage#Clusters
https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-admin-manual