This page may be outdated or contain incorrect details. Please update it if you can.
We have recently migrated off this setup.
NFS is served to eqiad labs from one of two servers (labstore1001 and labstore1002) which are connected to a set of five MD1200 disk shelves.
Each server is (ostensibly, see below) connected to all five shelves, with three shelves on one port of the controller and two on the other. Each shelf holds 12 1.8TB SAS drives, and the controller is configured to expose them to the OS as single-disk raid 0 arrays (the H800 controller does not support a true JBOD configuration). In addition, both servers have (independently) 12 more 1.8TB SAS drives in the internal bays.
The shelves have been disconnected from labstore1001 since the July outage, as we no longer trust the OS not to attempt to assemble the raid arrays simultaneously; they are intended to be reconnected once SCSI reservation has been tested.
The internal disks are visible to the OS as /dev/sda through /dev/sdl, and the shelves' disks as /dev/sdm through /dev/sdbt. (A quick early diagnostic is visible at the end of POST as the PERCs start up; normal operation should report 72 exported disks.)
The external shelves are configured as raid10 arrays of 12 drives, constructed from six drives on one shelf and six drives on a different shelf (such that no single raid10 array relies on any one shelf). MD numbering is not guaranteed to be stable between boots, but the current arrays are normally numbered as listed below.
When the raid arrays were originally constructed, they were named arbitrarily according to the order in which they were connected (since, at the time, each shelf was a self-contained raid6 array), with shelf1 through shelf4 matching labstore1001-shelf1 to labstore1001-shelf4. When a fifth shelf was installed, requiring a split between the two ports, labstore1001-shelf4 was renamed to labstore1002-shelf1 and the new shelf was added as labstore1002-shelf2 (and named shelf5).
This naming was kept conceptually when the raids were converted to raid 10:
- /dev/md/shelf32 (first 6 drives of shelf3, last 6 drives of shelf2)
- /dev/md/shelf23 (first 6 drives of shelf2, last 6 drives of shelf3)
- /dev/md/shelf51 (first 6 drives of shelf5, last 6 drives of shelf1)
- /dev/md/shelf15 (first 6 drives of shelf1, last 6 drives of shelf5)
- /dev/md/shelf44 (all 12 drives of shelf4)
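The shelfXY naming convention above is regular enough to decode mechanically. The following is a small illustrative helper (not part of the real setup) that reproduces the pairing from the array names:

```shell
# Illustrative only: shelfXY means "first 6 drives on shelfX, last 6 on
# shelfY"; shelf44 is the degenerate case of all 12 drives on shelf4.
decode() {
  digits=${1#shelf}      # e.g. "32"
  first=${digits%?}      # e.g. "3"
  second=${digits#?}     # e.g. "2"
  if [ "$first" = "$second" ]; then
    echo "/dev/md/$1: all 12 drives of shelf$first"
  else
    echo "/dev/md/$1: first 6 drives of shelf$first, last 6 of shelf$second"
  fi
}
for a in shelf32 shelf23 shelf51 shelf15 shelf44; do decode "$a"; done
```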
There is one shelf that is known to have had issues with the controller on labstore1002 (shelf4, above); it was avoided in the current setup and is not currently used.
In addition, the first two drives of the internal bay are configured as a raid1 (md0) for the OS.
Each shelf array is configured as an LVM physical volume, and pooled in the labstore volume group, from which all shared volumes are allocated.
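As a rough sketch, pooling arrays like these into the volume group and carving out a volume would look like the following; the block only prints the commands (dry run), and the volume name and size are illustrative, not the real allocations:

```shell
# Dry-run sketch: print the LVM commands rather than run them.
arrays="/dev/md/shelf32 /dev/md/shelf23 /dev/md/shelf51 /dev/md/shelf15"
pool_cmds="pvcreate $arrays
vgcreate labstore $arrays
lvcreate -n tools -L 8T labstore"
printf '%s\n' "$pool_cmds"
```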
There is still a backup volume group containing the internal drives of labstore1002 (not counting the OS-allocated drives) that holds old images, but that VG is not in active use anymore.
The labstore volume group contains four primary logical volumes:
- labstore/tools, shared storage for the tools project
- labstore/maps, shared storage for the maps project
- labstore/others, containing storage for all other labs projects
- labstore/scratch, containing the labs-wide scratch storage
Conceptually, the volumes are mounted under /srv, with /srv/others being the mountpoint of the "others" volume and the project-specific volumes mounted under /srv/project/; this is configured in /etc/fstab and must be adjusted accordingly if new project-specific volumes are made.
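As an illustration, the corresponding /etc/fstab entries might look like this; the device paths, mountpoints and mount options here are assumptions, not the actual file:

```
# hypothetical /etc/fstab excerpt
/dev/labstore/others  /srv/others         ext4  defaults  0 2
/dev/labstore/tools   /srv/project/tools  ext4  defaults  0 2
/dev/labstore/maps    /srv/project/maps   ext4  defaults  0 2
```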
In addition to the shared storage volume, the volume group also contains transient snapshots made during the backup process.
NFS version 4 exports from a single, unified tree (/exp/ in our setup). This tree is populated with bind mounts taking the various subdirectories of /srv, and is kept in sync with changes there by the /usr/local/sbin/sync-exports script. This is matched with the actual NFS exports in /etc/exports.d, one file per project.
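A per-project file in /etc/exports.d might look roughly like the following; the export path, client range and options are placeholders, not the real configuration:

```
# hypothetical /etc/exports.d/tools.exports
/exp/project/tools  10.0.0.0/8(rw,sync,no_subtree_check,root_squash)
```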
One huge caveat that needs to be noted: it is imperative that sync-exports be executed before NFS is started, as this sets up the actual filesystems to be exported (through the bind mounts). If NFS is started before that point, any NFS client will notice the changed root inode and will remain stuck with "stale NFS handle" errors until a reboot (whereas clients should otherwise be able to recover from any outage, since all NFS mounts are hard).
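Conceptually, what sync-exports does before nfsd starts can be sketched as a dry run; the project names are examples and the real script's logic is more involved:

```shell
# Dry-run sketch: bind-mount each project subtree of /srv into the unified
# /exp tree (printed, not executed; requires root for real).
bind_for() {
  echo "mkdir -p /exp/project/$1 && mount --bind /srv/project/$1 /exp/project/$1"
}
for p in tools maps; do bind_for "$p"; done
```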
Actual NFS service is provided through a service IP (distinct from the servers' own) which is set up by the start-nfs script as the last step before the actual NFS server is started; this allows the IP to be moved to whichever server is the active one. Provided that the same filesystems are presented, the clients will not even notice the interruption in service.
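A minimal dry-run sketch of those final steps; the address, prefix and interface are placeholders (192.0.2.0/24 is the documentation range), not the real service IP:

```shell
# Dry-run sketch: print the gist of bringing up the service IP and nfsd.
svc_up() {
  echo "ip addr add $1 dev $2"
  echo "systemctl start nfs-kernel-server"
}
svc_up 192.0.2.10/27 eth0
```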
Backups are handled through systemd units, invoked by timers. Copies are made by (a) making a snapshot of the filesystem, (b) mounting it read-only and (c) doing an rsync to codfw's labstore to update that copy.
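One replication cycle can be sketched as a dry run; the snapshot size, mountpoint and remote host name are placeholders, not the real values:

```shell
# Dry-run sketch of steps (a)-(c) for a single filesystem (printed only).
backup_cmds() {
  fs=$1
  echo "lvcreate --snapshot -L 1T -n ${fs}-snap labstore/${fs}"
  echo "mount -o ro /dev/labstore/${fs}-snap /mnt/${fs}-snap"
  echo "rsync -a --delete /mnt/${fs}-snap/ codfw-labstore:/srv/${fs}/"
}
backup_cmds others
```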
Every "true" filesystem is copied daily through the replicate-* units, one for each respectively named filesystem (e.g. replicate-others). The snapshots are kept until full, and cleaned up by the
There are icinga alerts for any of those units not having been run (successfully) in the past 25 hours.
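For illustration, a timer driving one of these units might look like the following; the unit name and schedule are assumptions, not the actual units:

```
# hypothetical replicate-others.timer
[Unit]
Description=Daily replication of the others filesystem to codfw

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```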
Overuse from clients
NFS provides very little per-user load information, but in case the load becomes abnormally high, it is generally possible to find the culprit:
- run iptraf on labstore1001; outliers tend to be obvious by an order of magnitude
- run dig -x $IP from a labs host to identify which instance it is
- on the instance, iotop will help track down the outliers
- you can kill the offending processes at need, or track down the user to verify what is up
It is possible to switch NFS service from one labstore to the other (they serve as cold failover for each other). Doing so, at this time, requires a physical intervention at the DC (but one simple enough to hand off to smart hands):
- shut down the currently active server
- disconnect the active server from the shelves, and connect the failover one
- reboot the failover server (to ensure the RAID controller reinitializes), and run start-nfs on it
This operation will move the service IP to the now-active server and resume operation.
A caveat was noticed when doing a switch in the past: it is possible that some or all of the backup snapshot volumes refuse to activate when created on a different server (the exact underlying cause has not yet been identified). If this happens, booting will stall at the 'Activating LVM volumes' phase. The fix is simple: since the volumes are explicitly expendable, first boot into single-user mode and delete the snapshots.
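The cleanup from single-user mode can be sketched as a dry run; the snapshot names are placeholders (lvremove is the standard LVM command for removing logical volumes, snapshots included):

```shell
# Dry-run sketch: print the removal command for each leftover snapshot.
rm_snap() { echo "lvremove -f labstore/$1"; }
for s in others-snap tools-snap; do rm_snap "$s"; done
```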