Incidents/20150331-LabsNFS-Filesystem-Switch

Summary

The planned switch of the filesystem underlying the Labs NFS service unexpectedly caused instances running Ubuntu Precise to require a reboot rather than the simple filesystem remount that had been expected. As a consequence, a large number of Labs projects (including Tool Labs) were negatively impacted while those instances were diagnosed and restarted. In addition, the large number of instances being restarted overloaded the Labs virtualization infrastructure, slowing recovery.

Normal operation returned around 22h UTC for Tool Labs, and for Labs generally around 00h UTC, with minor residual issues (consistent with a Labs-wide outage) gradually fixed as they were reported. The rsync did, however, cause another outage a few hours later.

Timeline

21:00 turned NFS off
21:01 rotated the filesystems between the old (flat) one and the new (thin) one
21:03 turned NFS back on
          At that point, instances were expected to no longer be able to access files
21:05 unmount and remount the NFS filesystems on the instances of the tools project (which was first in line)
          At that point, Trusty instances reacted as planned (quickly recovered, with the filesystem available) but Precise instances were unable to detach the old mount and kept operating against the former filesystem
21:12 Began diagnosing the issue, with bastion-restricted-01 (still running Precise) as the guinea pig
21:25 Confirmed that unmounting the previous filesystem was broken (rather than unmounting, the mountpoint was converted to a _(deleted) faux-file that cannot be operated upon)
21:28 Attempted a reboot of bastion-restricted-01 to see if the bootstrap mounts would work - they did.
21:36 Confirmed that Precise instances could be fixed by a reboot, by doing so on tools-master.
21:41 Proceeded to reboot all of the tools project's Precise instances, in an order intended to speed recovery as much as possible
21:45 Yuvi proceeds to reboot affected instances of deployment-prep
21:56 Confirmation that tool labs is recovering
22:01 Restarted most jobs as gridengine recovers, web services mostly back online
22:10 Some reports come in of older versions of files being visible; the outage seems to have caused the loss of the most recent sync
22:11 Yuvi proceeds to use salt to remount the filesystem on Trusty and Jessie instances (see the sketch after the timeline).  Salt succeeds but fails to report success
22:23 Andrew generates a list of Precise instances to reboot, which are then rebooted at a 5s interval
22:30 Tremendous load on virtualization hosts, as instances reboot (sometimes many to a host), causes regular hangs of groups of instances, but the reboots manage to push through
22:53 rsync started to recover the last changes from the old filesystem
23:30 Andrew halts a few especially greedy instances to ease CPU load on virt1004 and virt1011
23:35 Labs stabilizes
00:09 After some minor point fixes, we declare Labs to be back up
13:42 Ran rsync one last time to ensure no out-of-date files
15:20 Final filesystem rsync confirms no out-of-date files
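
For reference, a minimal sketch of the remount step at 21:05 and 22:11, assuming Salt's Python LocalClient running as root on the salt master. The mount points (/data/project, /home), the targeting glob, and the exact unmount/remount command are assumptions for illustration; the report does not record the commands actually run.

  # Minimal sketch, assuming salt master access; not the incident's actual commands.
  import salt.client

  local = salt.client.LocalClient()

  # Lazily unmount and then remount each NFS share on the targeted instances.
  for mountpoint in ('/data/project', '/home'):   # assumed mount points
      result = local.cmd(
          'tools-*',                # hypothetical glob targeting the tools project
          'cmd.run',
          ['umount -l {0} && mount {0}'.format(mountpoint)],
          timeout=60,
      )
      for minion, output in sorted(result.items()):
          print('{0}: {1}'.format(minion, output or 'ok'))

On Trusty and Jessie instances a remount of this kind sufficed; on Precise the unmount instead left the _(deleted) faux-mountpoint described above, which is why those instances had to be rebooted.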

Conclusions

The outage to Precise instances seems to have been unavoidable, but if that configuration had been tested we might have been able to plan for it and schedule a proper maintenance window for the restarts, as opposed to having to recover from an unplanned outage.

Actionables

  • Create a method to schedule restarts of instances, staggered so that they cannot overload the virtualization hosts (a sketch of the idea follows this list): https://phabricator.wikimedia.org/T94613
  • Create 'checklists' for all planned maintenance that should be followed. https://phabricator.wikimedia.org/T94608
  • Make certain to include all extant OS flavors when testing a planned change to the infrastructure. While all new installs are Trusty or Jessie, a number of other releases remain in the fleet and may react differently. (Should be part of checklist)
  • Schedule a lot more time for any planned maintenance on NFS, no matter how trivial (should be part of checklist)
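
A minimal sketch of the staggered-restart idea from T94613, assuming instances can be listed together with their virtualization host. The reboot() callable, the per-wave delay, and the instance listing are hypothetical stand-ins for whatever OpenStack tooling would actually drive the restarts.

  # Minimal sketch of staggered restarts; helpers and delay are assumptions.
  import itertools
  import time

  PER_WAVE_DELAY = 60  # seconds between reboot waves (assumed value)

  def staggered_reboot(instances, reboot):
      """instances: iterable of (instance_name, virt_host) pairs; reboot: callable taking a name."""
      by_host = {}
      for name, host in instances:
          by_host.setdefault(host, []).append(name)

      # Round-robin across hosts: each wave reboots at most one instance per virt
      # host, then waits, so the concurrent reboot load on any one host stays bounded.
      for wave in itertools.zip_longest(*by_host.values()):
          for name in wave:
              if name is not None:
                  reboot(name)
          time.sleep(PER_WAVE_DELAY)

This bounds the damage seen at 22:30, where many instances rebooting on the same virt host caused groups of instances to hang.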