Incidents/20150413-LabsNFS

Summary

Labstore1001 became unresponsive to NFS because its block device subsystem was overloaded. A combination of factors caused a cascade failure resulting in complete resource starvation for NFS.

Timeline

17:30 Marc noticed NFS is slow during a fairly large copy, thinks the copy might be the cause
17:34 Initial diagnostics on labstore1001 show NFS is starved of disk bandwidth
17:41 labstore1001 kernel stuck on kworker and bdflush processes, starved for IO
17:50 increase IO priority of NFS process hoping to improve interactive performance
18:00 minor improvements visible, some NFS service restored, but not much
18:17 _joe_ joins investigation
18:20 one of the shelves is rebuilding raid6, suspected of being an issue but reducing bandwidth does not visibly help
18:29 stopping NFS and unmounting filesystems to cold start labstore1001
18:33 cold start of labstore1001, go into MD800 BIOS to check diagnostics
18:36 No failures in the hardware reported, proceeding with boot
18:44 NFS back up
18:53 NFS spotty at best, very high iowait still noticeable with disk usage pegged
19:02 bblack joins investigation
19:33 Another cold start to do the bootstrap manually, trying to isolate the problem component
19:38 attempt to let the raid resync proceed, would take 20h at current (overly slow) rate
20:05 mark suggests tuning stripe_cache_size to increase rebuild speed; this increases efficiency tenfold
20:08 noted how raid6 is bound to a single CPU in labstore1001's older kernel, no further improvement in raid6 speed possible
20:12 reduce rebuild speed to leave some IO bandwidth, restart NFS
20:22 NFS returns to reasonable working order, with some intermittent sluggishness
20:47 Most things return to working order, while Coren and Yuvi restore some services that did not survive the outage
21:12 All services back to normal, but iowait remains high
01:34 iowait on labstore1001 returns to normal patterns.  Cause unknown as rebuild still in progress.
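
The two tuning steps in the timeline (raising stripe_cache_size at 20:05 and reducing the rebuild speed at 20:12) correspond to attributes that the Linux md driver exposes under /sys/block/<array>/md/. The following is a minimal Python sketch of reading and writing those attributes, not a record of the commands actually run during the incident. The array name "md0" and the values in the usage note are hypothetical, since the report does not name the device or the values chosen; the helper names and the sysfs_root parameter are likewise assumptions made for illustration.

```python
from pathlib import Path


def md_attr_path(array, attr, sysfs_root="/sys"):
    """Build the sysfs path of an md attribute, e.g. /sys/block/md0/md/sync_speed_max."""
    return Path(sysfs_root) / "block" / array / "md" / attr


def read_md_attr(array, attr, sysfs_root="/sys"):
    """Return the current value of an md attribute as a stripped string."""
    return md_attr_path(array, attr, sysfs_root).read_text().strip()


def write_md_attr(array, attr, value, sysfs_root="/sys"):
    """Write a new value to an md attribute (needs root on a real system)."""
    md_attr_path(array, attr, sysfs_root).write_text(f"{value}\n")
```

On a real host, write_md_attr("md0", "stripe_cache_size", 4096) would enlarge the stripe cache, and write_md_attr("md0", "sync_speed_max", 50000) would cap the resync rate (in KiB/s) to leave bandwidth for NFS. Values written this way do not persist across reboots, which is why the setting needs to be applied at boot.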

Conclusions

There seems to be no single, isolable cause of the outage. Rather, a combination of factors seems to have pushed the demand on disk bandwidth beyond the capacity of the system, to the point of cascading failure. The stripe_cache_size setting was left at the kernel default, which is too small; this amplified the drain on resources caused by the raid resync to the point where a buffer flush initiated by the kernel ended up starving all processes of disk bandwidth. In addition, the older kernel (from Precise) has a bottleneck: raid6 checksum calculation is bound to a single CPU, which aggravated matters.
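
Raising stripe_cache_size trades memory for rebuild throughput. The commonly cited estimate for the md stripe cache is page size × number of member disks × stripe_cache_size. The sketch below works through that arithmetic; the function name, the 12-disk array, and the 4096-byte page size are illustrative assumptions, not figures from labstore1001.

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Estimate stripe cache memory: one page per member disk per cache entry."""
    return stripe_cache_size * nr_disks * page_size


# Hypothetical 12-disk array with 4 KiB pages.
small = stripe_cache_bytes(256, 12)    # 12582912 bytes (12 MiB)
large = stripe_cache_bytes(4096, 12)   # 201326592 bytes (192 MiB)
```

Under these assumptions, a sixteenfold increase in the setting costs under 200 MiB of RAM, which is small next to the benefit observed during the rebuild.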

Actionables

Move to a more modern kernel as swiftly as practical https://phabricator.wikimedia.org/T94609
Make certain the stripe_cache_size setting is puppetized and applied at boot https://phabricator.wikimedia.org/T96045
Formulate plans for getting off of raid6 for labs NFS storage. https://phabricator.wikimedia.org/T96063
Formulate plans for reducing unnecessary NFS I/O by pushing projects to use local storage for heavy I/O traffic. https://phabricator.wikimedia.org/T96065