Incidents/20150830-labs-nfs
Summary
A CPU soft lockup on labstore1002 caused load to skyrocket and NFS mounts to fail. After a reboot, two shelves were not assembled and the sdaw device was unreadable, resulting in missing PVs (leaving the LVs inactive). Another power cycle (via 'shutdown -h now' followed by a power-up from the mgmt console) did not bring the arrays back, but the sdaw errors disappeared. A manual reassembly of the two shelves worked and service was restored, for a total of about 4 hours of downtime.
Timeline
(All times in UTC +2, 30 Aug 2015)
- 11:30: Shinken alerts about toollabs being down. A quick IRC check shows multichill reporting the same.
- 11:33: Yuvi ssh's to labstore1002, confirms high load. Sees 'NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:5:17071]'. <link-to-logs>.
- 11:40: Yuvi starts paging people.
- 11:50: godog responds and starts investigating.
- 11:53: Yuvi saves kernel logs, reboots labstore1002.
- 12:04: labstore1002 comes back up, but with errors about activating all lvm volumes.
- 12:06: start-nfs fails, lvs reports device sdaw is unreadable.
- 12:26: After digging around a bit, godog attempts to activate the tools volume. Fails, citing missing PVs. godog continues investigating.
- 12:28: Yuvi verifies that backups to labstore2001 were working properly, and that the latest backup was about 6h old.
- 12:50: Valhallasw sets up an error page on tools.wmflabs.org to announce the NFS failure.
- 13:17: Assuming it's the RAID controller being wonky again, labstore1002 is 'hard' rebooted from mgmt. No more sdaw errors, but PVs are still missing: two arrays aren't assembled, so the LVs can't be mounted.
- 13:56: Yuvi pages Chris, to attempt to get to the DC to inspect the controller and possibly switch over to labstore1001.
- 14:04: Chris responds. Also tells us that last time, multiple reboots fixed the controller, so Yuvi proposes trying that.
- 14:16: godog attempts to manually assemble the two missing arrays, which should restore service.
- 14:32: godog assembles the two missing arrays manually.
- 14:48: Manual assembly seems to have worked - LVs can be activated now.
- 14:51: After verifying that the LVs are mountable, godog runs start-nfs. Mounts still not back in labs instances, however.
- 15:02: Yuvi realizes we need to run the 'sync-exports' script as well.
- 15:14: sync-exports by itself doesn't seem to have worked; a restart of NFS was needed as well.
- 15:15: NFS is back, and instances start recovering!
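The manual recovery steps from 14:16 onwards can be sketched as a dry-run shell script. Array names, member devices, the volume group name, and the exact service names are assumptions for illustration (the incident does not record them); the script only echoes the commands it would run, so nothing here touches real storage:

```shell
#!/bin/bash
# Dry-run sketch of the recovery sequence; device/array/VG names are hypothetical.
run() { echo "+ $*"; }   # print the command instead of executing it

# 14:16-14:32 - manually assemble the two arrays the controller left out
run mdadm --assemble /dev/md125 /dev/sdq /dev/sdr /dev/sds
run mdadm --assemble /dev/md126 /dev/sdt /dev/sdu /dev/sdv

# 14:48 - with the PVs visible again, activate the LVs in the volume group
run vgchange -ay labstore

# 14:51 - bring NFS back up
run start-nfs

# 15:02-15:14 - regenerate exports, then restart NFS so they take effect
run sync-exports
run service nfs-kernel-server restart
```

Running the script prints each command prefixed with '+', which makes it easy to review the sequence before executing it for real.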
Action Items
- Switch NFS Server to labstore1001 bug T107038
- Fix / Test / Replace labstore1002 bug T95293
- Better documentation for our NFS setup bug T88723
- Fold sync-exports into nfs-exports daemon bug T102520
- Verify and add more checks for Labs NFS paging bug T101650