Incidents/20150830-labs-nfs

From Wikitech

Summary

A CPU soft lockup on labstore1002 caused load to skyrocket and NFS mounts to fail. After a reboot, two shelves were not assembled and the sdaw device was unreadable, resulting in missing PVs (which left the LVs inactive). Another power cycle (via shutdown -h now, then a power-up from mgmt) did not help with the arrays, but the sdaw errors disappeared. Manually reassembling the two shelves worked, and service was restored after about 4h of downtime.

Timeline

(All times are UTC+2, 30 Aug 2015)

  • 11:30: Shinken alerts about toollabs being down. A quick IRC check shows multichill reporting the same.
  • 11:33: Yuvi ssh's to labstore1002, confirms high load. Sees 'NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:5:17071]'. <link-to-logs>.
  • 11:40: Yuvi starts paging people.
  • 11:50: godog responds and starts investigating.
  • 11:53: Yuvi saves kernel logs, reboots labstore1002.
  • 12:04: labstore1002 comes back up, but with errors about activating all LVM volumes.
  • 12:06: start-nfs fails, lvs reports device sdaw is unreadable.
  • 12:26: After digging around a bit, godog attempts to activate the tools volume. It fails, citing missing PVs. godog continues the investigation.
  • 12:28: Yuvi verifies that backups to labstore2001 were working properly, and that the latest backup was about 6h old.
  • 12:50: Valhallasw sets up an error page on tools.wmflabs.org to announce the NFS failure.
  • 13:17: Assuming it's the RAID controller being wonky again, labstore1002 is 'hard' rebooted, from mgmt. No more sdaw errors, but PVs are still missing - two arrays aren't assembled, can't mount LVs.
  • 13:56: Yuvi pages Chris to ask him to get to the DC and inspect the controller, and maybe switch over to labstore1001.
  • 14:04: Chris responds. He also tells us that last time, multiple reboots fixed the controller, so Yuvi proposes trying that.
  • 14:16: godog starts trying to assemble the two missing arrays manually, which should restore service.
  • 14:32: godog assembles the two missing arrays manually.
  • 14:48: Manual assembly seems to have worked - LVs are activatable now.
  • 14:51: After verifying that the LVs are mountable, godog runs start-nfs. Mounts still not back in labs instances, however.
  • 15:02: Yuvi realizes we need to run the 'sync-exports' script as well.
  • 15:14: sync-exports by itself doesn't seem to have worked; a restart of NFS was needed as well.
  • 15:15: NFS is back, and instances start recovering!
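
As a rough cross-check of the downtime figure in the summary, the interval between the first alert and recovery can be computed from the timeline entries. This is a minimal sketch, assuming the 11:30 Shinken alert marks the start of the outage and the 15:15 entry marks the end; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Timeline timestamps are local time (UTC+2) on 30 Aug 2015.
LOCAL = timezone(timedelta(hours=2))

# Assumption: the outage window runs from the first alert to "NFS is back".
first_alert = datetime(2015, 8, 30, 11, 30, tzinfo=LOCAL)
nfs_restored = datetime(2015, 8, 30, 15, 15, tzinfo=LOCAL)

downtime = nfs_restored - first_alert
first_alert_utc = first_alert.astimezone(timezone.utc)

print(downtime)                     # 3:45:00
print(first_alert_utc.isoformat())  # 2015-08-30T09:30:00+00:00
```

Under these assumptions the outage lasted 3h45m, consistent with the roughly 4h quoted in the summary.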


Action Items

  1. Switch NFS Server to labstore1001 bug T107038
  2. Fix / Test / Replace labstore1002 bug T95293
  3. Better documentation for our NFS setup bug T88723
  4. Fold sync-exports into nfs-exports daemon bug T102520
  5. Verify and add more checks for Labs NFS paging bug T101650
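
One possible input to the additional checks in item 5 is the kernel log itself. The sketch below is not the existing Labs monitoring; it is a hypothetical helper (the name find_soft_lockups and its return shape are assumptions) that scans log lines for the soft lockup message quoted in the 11:33 timeline entry.

```python
import re

# Matches kernel messages of the form quoted in the timeline, e.g.
# "NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:5:17071]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+)\]"
)

def find_soft_lockups(lines):
    """Return a (cpu, seconds, task) tuple for every soft lockup line."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append(
                (int(match["cpu"]), int(match["secs"]), match["task"])
            )
    return hits

# Example with the line from this incident plus an unrelated line.
sample = [
    "NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:5:17071]",
    "nfsd: last server has exited, flushing export cache",
]
print(find_soft_lockups(sample))  # [(5, 22, 'kworker/5:5:17071')]
```

A real check would still need to decide where the lines come from and when a match should page someone; both are left open here.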