Incident documentation/20151216-Labs-NFS

Summary

The primary NFS server for Labs, labstore1001, crashed for unconfirmed reasons, most likely a Linux kernel bug. Kernel LVM limitations made its recovery after the reboot very slow, which was initially misdiagnosed as a hardware failure. After thorough investigation, it was determined that slow boot times are to be expected with the current configuration. Once sufficient time had passed, and with configuration adjustments made to minimize those delays, the system booted normally and service was restored.

https://phabricator.wikimedia.org/tag/incident-labs-nfs-20151216/

Timeline

04:50: labstore1001 enters a soft lockup on a CPU core, in kernel module dm_mod. Labs NFS becomes unavailable.

04:52: Andrew B and Yuvi, while investigating another issue, respond.

04:56: Ori calls Mark

05:00: Mark starts investigating labstore1001

05:10: Mark reboots labstore1001 to get it out of the lockup

05:15: labstore1001 gets stuck during boot while activating LVM2 volumes, without any output. MD RAID reports failure to assemble (some?) arrays.

05:30: Another reboot is attempted, as there is no visible progress after 15+ minutes. Further reboots get stuck in the same way. Mark attempts to disable LVM2 activation during boot. These attempts keep failing until eventually access to a rescue prompt is obtained.

06:14: Mark calls Faidon, Faidon joins the investigation

06:17: Mark logs in over console

06:49: Mark boots the system with systemd.mask=lvm2-activation.service systemd.mask=lvm2-activation-early.service on the kernel command line; the system boots and SSH is available

06:59: Mark tries vgchange -ay manually; reports seeing no output

07:17: Faidon notices that snapshots are getting gradually (but very slowly) activated

07:45: All four tools snapshots finish initialization

08:00: Giuseppe/Mark mount the tools filesystem, journal is recovered, no data loss observed

08:13: Mark drops all snapshots but the latest one (per volume); sets -k to all volumes

08:17: Mark reboots, this time without systemd.mask'ing LVM

08:21: System successfully boots, without activating LVs (as intended); Mark manually activates tools

08:24: Tools LVs finish activation

08:26: All LVs finish activation

08:28: Yuvi runs start-nfs; notices errors about missing filesystem definitions in /etc/fstab (a server misconfiguration)

08:31: Faidon fixes /etc/fstab, re-runs start-nfs

08:45: Yuvi and Giuseppe notice errors in Labs instances, e.g. mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/deployment-prep/project failed, reason given by server: No such file or directory

08:50: Yuvi stops the NFS kernel server again

08:53: Yuvi finds the root cause (sync-exports needs a nfs-kernel-server restart), fixes it and starts the NFS server again

Conclusions

Regular LVM snapshots take a long time to initialize. This is to be expected. Having neither SSH nor serial console access during initialization impedes troubleshooting. Therefore, all LVM logical volumes were set not to activate at boot; instead, system administrators activate them manually right before mounting filesystems and starting NFS (all part of the "start-nfs" script).
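As a rough illustration of the approach described above, the following sketch shows how a logical volume can be flagged to skip activation and later be activated by hand. It is not taken from the actual labstore1001 configuration: the volume name labstore/tools and the helper functions are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: the volume name and helper functions are hypothetical.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flag a logical volume so that normal activation (e.g. at boot) skips it.
# --setactivationskip is the long form of lvchange -k.
skip_at_boot() {
    run lvchange --setactivationskip y "$1"
}

# Activate a flagged volume by hand; -K (--ignoreactivationskip)
# overrides the activation-skip flag for this one command.
activate_now() {
    run lvchange -ay -K "$1"
}

skip_at_boot labstore/tools
activate_now labstore/tools
```

Once a volume carries the skip flag, a plain activation request leaves it inactive, so the manual step has to pass -K explicitly.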

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

  • {{Done}} Set all logical volumes as not to be initialized at boot (-k); adjust start-nfs to activate all LVs before mounting filesystems
  • {{Done}} Add filesystems to labstore1001's /etc/fstab
  • {{Done}} Add set -e to start-nfs to fail early on errors
  • {{Done}} sync-exports needs a nfs-kernel-server restart
  • Status: In progress. Reinstall labstore1002 with the same configuration as labstore1001 (/ not on LVM etc.) (bug T121905)
  • Status: In progress. Fix labstore1002 problems for reboot (bug T98183)
  • Status: In progress. Kernel upgrade for labstore* (bug T121903)
  • Status: In progress. Add step in start-nfs to ask operator to consider dropping some snapshots (bug T121890)
  • Status: In progress. Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot (bug T121629)
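The completed start-nfs items above (fail early with set -e, activate all LVs before mounting, restart nfs-kernel-server) could be combined roughly as follows. This is a sketch under stated assumptions, not the real start-nfs script: the function name, the volume group labstore and the mount points are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical start-nfs-style sequence; not the real script.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The body runs in a subshell so that set -e aborts only this sequence.
start_nfs_sketch() (
    set -e                          # stop at the first failing step
    vg="$1"
    shift
    run vgchange -ay -K "$vg"       # activate LVs despite the skip flag
    for mountpoint in "$@"; do
        run mount "$mountpoint"     # relies on an /etc/fstab entry
    done
    # Restart rather than just start, so the server picks up current exports.
    run systemctl restart nfs-kernel-server
)

start_nfs_sketch labstore /srv/project /srv/others
```

Mounting by mount point alone depends on the matching /etc/fstab entries being present, which is the misconfiguration that surfaced at 08:28.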