Incidents/20150625-LabsNFSOutage
Appearance
(Redirected from Incident documentation/20150625-LabsNFSOutage)
Summary
There was a 12 minute Labs NFS outage (and subsequently Tools outage) caused by human error (reboot
entered on the wrong terminal accidentally). The system came back up, the arrays were assembled and NFS was back up in 12 minutes.
Timeline
16:35: Yuvi accidentally types reboot
into labstore1002
instead of a labs instance awaiting NFS removal.
16:38: Machine comes back up, Coren starts assembling arrays and bringing the NFS server back up
16:47: Instances all recovered from NFS stall, everything working ok
Actionables
- Feed Yuvi to a Moose Not done Yuvi, we love you!
- Install the molly-guard package on production hosts https://phabricator.wikimedia.org/T103873 Done