Incidents/20150519-LabsOutage
Summary
Labvirt1006 failed at around 18:20 on May 19th. All hosted instances became unresponsive. The system was rebooted and all instances were restarted; normal service resumed by 18:45.
Timeline
- [ 16:00 ] Andrew restarts a script that suspends/resumes instances affected by the Venom issue. The same script was also running when Incident_documentation/20150518-LabsOutage occurred.
- [ 18:20 ] Shinken sent 'host down' alerts for a couple of tools instances. Yuvi notes that all of them are on labvirt1006 and that this is probably a repeat of Incident_documentation/20150518-LabsOutage
- [ 18:31 ] Yuvi reboots labvirt1006
- [ 18:36 ] labvirt1006 comes up after POST
- [ 18:40 ] Yuvi runs a scripted 'start' of each instance formerly running on labvirt1006 (see the sketch after this timeline)
- [ 18:45 ] All instances have resumed normal operation
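
The recovery step at 18:40 amounted to starting every Nova instance that had been scheduled on the failed hypervisor. Below is a minimal sketch of what such a scripted 'start' could look like. The host FQDN, the use of the openstack CLI, and admin credentials already present in the environment are assumptions; this is not the actual script that was run during the incident.

 #!/usr/bin/env python3
 """Hypothetical sketch: start every Nova instance scheduled on a given
 hypervisor. Host name, CLI choice and credentials are assumptions."""
 import subprocess
 
 HYPERVISOR = "labvirt1006.eqiad.wmnet"  # assumed FQDN of the failed host
 
 def list_instances(host):
     """Return the IDs of all instances scheduled on the given hypervisor."""
     out = subprocess.run(
         ["openstack", "server", "list", "--all-projects",
          "--host", host, "-f", "value", "-c", "ID"],
         check=True, capture_output=True, text=True,
     ).stdout
     return [line.strip() for line in out.splitlines() if line.strip()]
 
 def start_instance(instance_id):
     """Issue a 'start'; report failures (e.g. instances already ACTIVE)."""
     result = subprocess.run(
         ["openstack", "server", "start", instance_id],
         capture_output=True, text=True,
     )
     if result.returncode != 0:
         print("start failed for %s: %s" % (instance_id, result.stderr.strip()))
 
 if __name__ == "__main__":
     for instance_id in list_instances(HYPERVISOR):
         start_instance(instance_id)

At the time, the equivalent nova client commands would more likely have been used; the shape of the loop is the same, and instances that are already running simply report an error and are skipped.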
Aftermath
bblack noted that this is probably a kernel issue: the kernel log contained a general protection fault (GPF) related to XFS and another related to virtual memory (log saved in /home/yuvipanda/kernlog-20150519-outage on labvirt1006). The virtual memory subsystem reportedly has significant problems in kernel series before 3.19, so this may be related.
Action items
- Make Tools Redis (the toollabs Redis was the only affected service) hot-swappable as well - https://phabricator.wikimedia.org/T99737
- Investigate the kernel issues and find a potential solution - https://phabricator.wikimedia.org/T99738