Incidents/20150519-LabsOutage

Summary

Labvirt1006 failed at around 18:20 on May 19th. All hosted instances became unresponsive. The system was rebooted and all instances restarted; normal service was resumed by 18:40.

Timeline

[ 16:00 ] Andrew restarts a script that suspends/resumes instances affected by the Venom issue. The same script was also running when Incident_documentation/20150518-LabsOutage occurred.

[ 18:20 ] Shinken sent 'host down' alerts for a couple of tools instances. Yuvi notes that all of them are on labvirt1006 and this is probably a repeat of Incident_documentation/20150518-LabsOutage.

[ 18:31 ] Yuvi reboots labvirt1006

[ 18:36 ] labvirt1006 comes up after POST

[ 18:40 ] Yuvi runs a scripted 'start' of each instance formerly running on labvirt1006

[ 18:45 ] All instances have resumed normal operation

Aftermath

bblack noted that this is probably a kernel issue - the kernel log showed a general protection fault (GPF) related to XFS and another about virtual memory (log in /home/yuvipanda/kernlog-20150519-outage on labvirt1006). Apparently the virtual memory subsystems are kind of terrible in kernel series until 3.19, so this might be related.

Action items

Make Tools Redis (toollabs redis was the only affected service) hot swappable as well - https://phabricator.wikimedia.org/T99737
Investigate the kernel issues and find a potential solution - https://phabricator.wikimedia.org/T99738