Incident documentation/20150519-LabsOutage


Summary

Labvirt1006 failed at around 18:20 on May 19th. All hosted instances became unresponsive. The system was rebooted and all instances restarted; normal service was resumed by 18:40.

Timeline

  • [ 18:31 ] Yuvi reboots labvirt1006
  • [ 18:36 ] labvirt1006 comes up after POST
  • [ 18:40 ] Yuvi runs a scripted 'start' of each instance formerly running on labvirt1006
  • [ 18:45 ] All instances have resumed normal operation
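
The scripted 'start' in the timeline is not recorded in this report. As a rough sketch of what such a script could look like, the Python below builds and optionally runs one start command per instance. The use of `nova start` as the command, the function names, and the instance IDs passed in are all assumptions for illustration, not the script that was actually run.

```python
import subprocess
from typing import Iterable, List, Sequence


def build_start_commands(
    instance_ids: Iterable[str],
    base_cmd: Sequence[str] = ("nova", "start"),  # assumption: CLI used to start instances
) -> List[List[str]]:
    """Return one argv list per instance ID, skipping blank entries."""
    return [list(base_cmd) + [i.strip()] for i in instance_ids if i.strip()]


def start_instances(instance_ids: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Start each instance in turn; with dry_run=True only print the commands."""
    commands = build_start_commands(instance_ids)
    for argv in commands:
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            # check=False so that one failed start does not abort the rest
            subprocess.run(argv, check=False)
    return commands
```

Defaulting to a dry run makes it possible to review the list of instances before anything is started on a host that has just rebooted.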

Aftermath

bblack noted that this is probably a kernel issue: the kernel log contains a general protection fault (GPF) related to XFS and another related to virtual memory (log saved in /home/yuvipanda/kernlog-20150519-outage on labvirt1006). The virtual memory subsystem is reportedly unreliable in kernel series until 3.19, so this might be related.
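
To locate the faults in a saved kernel log, a small filter along these lines may help. It is a sketch only: it assumes the kernel reports these events with the phrase "general protection fault", and the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the kernel logs these events with this phrase.
GPF_PATTERN = re.compile(r"general protection fault", re.IGNORECASE)


def find_gpf_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that mentions a GPF."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if GPF_PATTERN.search(line)
    ]
```

Passing an open file object for the saved log to `find_gpf_lines` yields the line numbers to inspect, which can then be read in context for the XFS and virtual memory stack traces.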

Action items