Incidents/20160706-CI-Outage

Summary

Twice during the last few weeks, the Labs Nova scheduler has been unable to create new instances. Since the CI system depends on rapid creation and deletion of instances, these periods resulted in CI outages.

The first outage lasted about an hour on 2016-06-26. The second lasted about 2.5 hours on 2016-07-05.

Timeline

  • [2016-06-22] Instances on labvirt1010 are spontaneously shutting down. Andrew deems this to be a result of RAM overcommitment, and removes 1010 from the scheduling pool.
  • [2016-06-22] Instances on labvirt1001 are spontaneously entering the 'error' state. This is the as-yet-undiagnosed bug T137857. Andrew removes 1001 from the scheduling pool as well, and sends an email explaining (poorly) that 1001 and 1010 have been depooled.
  • [ensuing days] Quite a few resource-intensive instances are added to Labs.
  • [2016-06-26] The Labs scheduler can no longer schedule new instances, as all available labvirt nodes have either all of their disk space or all of their RAM committed. For a brief period, Nodepool is unable to create new instances, and CI tests fail to run. Andrew does a dist-upgrade and reboot on labvirt1011, adds it to the scheduling pool to provide headroom, and instructs Labs admins to refrain from creating additional instances or projects.
  • [2016-07-03] Bug T137857 manifests on additional labvirt nodes. Due to Andrew's unclear email earlier, Yuvi interprets this as a scheduling/overcommitting issue and depools all labvirt nodes except labvirt1011. At this point the entire Nodepool/CI setup relies on a single labvirt node, 1011.
  • [2016-07-04] The Nodepool pool is reduced from 20 to 10 instances, since instances were being scheduled on the malfunctioning labvirt nodes whenever the pool exceeded 10 instances (see the configuration sketch after this timeline).
  • [2016-07-05, early morning] Andrew returns from vacation, diagnoses and fixes bug T137857. He is not confident enough about the fix to repool labvirt1001-1009 immediately, though; instead he moves on to the task of trying to get new labvirt nodes online.
  • [2016-07-05, 19:00-19:40] Chris and Andrew work together to deploy labvirt1012, 1013, 1014. This includes some cabling and vlan changes adjacent to the running labvirt nodes. This may be unrelated but is included for completeness due to suspicious timing.
  • [2016-07-05, 19:30] Various icinga alerts start to fire for labvirt1011.
  • [2016-07-05, 19:30 - 22:30] During this period, labvirt1011's behavior is erratic. It is intermittently unreachable via ssh, sometimes unpingable, and sends frequent icinga alerts and recoveries. During some of these flaps, the nova scheduler is unable to contact it. One especially mysterious behavior: ssh connections to labvirt1011 produce an 'unknown host key' warning, as though traffic is being routed to a different host. Since labvirt1011 is the only active labvirt node, CI is unable to schedule new instances during these flaps and locks up intermittently. Andrew and Chase spend a great deal of time trying to diagnose and solve this flapping, but reach no real conclusions.
  • [2016-07-05, 19:50] With CI having trouble spawning instances, the backlog of jobs piles up and the delay for jobs slowly grows.
  • [2016-07-05, 21:40] CI starts processing changes faster than they accumulate, indicating that Labs is back. It takes about 50 minutes to bring down the queue.
  • [2016-07-05, 22:30] Admitting defeat, Andrew depools labvirt1011 and repools labvirt1001-1009, since they no longer seem to be suffering from bug T137857. At this point labvirt1011 is still flapping, but its downtime does not affect CI. CI processes resume normal function.
  • [2016-07-05, 22:30] CI has finished processing the accumulated backlog of jobs.
  • [Graph: Zuul Gearman queues from 19:40 to 21:40 UTC]
  • [2016-07-05, 23:50] Chase reboots labvirt1011. It comes back online and no longer exhibits any flapping behavior. It remains depooled from the scheduler.
  • [2016-07-07, 14:00] Faidon and Moritz identify the cause of the labvirt1011 issues as a DHCP typo that interacted badly with Andrew's attempts to set up labvirt1012-1014. Details are in bug T139555.
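
For reference, the pool-size reduction in the 2016-07-04 entry above corresponds to lowering Nodepool's per-provider instance cap. A minimal sketch of the kind of nodepool.yaml change involved, using assumed provider and label names rather than the actual production configuration:

    # nodepool.yaml (fragment; names are assumptions, not the real config)
    providers:
      - name: wmflabs-eqiad      # hypothetical provider name
        max-servers: 10          # lowered from 20: beyond ~10 instances, Nova spilled onto the malfunctioning labvirt nodes
    labels:
      - name: ci-jessie          # hypothetical label name
        min-ready: 6             # pre-built instances kept ready for CI jobs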

Conclusions

The Labs scheduling failures were the result of three main factors:

  • A bug with the libvirt version running on many of the compute nodes
  • Labs growth outstripping our hardware capacity
  • Insufficiently broad knowledge within the Labs team about Nova (which caused bug T137857 to go undiagnosed and unfixed for the full duration of Andrew's conference and vacation travels, during which time the problem grew ever worse).

The proximate cause of the outage was labvirt1011 starting to flap. Without any of the previous problems, this flapping would not have caused any service interruption. Even so, more thorough code review on https://gerrit.wikimedia.org/r/#/c/297388/ would have prevented this part of the problem.

Prior to 07-03, labvirt1011 was held empty in reserve in case of emergencies. This was a good thing, since it was available to be dropped in to forestall the issue. Once labvirt1011 was brought into service we were already on thin ice, and it was mostly bad luck that something random and weird broke before Andrew was back from vacation and able to restore things to their normal, redundant state.

Actionables

  • Establish targets for Labs VM capacity, and take active steps to ensure we always have surplus available capacity.
  • (Possibly) stop using RAM and disk overprovisioning within the Labs scheduler. Overprovisioning makes it very difficult to predict when a given node will or won't fill up, meaning that nodes often drop out of the scheduling pool by surprise or, worse, suddenly run out of memory due to unexpected and unavoidable behavior of their guest VMs. (A sketch of the relevant nova.conf settings follows below.)
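
If the overprovisioning change is pursued, it amounts to pinning Nova's allocation ratios at 1.0 so the scheduler never promises more RAM or disk than a node physically has. A minimal sketch of the relevant nova.conf settings for the Nova release in use at the time (values shown are illustrative, not current production settings):

    # /etc/nova/nova.conf (fragment; values are illustrative)
    [DEFAULT]
    ram_allocation_ratio = 1.0     # no RAM overcommit: at most 1 GB of guest RAM per 1 GB of physical RAM
    disk_allocation_ratio = 1.0    # no disk overcommit

These ratios are consulted by the RamFilter and DiskFilter scheduler filters, so with a ratio of 1.0 a labvirt node simply drops out of consideration once its physical RAM or disk is fully committed, rather than filling past capacity.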