Incident documentation/20160810-CI


On August 10 most of CI slowed to a near halt starting at (roughly) noon Pacific time. This was due to issues between Nodepool (which now runs a large portion of the tests) and our OpenStack infrastructure.

Timeline

Graphs

  • 2016-08-10 21:09: Paladox and greg-g discuss CI being slow
  • 2016-08-10 21:10: thcipriani logs onto labnodepool1001 and checks, instances seem to be building according to nodepool list
  • 2016-08-10 21:36: legoktm notes that there seem to be no trusty nodes being built by nodepool
  • 2016-08-10 21:38: thcipriani logs into labnodepool1001 to check, there are, nominally, trusty nodes being built by nodepool
  • Nodepool continues to attempt to build instances, but those instances are not being used by jobs
  • 2016-08-10 21:59: <thcipriani> restarted nodepool, no trusty instances were being used by jobs
    • This fixed the issue temporarily, building of instances continues VERY SLOWLY
  • 2016-08-10 22:54: instance creation/deletion in nodepool seems very stuck
  • 2016-08-10 22:56: <andrewbogott> thcipriani: that's one of the labvirt hosts acting up, I think yuvi is going to depool it
  • 2016-08-10 23:40: <legoktm> deploying https://gerrit.wikimedia.org/r/304131 - Temporarily move composer-hhvm/php5 jobs off of nodepool
  • 2016-08-10 23:47: <thcipriani> stopping nodepool to clean up
  • 2016-08-10 23:56: <legoktm> deploying https://gerrit.wikimedia.org/r/304149 - Move mediawiki-core-phpcs off of nodepool
  • 2016-08-11 00:02: < icinga-wm> RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
    • Puppet restarts nodepool, things seem a lot better

Causes

  • Seemingly slow openstack kvm machine
  • After the machine was depooled, nodepool still did not handle it gracefully
    • It continued trying to delete compute instances that may no longer have been reachable
    • Persistent exception noise in the logs stated that nodepool had reached its quota (10 nodes) even though nodepool list showed only 6 nodes

Findings

  • Ended up switching many nodepool jobs back to integration-project VMs, which sat mostly idle for the entire outage
  • Nodepool has no good way of reconciling its state with OpenStack, which can exacerbate any OpenStack compute issue
    • nodepool keeps its own internal count of the instances it has spawned; that count can evidently drift from the number OpenStack tracks, leading nodepool to hammer the OpenStack API and get 403s in response
  • OpenStack seemingly has a bug that causes it to misreport the actual number of instances a project has
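The drift described above can be sketched in miniature. This is illustrative Python only, not nodepool's real implementation; the class names are invented, and the counts (quota of 10, 6 visible nodes) are taken from this incident's logs:

```python
# Illustrative sketch of the state-drift failure mode: the launcher
# trusts its own counter, the cloud enforces quota against its own
# (different) count, and the launcher issues requests that can never
# succeed. Class names and numbers are assumptions, not nodepool code.

QUOTA = 10  # per-project instance quota (10 nodes in this incident)

class Cloud:
    """Stands in for OpenStack: enforces quota against ITS count."""
    def __init__(self, instances):
        self.instances = instances  # the cloud's authoritative count

    def create_server(self):
        if self.instances >= QUOTA:
            raise PermissionError("403: quota exceeded")  # what nodepool saw
        self.instances += 1

class Launcher:
    """Stands in for nodepool: tracks its OWN count of spawned nodes."""
    def __init__(self, cloud):
        self.cloud = cloud
        self.known_nodes = 6  # what `nodepool list` showed
        self.errors = 0

    def fill_quota(self):
        # The launcher believes it has headroom (6 < 10), so it tries to
        # build; the cloud believes the quota is already fully consumed
        # (e.g. stuck deletes on the ailing host never freed any quota).
        while self.known_nodes < QUOTA:
            try:
                self.cloud.create_server()
                self.known_nodes += 1
            except PermissionError:
                self.errors += 1
                break  # the real daemon retried in a loop: persistent log noise

cloud = Cloud(instances=10)
launcher = Launcher(cloud)
launcher.fill_quota()
# Launcher still sees only 6 nodes, yet every create is rejected with a 403.
```

The general fix direction is to treat the cloud's reported count (or a periodic reconciliation against it) as authoritative, rather than the launcher's local counter.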

Actions going forward

  • Task tracking the follow-ups: Task T142952
  • Leave jobs running on integration nodes until there is time to reassess
  • Investigate whether nodepool is the correct solution in the long term
  • Task T143016 - Investigate 1 sec delay in requesting new instances
  • Task T115194 - Some labs instance IPs have multiple PTR entries in DNS - unrelated
    • We believe nodepool exacerbates this issue, so investigate what effect nodepool is having
  • Task T143013 - Investigate need for nodepool upgrade following openstack upgrade
  • Task T143018 - OpenStack seemingly has a bug that causes it to misreport the actual number of instances a project has