Incident documentation/20160810-CI


On August 10 most of CI slowed to a near halt starting at (roughly) noon Pacific time. This was due to issues between Nodepool (which now runs a large portion of the tests) and our OpenStack infrastructure.

Timeline

Graphs

  • 2016-08-10 21:09: Paladox and greg-g discuss CI being slow
  • 2016-08-10 21:10: thcipriani logs onto labnodepool1001 and checks, instances seem to be building according to nodepool list
  • 2016-08-10 21:36: legoktm notes that there seem to be no trusty nodes being built by nodepool
  • 2016-08-10 21:38: thcipriani logs into labnodepool1001 to check, there are, nominally, trusty nodes being built by nodepool
  • Nodepool continues to attempt to build instances, but those instances are not being used by jobs
  • 2016-08-10 21:59: <thcipriani> restarted nodepool, no trusty instances were being used by jobs
    • This fixed the issue temporarily, building of instances continues VERY SLOWLY
  • 2016-08-10 22:54: instance creation/deletion in nodepool seems very stuck
  • 2016-08-10 22:56: <andrewbogott> thcipriani: that's one of the labvirt hosts acting up, I think yuvi is going to depool it
  • 2016-08-10 23:40: <legoktm> deploying https://gerrit.wikimedia.org/r/304131 - Temporarily move composer-hhvm/php5 jobs off of nodepool
  • 2016-08-10 23:47: <thcipriani> stopping nodepool to clean up
  • 2016-08-10 23:56: <legoktm> deploying https://gerrit.wikimedia.org/r/304149 - Move mediawiki-core-phpcs off of nodepool
  • 2016-08-11 00:02: < icinga-wm> RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
    • Puppet restarts nodepool, things seem a lot better

Causes

  • Seemingly slow openstack kvm machine
  • After the machine was depooled, nodepool still did not handle it gracefully
    • It continued trying to delete compute instances that may no longer have been reachable
    • Persistent exception noise in the logs stated that nodepool had reached its quota (10 nodes) even though nodepool list showed only 6 nodes

Findings

  • Ended up switching many nodepool jobs back to integration-project VMs, which sat mostly idle for the entire outage
  • Nodepool has no good way of reconciling its state with OpenStack, which can exacerbate any OpenStack compute issue
    • nodepool keeps its own internal count of the instances it has spawned; that count can evidently drift from the number OpenStack tracks, leading nodepool to hammer the OpenStack API and get 403s in response
  • OpenStack seemingly has a bug that causes it to misreport the actual number of instances a project has
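The drift described above can be sketched in miniature. This is illustrative Python only, not nodepool's real implementation; the class names are invented, and the counts (quota of 10, 6 visible nodes) are taken from this incident's logs:

```python
# Illustrative sketch of the state-drift failure mode: the launcher
# trusts its own counter, the cloud enforces quota against its own
# (different) count, and the launcher issues requests that can never
# succeed. Class names and numbers are assumptions, not nodepool code.

QUOTA = 10  # per-project instance quota (10 nodes in this incident)

class Cloud:
    """Stands in for OpenStack: enforces quota against ITS count."""
    def __init__(self, instances):
        self.instances = instances  # the cloud's authoritative count

    def create_server(self):
        if self.instances >= QUOTA:
            raise PermissionError("403: quota exceeded")  # what nodepool saw
        self.instances += 1

class Launcher:
    """Stands in for nodepool: tracks its OWN count of spawned nodes."""
    def __init__(self, cloud):
        self.cloud = cloud
        self.known_nodes = 6  # what `nodepool list` showed
        self.errors = 0

    def fill_quota(self):
        # The launcher believes it has headroom (6 < 10), so it tries to
        # build; the cloud believes the quota is already fully consumed
        # (e.g. stuck deletes on the ailing host never freed any quota).
        while self.known_nodes < QUOTA:
            try:
                self.cloud.create_server()
                self.known_nodes += 1
            except PermissionError:
                self.errors += 1
                break  # the real daemon retried in a loop: persistent log noise

cloud = Cloud(instances=10)
launcher = Launcher(cloud)
launcher.fill_quota()
# Launcher still sees only 6 nodes, yet every create is rejected with a 403.
```

The general fix direction is to treat the cloud's reported count (or a periodic reconciliation against it) as authoritative, rather than the launcher's local counter.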

Actions going forward

  • Task tracking the follow-ups: Task T142952
  • Leave jobs running on integration nodes until there is time to reassess
  • Investigate whether nodepool is the correct solution in the long term
  • Task T143016 - Investigate 1 sec delay in requesting new instances
  • Task T115194 - Some labs instance IPs have multiple PTR entries in DNS - unrelated
    • We believe nodepool exacerbates this issue, so investigate what effect nodepool is having
  • Task T143013 - Investigate need for nodepool upgrade following openstack upgrade
  • Task T143018 - OpenStack seemingly has a bug that causes it to misreport the actual number of instances a project has