Incident documentation/20160807-CI

Summary

CI had an outage of roughly 4 hours. Unfortunately, it was caused by a known issue: Nodepool tries to create too many files on the Jenkins master, which exhausts the available inodes.
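Inode exhaustion does not show up in a plain disk-space check, so it can help to look at inode usage directly. The following is a minimal sketch, assuming GNU coreutils `df` and a POSIX `awk`. The function name `inode_usage_percent` and the default path are illustrative and were not part of the incident response.

```shell
#!/bin/sh
# Sketch (assumption: GNU df). Print the inode usage percentage of the
# filesystem that holds the given path, without the trailing "%".
# Some filesystems do not report inode counts; df then prints "-".
inode_usage_percent() {
    # -P keeps each filesystem on one line; -i reports inodes instead of blocks.
    # Column 5 of the data row is IUse%.
    df -P -i "${1:-/var/lib/jenkins}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `inode_usage_percent /var/lib/jenkins` could be run from a monitoring check that alerts once the value passes a chosen threshold.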

Timeline

  • 11:01 < Amir1> Zuul seems to be extremely slow: https://integration.wikimedia.org/zuul/
  • 11:03 < paladox> Hi nodepool seems to be down in zuul
  • 11:08 <+ Reedy> Aug 07 11:07:54 labnodepool1001 nodepoold[16727]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 11:15 <+ Reedy> I'm not restarting nodepool on a whim
  • 11:16 <+ Reedy> I'll text hashar
  • 11:35 - Reedy texted Antoine
  • 11:39 <+ hashar> Reedy: paladox around :)
  • Diagnosis, and attempts to fix the problem by deleting unused Nodepool instances that were stuck
  • 12:30 - Antoine started deleting files in /var/lib/jenkins/config-history/config
    • ssh gallium find /var/lib/jenkins/config-history/config/nodes -path '*_deleted_*' -delete
  • 12:41 <+ hashar> Reedy: paladox ci back

Conclusions

  • We need to clean up unused config files on a schedule

Actionables

  • Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected (Task T126552)
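A scheduled cleanup could reuse the `find` expression from the timeline. The following is a minimal sketch, not the fix tracked in the task. It assumes a `find` that supports `-delete` and `-empty` (GNU or BSD). The function name `cleanup_deleted_node_history` and the 7-day retention default are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: remove config-history entries left behind by deleted nodes.
# Assumptions: find supports -delete and -empty (GNU/BSD find);
# the 7-day retention default is illustrative, not taken from the incident.
cleanup_deleted_node_history() {
    config_dir=${1:-/var/lib/jenkins/config-history/config/nodes}
    max_age_days=${2:-7}

    # Refuse to run against a path that does not exist.
    [ -d "$config_dir" ] || return 1

    # Pass 1: delete files under *_deleted_* paths that are older than the retention period.
    find "$config_dir" -type f -path '*_deleted_*' -mtime +"$max_age_days" -delete || return 1

    # Pass 2: delete the now-empty *_deleted_* directories, deepest first.
    find "$config_dir" -depth -type d -path '*_deleted_*' -empty -delete
}
```

The function could be called from a daily cron job or systemd timer on the Jenkins master. Running the two `find` commands without `-delete` first shows what would be removed.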