Incidents/20160807-CI
Summary
CI had an outage of roughly 4 hours, unfortunately due to a known issue where Nodepool creates too many files on the Jenkins master and thus exhausts its inodes.
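Inode exhaustion is easy to miss because free disk space can still look healthy. A minimal sketch of a check, assuming a df that accepts -i and -P (for example GNU coreutils); the helper name inode_use_percent is hypothetical and was not part of the incident response:

```shell
#!/bin/sh
# Hypothetical helper: print the inode usage percentage of the
# filesystem that contains the given path.
# Assumes df supports -i (inode counts) and -P (one line per filesystem).
inode_use_percent() {
    # Line 1 is the header; line 2 describes the filesystem holding "$1".
    # Field 5 is IUse%, so strip the trailing percent sign.
    # Filesystems that do not report inode counts print "-" here.
    df -iP "$1" | awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}
```

On the Jenkins master this could be called as `inode_use_percent /var/lib/jenkins` and compared against an alerting threshold of your choosing.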
Timeline
- 11:01 < Amir1> Zuul seems to be extremely slow: https://integration.wikimedia.org/zuul/
- 11:03 < paladox> Hi nodepool seems to be down in zuul
- 11:08 <+ Reedy> Aug 07 11:07:54 labnodepool1001 nodepoold[16727]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
- 11:15 <+ Reedy> I'm not restarting nodepool on a whim
- 11:16 <+ Reedy> I'll text hashar
- 11:35 - Reedy texted Antoine
- 11:39 <+ hashar> Reedy: paladox around :)
- Diagnosis and attempts to fix by deleting unused Nodepool instances that were stuck
- 12:30 - Antoine started deleting files in /var/lib/jenkins/config-history/config
- ssh gallium find /var/lib/jenkins/config-history/config/nodes -path '*_deleted_*' -delete
- 12:41 <+ hashar> Reedy: paladox ci back
Conclusions
- We need to clean up unused config files on a schedule
Actionables
- Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected - (task T126552)
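The manual cleanup from the timeline could be wrapped for scheduled use along these lines. This is only a sketch, not the fix tracked in the task: the function name prune_deleted_history is hypothetical, and it assumes a find that supports -delete (GNU or BSD), as the command used during the incident did.

```shell
#!/bin/sh
# Hypothetical helper: remove config-history entries left behind by
# deleted nodes, using the same '*_deleted_*' pattern as the manual fix.
prune_deleted_history() {
    history_dir=$1

    # Refuse to run against a missing directory rather than letting
    # find fail noisily from a scheduled job.
    if [ ! -d "$history_dir" ]; then
        echo "no such directory: $history_dir" >&2
        return 1
    fi

    # -delete implies depth-first traversal, so files inside a matching
    # directory are removed before the directory itself.
    find "$history_dir" -path '*_deleted_*' -delete
}
```

For example, a cron job on the Jenkins master might call `prune_deleted_history /var/lib/jenkins/config-history/config/nodes`; how often to run it is left open here.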