Incidents/2018-02-28 puppetmaster

Summary

The puppet master service suffered unavailability in eqiad, causing puppet run failures on eqiad hosts.

Timeline

  • 17:38 Filippo merges change 415341 to depool rhodium from puppet master duties and repool puppetmaster1002
  • 17:40 Filippo runs puppet-merge on puppetmaster1001 to make the change above effective in the apache configuration
  • 17:45 Spam from icinga-wm about puppet failures begins
  • 17:47 nitrogen is suspected of being the culprit of the puppet failure spam
  • 17:53 icinga-wm is stopped due to spam
  • 17:57:06 apache2 is restarted on puppetmaster1002
  • 17:57:55 puppetmaster1002 is allowed to talk to puppetdb, following a puppet run on the puppetdb host
  • 18:12 A puppet run is forced in eqiad on hosts with failed puppet runs
  • 18:14 puppetmaster1002 has its full load back
  • 18:29 ircecho / icinga-wm is restarted and joins IRC again

Conclusions

The root cause was a change assumed to be safe that wasn't: repooling a puppet master backend was expected to take full effect with only a puppet run on the puppet master frontend. However, the list of backends also controls access to puppetdb; in other words, a puppet run on the puppetdb host is needed to grant a backend access before it is repooled (see the sketch below).
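To illustrate the coupling, below is a minimal, hypothetical Puppet sketch; the class, Hiera key, and resource names are assumptions for illustration, not the actual manifests. The point it shows: a single backend list feeds both the frontend proxy configuration and the puppetdb firewall rule, so the puppetdb host needs a puppet run of its own before a backend added to the list can serve agents.

 # Hypothetical Hiera data: one list drives both sides.
 # puppetmaster::servers:
 #   - puppetmaster1001.eqiad.wmnet
 #   - puppetmaster1002.eqiad.wmnet

 # Frontend: the backends are rendered into the apache proxy config;
 # a puppet run on the frontend picks up the new pool.
 class profile::puppetmaster::frontend (
   Array[Stdlib::Fqdn] $servers = lookup('puppetmaster::servers'),
 ) {
   httpd::site { 'puppetmaster':
     content => template('puppetmaster/frontend.conf.erb'),
   }
 }

 # puppetdb host: the SAME list gates which hosts may connect, so a
 # puppet run must also happen here before a newly pooled backend works.
 class profile::puppetdb::firewall (
   Array[Stdlib::Fqdn] $servers = lookup('puppetmaster::servers'),
 ) {
   ferm::service { 'puppetdb':
     proto  => 'tcp',
     port   => 8081,
     srange => "@resolve((${join($servers, ' ')}))",
   }
 }

This matches the timeline above: puppetmaster1002 was only able to talk to puppetdb at 17:57:55, once a puppet run had refreshed the puppetdb-side access rule.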

Multiple factors complicated troubleshooting. Alert spam from the puppet failures led to muting IRC alert notifications, potentially masking other, unrelated problems. In addition, long-standing issues with puppetdb running out of memory (task T170740) were suspected first, which further slowed down troubleshooting.

Actionables