Incidents/2018-02-28 puppetmaster

Summary

The puppet master service suffered unavailability in eqiad, causing puppet run failures on eqiad hosts.

Timeline

  • 17:38 Filippo merges change 415341 to depool rhodium from puppet master duties and repool puppetmaster1002
  • 17:40 Filippo runs puppet-merge on puppetmaster1001 to make the change above effective in the apache configuration
  • 17:45 Spam from icinga-wm about puppet failures begins
  • 17:47 nitrogen is suspected of being the culprit of the puppet failure spam
  • 17:53 icinga-wm is stopped due to spam
  • 17:57:06 apache2 is restarted on puppetmaster1002
  • 17:57:55 puppetmaster1002 is allowed to talk to puppetdb, following a puppet run on the puppetdb host
  • 18:12 A puppet run is forced in eqiad on hosts with failed puppet runs
  • 18:14 puppetmaster1002 has its full load back
  • 18:29 ircecho / icinga-wm is restarted and joins IRC again

Conclusions

The root cause was a change assumed to be safe that wasn't: repooling a puppet master backend was expected to take full effect with only a puppet run on the puppet master frontend. However, the list of backends also controls access to puppetdb; in other words, a puppet run on the puppetdb host is needed to grant a backend access before it is repooled (see the sketch below).
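To illustrate the coupling, below is a minimal, hypothetical Puppet sketch; the class, Hiera key, and resource names are assumptions for illustration, not the actual manifests. The point it shows: a single backend list feeds both the frontend proxy configuration and the puppetdb firewall rule, so the puppetdb host needs a puppet run of its own before a backend added to the list can serve agents.

 # Hypothetical Hiera data: one list drives both sides.
 # puppetmaster::servers:
 #   - puppetmaster1001.eqiad.wmnet
 #   - puppetmaster1002.eqiad.wmnet

 # Frontend: the backends are rendered into the apache proxy config;
 # a puppet run on the frontend picks up the new pool.
 class profile::puppetmaster::frontend (
   Array[Stdlib::Fqdn] $servers = lookup('puppetmaster::servers'),
 ) {
   httpd::site { 'puppetmaster':
     content => template('puppetmaster/frontend.conf.erb'),
   }
 }

 # puppetdb host: the SAME list gates which hosts may connect, so a
 # puppet run must also happen here before a newly pooled backend works.
 class profile::puppetdb::firewall (
   Array[Stdlib::Fqdn] $servers = lookup('puppetmaster::servers'),
 ) {
   ferm::service { 'puppetdb':
     proto  => 'tcp',
     port   => 8081,
     srange => "@resolve((${join($servers, ' ')}))",
   }
 }

This matches the timeline above: puppetmaster1002 was only able to talk to puppetdb at 17:57:55, once a puppet run had refreshed the puppetdb-side access rule.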

Multiple factors complicated troubleshooting. Alert spam from the puppet failures led to muting IRC alert notifications, potentially masking other, unrelated problems. In addition, long-standing issues with puppetdb running out of memory (task T170740) were suspected first, which further slowed down troubleshooting.

Actionables