Incidents/2019-05-09 puppet


Summary

During an upgrade of the codfw fleet to Puppet 5, puppetmaster2001 and puppetmaster2002 went down: installing the new puppet package caused apt to remove the puppet-master, puppet-master-passenger and puppetdb packages, which still depended on Puppet 4.

Impact

The entire codfw server fleet, which synchronizes its catalog from these puppetmasters, had failing Puppet runs.

Detection

  • [11:00:11] <icinga-wm> PROBLEM - DPKG on puppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
  • [11:00:11] <icinga-wm> PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
  • [11:00:15] <icinga-wm> RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
  • [11:00:23] <icinga-wm> PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
  • [11:00:39] <icinga-wm> PROBLEM - puppet last run on elastic2039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
  • [11:00:39] <icinga-wm> PROBLEM - puppetmaster backend https on puppetmaster2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/Puppet%23Debugging

<stream of puppet alerts>

  • [11:02:55] <volans> !log stopped ircecho to avoid spam
  • [11:02:56] <_joe_> uhm something bad happened
  • [11:02:58] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
  • [11:03:21] <jbond42> i updated puppet in codfw to 5
  • [11:03:53] <jynus> was it supposed to be a temporary interruption or a long one?
  • [11:04:19] <volans> the 2001 frontend is logging proxy-server/404
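
The Icinga DPKG check that fired first flags any package that is not cleanly in the installed state. A minimal local approximation of that check (a sketch, not the exact plugin invocation) is:

    # List packages whose dpkg status is anything other than "install ok installed".
    # On puppetmaster2001 this would have shown the removed puppet-master packages.
    dpkg-query -W -f='${Status} ${Package}\n' | grep -v '^install ok installed'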

Timeline

All times UTC.

  • 10:55:37 - jbond ran sudo cumin A:codfw 'apt-get install -y puppet facter'
  • 11:00 stream of puppet failure alerts on Icinga
  • 11:02 < volans> !log stopped ircecho to avoid spam
  • 11:04 puppet disabled everywhere
  • 11:06 we noticed that, because they depend on Puppet 4, the puppet-master, puppet-master-passenger and puppetdb packages had been removed from the puppetmasters and puppetdb servers (see the dry-run sketch after this timeline)
  • 11:09 Tested manual fix on puppetmaster2001: sudo apt-get install facter=2.4.6-1 puppet=4.8.2-5 puppet-master puppet-master-passenger
  • 11:10 < jbond42> [test successfull] on puppetmaster2001 and puppet is running ok there now
  • 11:13 Tested fix on puppetdb2001: apt-get install facter=2.4.6-1 puppet=4.8.2-5 puppetdb; test successful
  • 11:14 Rolled out fix to puppetmaster200{1,2} & puppetdb2001
  • 11:17 moritzm notices that our puppetdb package has Depends: puppet (<< 5.0.0-1puppetlabs) and needs to be rebuilt
  • 11:29 Created a change so that puppetmasters do not receive the new components: https://gerrit.wikimedia.org/r/509040 (later abandoned)
  • 11:41 Created a change using a regex to ensure puppetmasters and puppetdb hosts don't get the new components (see the pinning sketch after this timeline): https://gerrit.wikimedia.org/r/509042
  • 11:53 Pushed the above change
  • 11:55 Cleaned up files deployed during the original change:
         - sudo cumin A:puppetmaster 'rm /etc/apt/sources.list.d/component-facter3.list /etc/apt/sources.list.d/component-puppet5.list'
         - sudo cumin A:puppetdb 'rm /etc/apt/sources.list.d/component-facter3.list /etc/apt/sources.list.d/component-puppet5.list'
         - sudo cumin 'lab*puppetmaster*' 'rm /etc/apt/sources.list.d/component-facter3.list /etc/apt/sources.list.d/component-puppet5.list'
  • 11:57 < jbond42> !log all puppetmasters and puppetdbs should be restored
  • ~12:10 puppet enabled everywhere [not sure of the exact time]
  • 12:18 <@moritzm> I'm running "run-puppet-agent -q --failed-only" on remaining hosts
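
In hindsight the removals were visible in apt's own resolution: puppet-master, puppet-master-passenger and puppetdb all declared Depends: puppet (<< 5.0.0-1puppetlabs), so installing puppet 5 forced apt to remove them. A dry run on a single host before the fleet-wide cumin command (a sketch; exact output varies per host) would have surfaced this:

    # Simulate the upgrade without applying anything; apt lists the packages
    # it would remove to satisfy the new dependency tree.
    sudo apt-get -s install puppet facter
    # Expect output such as:
    #   The following packages will be REMOVED:
    #     puppet-master puppet-master-passenger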
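
The merged mitigation (509042) works at the level of our apt component configuration; an equivalent safeguard with plain apt pinning on the puppetmasters and puppetdb hosts might look like the following (hypothetical file path and pins, shown only to illustrate the idea):

    # /etc/apt/preferences.d/pin-puppet4 (hypothetical)
    # Hold puppet at the 4.x series and facter at the 2.x series until the
    # dependent packages are rebuilt against Puppet 5.
    Package: puppet
    Pin: version 4.*
    Pin-Priority: 1001

    Package: facter
    Pin: version 2.*
    Pin-Priority: 1001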

Conclusions

What weaknesses did we learn about and how can we address them?


What went well?

  • This error and the underlying issue were both identified quickly
  • The team remained calm, and communication in the ops channel was reduced to incident response

What went poorly?

  • The initial update of puppet and facter was not !logged, meaning other SRE engineers were initially unaware of the cause of the issue
  • There was some confusion regarding which daemons needed to be started, i.e. the puppet-master service should not be started, as Apache is the service we care about (see the sketch below)
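
On our Passenger-based puppetmasters, Apache serves the master ports, so the standalone puppet-master daemon must stay stopped. A quick sanity check (a sketch; service names assumed from the packages above) is:

    # Apache (Passenger) should be running and owning ports 8140/8141;
    # the standalone puppet-master daemon should be inactive.
    sudo systemctl status apache2
    sudo systemctl status puppet-master    # expect inactive/dead
    sudo ss -tlnp | grep -E ':814[01]'     # expect apache2, not puppet-master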

Where did we get lucky?


Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.