Incidents/20181212-icinga-down

Summary

Icinga was unable to reload because it got stuck parsing the existing state from the retention.dat file, which stopped all monitoring.

How was this noticed?

When the first change of the day that touched any of the Icinga config files was applied, Puppet triggered an Icinga reload, and the UI started showing an alert that the status.dat file was outdated, since Icinga was no longer rewriting it every few seconds as it should.

Impact

All internal alerting was ineffective (no alerts would have been sent to IRC or to pagers) while Icinga was down, between 09:46:27 and 11:15:07 UTC.

The impact was severe, since we were essentially blind to failures in other services during that window.

Detailed explanation of the issue

retention.dat grew too big because of a massive number of downtimes submitted to Icinga on 2018-12-11 around 19:20.

Results of an analysis of the retention.dat file for downtime and comment objects (only the relevant parts are reported; a sketch of how such counts can be reproduced follows the listing):

 Found 31 hosts with objects of type hostcomment
 - Found 3182 objects for host logstash2001
 - Found 3182 objects for host logstash2002
 - Found 3182 objects for host logstash2003
 - Found 3182 objects for host logstash2004
 - Found 3180 objects for host logstash2005
 - Found 3180 objects for host logstash2006
 Found 31 hosts with objects of type hostdowntime
 - Found 3182 objects for host logstash2002
 - Found 3182 objects for host logstash2003
 - Found 3182 objects for host logstash2004
 - Found 3180 objects for host logstash2005
 - Found 3180 objects for host logstash2006
 Found 68 hosts with objects of type servicedowntime
 - Found 28220 objects for host logstash2001
 - Found 28220 objects for host logstash2002
 - Found 28220 objects for host logstash2003
 - Found 20683 objects for host logstash2004
 - Found 20670 objects for host logstash2005
 - Found 20670 objects for host logstash2006
 Found 120 hosts with objects of type servicecomment
 - Found 28220 objects for host logstash2001
 - Found 28220 objects for host logstash2002
 - Found 28220 objects for host logstash2003
 - Found 20683 objects for host logstash2004
 - Found 20670 objects for host logstash2005
 - Found 20670 objects for host logstash2006
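
A sketch of how such per-host counts can be reproduced, as an after-the-fact illustration only: it assumes the standard Icinga 1 retention.dat layout of "<type> {" blocks containing "host_name=<host>" lines, and is not the exact script used for the analysis above.

 awk '
   /^[a-z]+[[:space:]]*\{/ { type = $1 }        # start of an object block, e.g. "servicedowntime {"
   /^[[:space:]]*host_name=/ {
       split($0, kv, "=")
       count[type "/" kv[2]]++                  # objects per (type, host) pair
   }
   END { for (key in count) print count[key], key }
 ' /var/lib/icinga/retention.dat | sort -rn | head -30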

The icinga-downtime script was run in a loop to prevent alerts from going out while new services were added by Puppet (services expected to be in CRITICAL state when added), which would explain the large number of downtimes submitted (see the sketch below the approximate time range).

Approximate start: 2018-12-11T14:40:48

Approximate end: 2018-12-11T22:59:14
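
For context, each icinga-downtime invocation ultimately boils down to SCHEDULE_*_DOWNTIME external commands written to the Icinga command pipe. The sketch below is an assumption about what a single loop iteration amounts to (it is not the script's actual implementation); since Icinga keeps every submitted downtime as a separate object (see actionable 4), hours of looping over the same six hosts translate directly into the tens of thousands of objects counted above.

 # Hypothetical approximation of one iteration; host, duration, author and
 # comment are taken from this report, the command format from the standard
 # Icinga 1 external command interface.
 now=$(date +%s); end=$((now + 86400))
 printf '[%d] SCHEDULE_HOST_DOWNTIME;logstash2001;%d;%d;1;0;86400;marvin-bot;building new host\n' \
     "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd
 printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;logstash2001;%d;%d;1;0;86400;marvin-bot;building new host\n' \
     "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd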

Timeline (all timestamps are in UTC)

  • 10:20 SRE: Not sure what this is about on Icinga, but there is this on the UI: Warning: Status data OUTDATED! Last status data update was 960 seconds ago!
  • 10:22 SRE team starts investigating
  • 10:39 Icinga didn't like the last reload and is now refusing to restart.
  • 10:43 Icinga is at 100% CPU on a single core; strace shows that it is polling /var/lib/icinga/rw/icinga.cmd, which is the command pipe (expected behaviour, so a red herring)
  • 10:44 we detected that icinga2001 is also broken, so a config change could be the cause.
  • 10:48 running perf top, we found out that it is trying to read downtime comments.
  • 10:50 checking the Icinga objects directory, it doesn't look like there is a bogus comment
  • 10:51 the last downtime is allegedly from 09:20:39 UTC today
  • 11:07 we found out that marvin-bot, the author set by the icinga-downtime script, had massively downtimed production hosts.
  • 11:15 all the downtimes were for logstash2*; root@icinga1001:/var/log# fgrep icinga-downtime auth.log | grep logstash2 | wc -l => 9551
  • 11:15 we found a bash script with the following content, which could be the cause of the massive downtiming:
 # downtime.sh
 #!/bin/bash
 for host in `seq -f logstash%g 2001 2006`
   do sudo /usr/local/bin/icinga-downtime -h $host -d 86400 -r 'building new host'
 done
  • 11:16 removed the last downtimes from retention.dat; the file went from ~160MB to ~60MB. Restarted Icinga successfully (see the cleanup sketch after the timeline).
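
The cleanup at 11:16 was done by hand on icinga1001. The following is only a hedged reconstruction of that kind of surgery, assuming the Icinga 1 block format, that all logstash2* downtime/comment objects can be dropped, and that the service is managed by systemd under the name "icinga"; it is not the exact procedure used during the incident.

 # Run as root. Stop Icinga first: it rewrites retention.dat on shutdown,
 # so edits made while the daemon runs would be overwritten.
 systemctl stop icinga
 cd /var/lib/icinga
 cp retention.dat retention.dat.bak
 awk '
   # Buffer downtime/comment blocks and drop them if they belong to logstash2*.
   /^(hostdowntime|servicedowntime|hostcomment|servicecomment)[[:space:]]*\{/ {
       inblock = 1; drop = 0; buf = $0 ORS; next
   }
   inblock {
       buf = buf $0 ORS
       if ($0 ~ /^[[:space:]]*host_name=logstash2/) drop = 1
       if ($0 ~ /^[[:space:]]*\}/) { if (!drop) printf "%s", buf; inblock = 0 }
       next
   }
   { print }                                    # everything else passes through untouched
 ' retention.dat.bak > retention.dat
 systemctl start icinga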

Conclusions

Actionables

  1. SRE should write a cookbook for downtiming hosts that accepts a Cumin query, instead of using ad-hoc bash scripts; it would prevent unwanted logic faults and could be reused. SRE has already written the Icinga downtime script, so should that instead be updated to handle existing downtimes better?
  2. Cookbook runs should then be sent to logstash to keep an audit log, once logstash is able to hold sensitive logs. The same question applies to the existing Icinga downtime script. As long as we're not shipping any personal information it should be OK to do this today; we could use the existing programname already shipped to logstash, or add another one for scripts/cookbooks.
  3. External monitoring should page when the Icinga UI is up but showing errors (see the sketch after this list).
  4. Is there an improvement to be made to the Icinga configuration to prevent this condition? Adding downtime objects for a host/service already in downtime seems like a bug; I would expect the existing downtime to be overwritten. No, Icinga doesn't support this: it allows the same object to be downtimed multiple times with different times and comments.
  5. Is it possible to make a better mechanism for disabling alerts on hosts that are going through their initial provisioning process? Yes, we have a Hiera parameter that can be set, but none of this is needed when using the reimage script as described in Server Lifecycle#Reimage.
  6. Is it possible for Icinga to notify on CRITICAL only after a service has first been in an OK state? If the host comes up and the service is immediately broken (unexpectedly), we want to know that; however, it seems questionable whether that needs to alert everyone immediately, since presumably the people bringing the host/service online would be double-checking for this. As far as we know this behaviour is not possible without additional tooling.
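
For actionable 3, the external check essentially needs to distinguish "UI down" from "UI up but stale". A minimal hypothetical sketch follows; the URL, authentication and exact CGI page are assumptions (not an existing WMF check), while the "Status data OUTDATED" string is the warning observed during this incident.

 #!/bin/bash
 # Hypothetical external check: CRITICAL if the Icinga UI answers but reports
 # stale status data; also CRITICAL if the UI is unreachable.
 URL='https://icinga.example.org/cgi-bin/icinga/tac.cgi'
 page=$(curl -fsS --max-time 10 "$URL") || { echo "CRITICAL: Icinga UI unreachable"; exit 2; }
 if grep -q 'Status data OUTDATED' <<< "$page"; then
     echo "CRITICAL: Icinga UI reports outdated status data"
     exit 2
 fi
 echo "OK: Icinga status data is fresh"
 exit 0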