Incidents/2021-02-26 sudo


document status: in-review

Summary

A patch containing a syntax error in the /etc/sudoers file was merged and deployed to all hosts. As a result, sudo did not work for the period indicated below, which mostly affected Nagios check execution (alerting) and generated root mail spam. As a consequence, mail delivery also became overloaded and delayed.
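A syntax error of this kind can be caught before a file is installed by having `visudo` parse the candidate file in check-only mode. The following is a minimal sketch under stated assumptions, not the change that was actually deployed: the function name `install_sudoers` and the `VISUDO` override variable are hypothetical, while `visudo -c -f <file>` is the standard check-only invocation.

```shell
#!/bin/sh
# Sketch: refuse to install a sudoers file that visudo cannot parse.
# install_sudoers and VISUDO are illustrative names, not existing tooling.

VISUDO="${VISUDO:-visudo}"

install_sudoers() {
    candidate="$1"   # new file to validate
    target="$2"      # destination path (hypothetical, chosen by the caller)

    # -c: check-only mode; -f: check this file instead of /etc/sudoers.
    if ! "$VISUDO" -c -f "$candidate" >/dev/null 2>&1; then
        echo "refusing to install $candidate: syntax check failed" >&2
        return 1
    fi

    # Copy to a temporary name next to the target, then rename, so a
    # partially written file never becomes the active one.
    # sudoers files are conventionally mode 0440.
    tmp="${target}.tmp.$$"
    cp "$candidate" "$tmp" && chmod 0440 "$tmp" && mv "$tmp" "$target"
}
```

For example, `install_sudoers ./sudoers.new /etc/sudoers` would leave the existing file untouched if the new one fails to parse.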

Timeline

  • 08:50 666899 is merged, containing a syntax error in /etc/sudoers
  • 08:51 People report on IRC that they are unable to sudo on db1107 and other hosts due to a parse error (>>> /etc/sudoers: syntax error near line 6 <<<)
  • 08:52 Hundreds of emails start arriving at root@ with *** SECURITY information for <hostname>*** (sudo failures)
  • 08:55 <jbond42> !log disabled puppet pending rollback of https://gerrit.wikimedia.org/r/666899
  • 08:59 klausman merges 667110, containing a fix, and runs puppet-merge soon after.
  • 09:00 Incident opened. jynus becomes IC.
  • 09:06 Puppet reenabled
  • 09:12 Start reenabling puppet fleetwide
  • 09:23 Puppet run at 10%
  • 09:37 Puppet run at 30%
  • 09:50 Puppet run at 50%
  • 10:17 Puppet run at 80%
  • ~10:20 UNKNOWN Nagios alerts are gone
  • 10:33 puppet run finished
  • 11:40: jbond takes over as IC
  • 11:45: mx2001 queue has remained static at 4344 for 20 minutes
  • 11:45: mx1001 queue shrinking at 0-3 msgs/sec
  • 12:55: run `exiqgrep -i -o 7200 -y 10800 -f 'root@wikimedia.org' | xargs exim -Mrm` on mx servers
  • 12:55: queue down to ~ 2000 (891 frozen) msgs on mx1001 and 500 (434 frozen) on mx2001
  • 13:02: Still receiving 450-4.2.1 from gmail for a number of recipients
  • 13:30: reports of flood emails slowing down
  • 13:43: message queue excluding frozen messages on mx2001 is 0 (mx1001 ~ 800)
  • 14:00 ran the following to push through the last few messages: `for i in $(sudo exiqgrep -f nagios@lists1001.wikimedia.org -i) ; do sudo exim -M ${i} ; sleep 1 ; done `
  • 14:05: unfrozen queue is still at 784; however, the queue looks normal
  • 14:05: Incident resolved
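
The 14:00 step forces a delivery attempt for each queued message from one sender, pausing between attempts so that the mail servers are not flooded again. A minimal sketch of the same loop as a reusable function follows. The function name `retry_from_sender`, its arguments, and the `DRY_RUN` switch are hypothetical additions; `exiqgrep -i -f` and `exim -M` are the commands used during the incident.

```shell
#!/bin/sh
# Sketch: force delivery of queued messages from one sender, one at a time.
# retry_from_sender and DRY_RUN are illustrative names.
# Run as root on the mail host (the incident commands used sudo).

retry_from_sender() {
    sender="$1"        # pattern matched against the envelope sender
    delay="${2:-1}"    # seconds to wait between delivery attempts

    # exiqgrep -i prints only message IDs; -f filters by sender.
    exiqgrep -i -f "$sender" | while read -r msgid; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "would run: exim -M $msgid"
        else
            # Force a delivery attempt; keep exim away from the loop's stdin.
            exim -M "$msgid" </dev/null
            sleep "$delay"
        fi
    done
}
```

Running `DRY_RUN=1 retry_from_sender 'nagios@lists1001.wikimedia.org'` first shows which messages would be retried without touching the queue.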

Cleanup GMail

You can use this Gmail search filter to find all of the flood emails:

 from:(nagios@) SECURITY after:2021-02-25 

Also check your spam folder.

Remediation considerations

Because Puppet runs as root and is triggered by a cron entry, a broken sudo does not prevent Puppet from running, so a fix can still be deployed through Puppet.

In addition, because Cumin connects over SSH as root to all hosts, it is also possible to SSH into the Cumin host, become root using the password in pwstore, and perform an emergency fix on all hosts via Cumin. If for some reason it is not possible to SSH directly into the Cumin host, it is also possible to connect to it via the management console and log in from there directly as root.
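
As an illustration of the Cumin path, the sketch below checks the sudoers syntax on a set of hosts from a root shell on the Cumin host, which would show which hosts still carry a broken file. The function name `cumin_check_sudoers`, the `DRY_RUN` switch, and the `'*'` default query (assumed here to select every host) are hypothetical; the sketch relies only on the general `cumin <hosts> <command>` form and on `visudo -c`.

```shell
#!/bin/sh
# Sketch: check sudoers syntax across hosts via Cumin.
# Assumes a root shell on the Cumin host; names are illustrative.

cumin_check_sudoers() {
    query="${1:-*}"   # Cumin host selection; '*' is assumed to match all hosts

    # visudo -c parses the active sudoers configuration and exits non-zero
    # on errors, so Cumin reports the hosts where the check fails.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would run: cumin '$query' 'visudo -c'"
    else
        cumin "$query" 'visudo -c'
    fi
}
```

Starting with a narrow query, for example `cumin_check_sudoers 'db1107*'`, limits the blast radius before running anything fleet-wide.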

Actionables