Incidents/20160512-LabsLdapOutage

Summary

On the evening of May 12th, Andrew merged a patch refactoring LDAP config. Due to an oversight, that change resulted in puppet setting the ldap host to 'undef' on all labs systems.

Lack of an ldap host broke logins for all users apart from those with installed root keys. It also caused puppet runs to fail on hosts with self-hosted puppet.

The primary issue was resolved about 30 minutes after the first user reports, an hour after the original merge. Self-hosted puppet instances had to be fixed by hand, and all were fixed within roughly an hour and a half of that.

For the most part there were no user-facing consequences from this incident. The grid engine was briefly distressed, but it's unclear who or what that affected.

Timeline

[23:06] Andrew merges https://gerrit.wikimedia.org/r/#/c/288539/
[23:33] Users begin reporting that labs instances are providing a password prompt on login
[23:56] Chase calls Andrew, who responds
[00:06] Andrew writes and merges https://gerrit.wikimedia.org/r/#/c/288555/ which resolves ldap issues on most hosts
[00:30] Yuvi fixes some grid-engine fallout from the ldap failures
[00:00 - 01:30] Andrew, Yuvi and Alex Monk apply piecemeal fixes to instances with self-hosted puppet. In most cases all that is needed is adding the proper ldap host to puppet.conf and restarting the puppetmaster. A lot of extra time is spent on deployment-puppetmaster, which turns out to have an unrelated issue: the /var/lib/git/labs/private repo has an unmerged change that prevents updating, so puppet fails to update the ldap password.
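
The hand fixes in the last timeline entry amount to finding config files where the ldap host ended up empty or as the literal string 'undef'. A minimal sketch of such a check follows. It is not a tool that was used during the incident; the setting name ldap_host and the sample file contents are hypothetical, chosen only for illustration.

```python
def find_undefined_settings(config_text, key_fragment="ldap"):
    """Return (line_number, key, value) for matching settings whose value
    is empty or the literal string 'undef'.

    Assumes a simple INI-style "key = value" layout. key_fragment is a
    hypothetical filter, not a real puppet setting name.
    """
    problems = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blanks, comments, section headers, and lines without a value.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if "=" not in line:
            continue
        # Split on the first '=' only, so values like "dc=example" survive.
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        if key_fragment in key.lower() and value.lower() in ("", "undef"):
            problems.append((lineno, key, value))
    return problems


# Hypothetical file contents; the real puppet.conf layout may differ.
sample = """[master]
ldap_host = undef
ldap_base = dc=example,dc=org
"""

for lineno, key, value in find_undefined_settings(sample):
    print(f"line {lineno}: {key} is unset ({value!r})")
```

Run against each self-hosted puppetmaster, a check like this would list the hosts still needing the manual fix, though the actual setting names would have to be confirmed first.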

Actionables

The ldap and login failures did not result in any paging. That is addressed by https://gerrit.wikimedia.org/r/#/c/288603/
We may want to improve monitoring for self-hosted puppetmasters. Specifically, when the auto-rebase tool fails, some sort of notice should be sent so that people know that the puppet state of that host is in peril.
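
The second actionable could be approached with a periodic check of the git checkouts on each self-hosted puppetmaster. The sketch below is one possible shape for that check, not the auto-rebase tool itself. It assumes that a stuck repo shows up either as unmerged or locally modified paths in `git status --porcelain` or as a leftover rebase directory under .git; how a notice would actually be sent is left out.

```python
import os
import subprocess

# Two-letter status codes that `git status --porcelain` uses for unmerged paths.
UNMERGED_CODES = {"DD", "AU", "UD", "UA", "DU", "AA", "UU"}


def classify_porcelain(output):
    """Split porcelain (v1) status output into (unmerged, dirty) path lists."""
    unmerged, dirty = [], []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        code, path = line[:2], line[3:]
        if code in UNMERGED_CODES:
            unmerged.append(path)
        elif code != "??":  # untracked files are ignored in this sketch
            dirty.append(path)
    return unmerged, dirty


def rebase_in_progress(repo_path):
    """True if git left a rebase state directory behind in this checkout."""
    git_dir = os.path.join(repo_path, ".git")
    return any(
        os.path.isdir(os.path.join(git_dir, name))
        for name in ("rebase-merge", "rebase-apply")
    )


def check_repo(repo_path):
    """Return a list of human-readable problems found in the checkout."""
    result = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    unmerged, dirty = classify_porcelain(result.stdout)
    problems = []
    if rebase_in_progress(repo_path):
        problems.append("rebase in progress")
    if unmerged:
        problems.append("unmerged paths: " + ", ".join(unmerged))
    if dirty:
        problems.append("local modifications: " + ", ".join(dirty))
    return problems
```

For example, check_repo("/var/lib/git/labs/private") would return an empty list for a clean checkout; a non-empty result is the condition under which a notice would be sent.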