Incidents/20190321-acmechief
Summary
After upgrading to acme-chief 0.14 and restarting the uwsgi-acme-chief service on acmechief1001, the acme-chief API wrongly reported the files under /etc/acmecerts to puppet as directories, effectively wiping the TLS certs used by the services whose certificates are managed by acme-chief.
The issue in the acme-chief API is fixed by https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/498046
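The failure mode described above (certificate paths ending up as something other than regular files) can be checked for on a host with a short script. The following is a minimal sketch only, not part of acme-chief or the puppet tooling: the function name `find_non_regular_files` is hypothetical, and the caller is assumed to supply the list of certificate and key paths it expects to exist.

```python
import os
from typing import Iterable, List


def find_non_regular_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that are missing or are not regular files.

    Hypothetical helper: a path that should hold a certificate or key
    but is a directory (or does not exist) is reported back.
    os.path.isfile follows symlinks, so a symlink pointing at a
    regular file is treated as healthy.
    """
    return [path for path in paths if not os.path.isfile(path)]
```

Running such a check after a puppet run, and before restarting a service that loads the certificates, would flag a directory sitting where a certificate file is expected.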
Timeline
All times are in UTC
- 08:54 uwsgi-acme-chief is restarted on acmechief1001, making the acme-chief upgrade to 0.14 effective.
- 08:58 slapd crashes on seaborgium after a puppet run destroys the TLS files in /etc/acmecerts
- 09:09 toolschecker pages CRITICAL for Test LDAP on checker.tools.wmflabs.org
- 09:13 acme-chief downgraded to 0.12 and uwsgi-acme-chief restarted
Affected servers
The following servers have been affected by this issue:
- sodium.wikimedia.org
- seaborgium.wikimedia.org
- cobalt.wikimedia.org
- gerrit2001.wikimedia.org
- netmon2001.wikimedia.org
- netmon1002.wikimedia.org
- mx1001.wikimedia.org
- mx2001.wikimedia.org
- ldap-eqiad-replica01.wikimedia.org
- ldap-eqiad-replica02.wikimedia.org
- fermium.wikimedia.org
- dbmonitor1001.wikimedia.org
- dbmonitor2001.wikimedia.org