Incidents/20190321-acmechief

Summary

After acme-chief was upgraded to 0.14 and the uwsgi-acme-chief service was restarted on acmechief1001, acme-chief-api wrongly reported the files under /etc/acmecerts to puppet as directories. Puppet then replaced those files with directories, effectively wiping the TLS certificates of the services whose certificates are managed by acme-chief.

The issue in the acme-chief API is fixed by https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/498046
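
The failure mode described above (certificate files turned into directories) can be detected with a simple filesystem scan. The following is a minimal sketch, not part of acme-chief or of the fix: the function name find_misplaced_directories and the file suffixes are assumptions made for illustration, and the real layout of /etc/acmecerts may differ.

```python
import os


def find_misplaced_directories(root, suffixes=(".crt", ".key", ".pem")):
    """Return paths under `root` that are named like certificate files
    but exist on disk as directories.

    The suffixes are an assumption for illustration; adjust them to the
    file names actually deployed on the host.
    """
    misplaced = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            # A directory whose name ends in a certificate-like suffix is
            # the symptom described in the summary.
            if name.endswith(suffixes):
                misplaced.append(os.path.join(dirpath, name))
    return sorted(misplaced)


# Hypothetical usage on an affected host:
#   for path in find_misplaced_directories("/etc/acmecerts"):
#       print("expected a file, found a directory:", path)
```

An empty result only means that no directory matched the assumed suffixes; it does not prove that the certificates themselves are intact.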

Timeline

All times are in UTC

  • 08:54 uwsgi-acme-chief is restarted on acmechief1001, putting the acme-chief upgrade to 0.14 into effect.
  • 08:58 slapd crashes on seaborgium after puppet runs and destroys the TLS files in /etc/acmecerts
  • 09:09 toolschecker pages CRITICAL for Test LDAP on checker.tools.wmflabs.org
  • 09:13 acme-chief downgraded to 0.12 and uwsgi-acme-chief restarted

Affected servers

The following servers were affected by this issue:

  • sodium.wikimedia.org
  • seaborgium.wikimedia.org
  • cobalt.wikimedia.org
  • gerrit2001.wikimedia.org
  • netmon2001.wikimedia.org
  • netmon1002.wikimedia.org
  • mx1001.wikimedia.org
  • mx2001.wikimedia.org
  • ldap-eqiad-replica01.wikimedia.org
  • ldap-eqiad-replica02.wikimedia.org
  • fermium.wikimedia.org
  • dbmonitor1001.wikimedia.org
  • dbmonitor2001.wikimedia.org