Incidents/2018-06-06 mx

Summary

Between [2018/06/06 18:55 UTC] and [2018/06/07 02:00 UTC], outbound mail was interrupted for the services listed below (identified by Puppet module/profile name).

Service (impact to outbound mail)

  • Phabricator (outbound mail delayed)
  • Gerrit (outbound mail lost)
  • Hue (outbound mail lost)
  • IEG grant review (outbound mail lost)
  • Oozie (outbound mail lost)
  • Wikimania scholarships (outbound mail lost)
  • Sentry (outbound mail lost)
  • Snapshot wikidump (outbound mail lost)

Timeline

  • 18:55 UTC (June 6th) Keith stops Exim on mx1001 in preparation for an OS reload planned for the next day. [1]
  • 01:27 UTC (June 7th) Gerrit and Phabricator email issues are reported via IRC.
  • 01:34 UTC Sam Reed creates Phabricator task T196598 describing the Phabricator and Gerrit email issues.
  • 01:46 UTC Kunal pages SRE/Ops team via SMS.
  • 01:54 UTC Faidon checks in on IRC and begins investigating.
  • 02:00 UTC Keith checks in on IRC and begins investigating.
  • 02:00 UTC Faidon restarts Exim on mx1001, which fixes the issue [2]. Affected services are able to send mail again. Deferred Phabricator emails are dequeued and delivered.

Conclusions

  • Unexpected SPOFs (single points of failure) and a lack of outbound mail queueing are present in several production services. The outage was traced to two issues:
    • Phabricator is configured to use mx1001 as the primary and mx2001 as the backup for outbound mail. For an unknown reason, failover to the backup mail server did not occur when mx1001 became unavailable.
    • Some services are configured (via Puppet) to use mail_smarthost[0] as their SMTP server. This populates the service configuration with only the first mail server listed for that realm, resulting in an SPOF.
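
To illustrate the second issue, the following is a minimal Python sketch of a client that walks the whole smarthost list instead of using only its first entry. It is not taken from any affected service. The host names, the MAIL_SMARTHOSTS list, the send_with_failover function, and its connect parameter are all hypothetical and exist only for illustration.

```python
import smtplib
from email.message import EmailMessage
from typing import Callable, Sequence

# Hypothetical host names: the report names mx1001 and mx2001 but not
# their fully qualified domain names.
MAIL_SMARTHOSTS = ["mx1001.example.org", "mx2001.example.org"]


def _default_connect(host: str) -> smtplib.SMTP:
    # Opens a plain SMTP connection on port 25 with a short timeout.
    return smtplib.SMTP(host, 25, timeout=10)


def send_with_failover(
    msg: EmailMessage,
    smarthosts: Sequence[str] = MAIL_SMARTHOSTS,
    connect: Callable[[str], smtplib.SMTP] = _default_connect,
) -> str:
    """Try each smarthost in order; return the host that accepted the message."""
    errors = []
    for host in smarthosts:
        try:
            with connect(host) as smtp:
                smtp.send_message(msg)
            return host
        except (OSError, smtplib.SMTPException) as exc:
            # Record the failure and fall through to the next smarthost.
            errors.append(f"{host}: {exc}")
    raise RuntimeError("all smarthosts failed: " + "; ".join(errors))
```

A client that uses only smarthosts[0] is equivalent to this loop stopping after its first iteration, which is the SPOF described above. Even with this loop, a message is lost if every smarthost is down, because nothing queues it for a later retry.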

Actionables

  • Investigate and address Phabricator backup email server configuration - phab:T196916
  • Migrate services lacking built-in email queueing and failover to the Exim SMTP server on localhost - phab:T196920
  • Explore graphing of outbound mail volume at a per-service or hostgroup level - phab:T197171
  • Improve outbound mail service alerting - phab:T197172