Incidents/2018-06-06 mx
Summary
Between 2018-06-06 18:55 UTC and 2018-06-07 02:00 UTC, outbound mail was interrupted for the services listed below (identified by Puppet module/profile name).
Service (impact to outbound mail)
- Phabricator (outbound mail delayed)
- Gerrit (outbound mail lost)
- Hue (outbound mail lost)
- IEG grant review (outbound mail lost)
- Oozie (outbound mail lost)
- Wikimania scholarships (outbound mail lost)
- Sentry (outbound mail lost)
- Snapshot wikidump (outbound mail lost)
Timeline
- 18:55 UTC (June 6th) Keith stops Exim on mx1001 in preparation for an OS reload planned for the next day. [1]
- 01:27 UTC (June 7th) Gerrit and Phabricator email issues are reported via IRC.
- 01:34 UTC Sam Reed creates Phabricator task T196598 describing the Phabricator and Gerrit email issues.
- 01:46 UTC Kunal pages the SRE/Ops team via SMS.
- 01:54 UTC Faidon checks in on IRC and begins investigating.
- 02:00 UTC Keith checks in on IRC and begins investigating.
- 02:00 UTC Faidon restarts Exim on mx1001, which fixes the issue [2]. Affected services are able to send mail again. Deferred Phabricator emails are dequeued and delivered.
Conclusions
- Unexpected SPOFs and a lack of outbound mail queueing are present in several production services. The cause was traced to two issues:
  - Phabricator is configured to use mx1001 as primary, and mx2001 as a backup, for outbound mail. For an unknown reason, failover to the backup mail server did not occur when mx1001 became unavailable.
  - Some services are configured (via Puppet) to use an SMTP server of mail_smarthost[0], which populates the service configuration with only the first mail server listed for that realm, resulting in an SPOF.
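
To illustrate the second issue, here is a minimal Python sketch of the difference between using only the first smarthost and trying the whole list in order. It is not taken from any affected service. The function names (smtp_deliver, send_with_failover), the port, and the timeout are assumptions made for illustration, and a service with no local queue would still lose mail if every smarthost were down.

```python
import smtplib
from email.message import EmailMessage
from typing import Callable, Sequence


def smtp_deliver(host: str, msg: EmailMessage, timeout: float = 10.0) -> None:
    """Deliver msg through a single smarthost (hypothetical helper).

    Raises OSError or smtplib.SMTPException if the host is unreachable
    or refuses the message. Port 25 is an assumption.
    """
    with smtplib.SMTP(host, 25, timeout=timeout) as smtp:
        smtp.send_message(msg)


def send_with_failover(
    smarthosts: Sequence[str],
    msg: EmailMessage,
    deliver: Callable[[str, EmailMessage], None] = smtp_deliver,
) -> str:
    """Try every smarthost in order and return the one that accepted msg.

    Using only smarthosts[0] would stop after the first failure; iterating
    over the full list removes that single point of failure.
    """
    errors = []
    for host in smarthosts:
        try:
            deliver(host, msg)
            return host
        except (OSError, smtplib.SMTPException) as exc:
            # Remember why this host failed, then fall through to the next one.
            errors.append(f"{host}: {exc}")
    raise RuntimeError("all smarthosts failed: " + "; ".join(errors))
```

A caller would pass the complete list for its realm, for example send_with_failover(["mx1001.example", "mx2001.example"], msg), where both hostnames are placeholders.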
Actionables
- Investigate and address Phabricator backup email server configuration - phab:T196916
- Migrate services lacking built-in email queueing and failover to exim localhost SMTP server - phab:T196920
- Explore graphing of outbound mail volume on per-service or hostgroup level - phab:T197171
- Improve outbound mail service alerting - phab:T197172