Incidents/20150519-Mailman

From Wikitech

2015-05-19 Mailman outage

Summary

The mailing lists underwent maintenance on 2015-05-19, starting at 17:00 UTC. The window was expected to last until 19:00 UTC. Due to unexpected issues, the mailing list server was offline and experiencing errors until roughly 21:00. A contributing cause was a mailman configuration patch, merged more than a month earlier, which caused an unexpected issue when the mailman service was restarted. Tracking down and fixing this issue was a team effort by Rob, JohnFLewis (volunteer), and Daniel.

The 'outage' was actually large-scale moderation of all messages sent to lists, even those sent by list members. List moderators should be able to send all of the held messages onward.

Timeline

  • 17:04: Mailman maintenance window started per T99098.
  • ~18:00: Rob finishes the first of two changes and notices some odd errors on mailman restart (should have looked closer).
  • 19:00: Changes per T95195 & T99136 are completed. Testing of the changes reveals that some or all mailing list messages are being held for moderation (even when the sender is a member of the list).
  • 19:00-21:00: Troubleshooting of all steps taken during the maintenance window. John discovers the old patchset, and live-hack testing determines the solution. The patchset is reverted and the revert is pushed live.

Conclusions

  • lack of central logging (only roots can troubleshoot logs for sodium); should mailman logs route to central logging?
  • configuration changes need to have mailman restarts at time of change
  • unrelated: someone hacked up the mbox file on wiki-research-l and then did not rebuild the archives. When we rebuilt them today, the archive was renumbered; this would have been best caught at the time the mbox file was changed.

Actionables

  • Ensure ALL configuration changes are tested and production service is restarted at time of configuration change.
  • DONE via [1]
  • This may be a good case for a mailman-admins group, similar to other service groups.
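
The first actionable (restart the service whenever its configuration changes) can be backed by a simple staleness check. The following is a minimal sketch, not something used in this incident: the function names are hypothetical, and it assumes the caller already knows the service's start time as a Unix timestamp and the path of the configuration file.

```python
import os


def config_changed_since_start(config_mtime: float, service_start: float) -> bool:
    """Return True if the config was modified after the service last started.

    Both arguments are Unix timestamps (seconds since the epoch).
    """
    return config_mtime > service_start


def check_config_file(config_path: str, service_start: float) -> bool:
    """Compare a config file's modification time with the service start time.

    config_path and service_start are supplied by the caller; how the start
    time is obtained (init system, pid file, etc.) is outside this sketch.
    """
    return config_changed_since_start(os.path.getmtime(config_path), service_start)
```

A check like this, run periodically, would flag a configuration that was merged but never loaded by a restart, so the problem surfaces at the time of the change rather than at the next maintenance window.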