2015-05-19 Mailman outage
The mailing lists underwent maintenance on 2015-05-19 at 17:00 UTC. The window was expected to last until 19:00 UTC. Due to unexpected issues, the mailing list server was offline and experiencing errors until roughly 21:00. A mailman configuration patch, merged more than a month earlier, contributed to the outage: mailman had not been restarted since the patch was deployed, so the problem only surfaced when the service was restarted during the window. Tracking down and fixing this issue was a team effort by Rob, JohnFLewis (volunteer), and Daniel.
The 'outage' was actually large-scale moderation: all messages sent to lists were held for moderation, even those sent by list members. List moderators should be able to release all held messages for onward delivery.
- 17:01 (April 15): Mailman configuration patch is merged and deployed but mailman is not restarted.
- 17:04: Mailman maintenance window started per T99098
- ~18:00: Rob finishes the first of two changes and notices some odd errors on mailman restart (these should have been examined more closely).
- 19:00: Changes per T95195 & T99136 are completed. Testing the changes reveals that some or all mailing list messages are being held for moderation (even when the sender is a member of the list).
- 19:00-21:00: Troubleshooting of all steps taken during the maintenance window. John discovers the old patchset, and livehack testing determines the solution. The patchset is reverted and the revert is pushed live.
- lack of central logging (only roots can troubleshoot logs for sodium); should mailman logs route to central logging?
- configuration changes need a mailman restart at the time of the change
- unrelated: someone hacked up the mbox file on wiki-research-l and then didn't rebuild the archives. When we rebuilt them today, the archives were renumbered; this would have been best caught at the time the mbox file was changed.
- Ensure ALL configuration changes are tested and production service is restarted at time of configuration change.
- DONE via 
- This may be a good case for a mailman-admins group, similar to other service groups.
- Either push logs to central logging, or set up a user group for non-opsen to access mailman logs: https://phabricator.wikimedia.org/T99734
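
The point above about restarting mailman at the time of each configuration change could be backed by a simple check for configuration that is newer than the running service. The following is a minimal sketch under stated assumptions, not part of the actual mailman deployment: the function name and the idea of passing the service start time as an epoch timestamp are hypothetical, and the demonstration uses a temporary file in place of a real configuration file.

```python
import os
import tempfile
import time


def config_newer_than_service(config_path, service_start_epoch):
    """Return True if the config file was modified after the service started.

    config_path: path to a configuration file (hypothetical; any file works).
    service_start_epoch: service start time in seconds since the epoch,
    obtained by whatever means the host provides (assumption).
    """
    return os.path.getmtime(config_path) > service_start_epoch


if __name__ == "__main__":
    # Demonstration only: a temporary file stands in for a real config file.
    with tempfile.NamedTemporaryFile(suffix=".cfg") as cfg:
        started_an_hour_ago = time.time() - 3600
        if config_newer_than_service(cfg.name, started_an_hour_ago):
            print("config changed since service start; restart is pending")
        else:
            print("service is running with the current config")
```

A check like this, run periodically or after each deploy, would have flagged the April 15 patch as deployed but never exercised by a restart.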