Incidents/2022-10-17 mx and vrts

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-10-17 mx and vrts
Start: 2022-10-17T20:32:09Z
End: 2022-10-17T23:07:09Z
Task: T321135
People paged: 6 (2 of them on-call, 4 opt-in for 24/7 pages)
Responder count: 5 (2 on-call, 3 responded without having been paged)
Coordinators: dzahn
Affected metrics/SLOs: No relevant SLOs exist
Impact: Delayed mail delivery; VRTS users and general email recipients received mail with a delay and also received spam email.

A wave of spam email to an info@ address was routed from the mail servers to the VRTS machine (otrs1001).

Many Perl processes were spawned, which used up all the RAM of the virtual machine. The oom-killer then killed clamav-daemon.
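A generic way to confirm such an OOM kill on the VM (a sketch for illustration, not a command log from the incident; the service name and log sources assume a standard Debian setup):

    # Check the kernel log for OOM killer activity and the process it picked.
    sudo dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
    # The same via the journal, plus the resulting state of clamav-daemon.
    sudo journalctl -k --since today | grep -i oom
    systemctl status clamav-daemon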

Without clamav, mail delivery stopped.

More mail started queuing up, first on otrs1001 and then on the mail server mx1001.

When the mail queue reached a critical threshold on mx1001, SRE got paged.

Measures taken included increasing RAM available on the otrs1001 VM and deleting spam email.

Eventually all mail was delivered, just with a delay.
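The MXQueueHigh alert fires on the number of messages sitting in the exim queue on the MX host. As a rough illustration (not taken from the incident itself, and assuming shell access to mx1001 with the standard Debian exim4 tooling), the queue depth can be checked like this:

    # Count the messages currently in the exim queue (the figure MXQueueHigh alerts on).
    sudo exim4 -bpc
    # The same count via the exiqgrep helper shipped with exim4.
    sudo exiqgrep -c
    # Summarize the queue per destination domain to spot a spam wave.
    sudo exim4 -bp | exiqsumm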

Timeline


All times in UTC.

  • 2022-10-17T20:32:09Z OUTAGE BEGINS with "20:32 <+jinxer-wm> (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 7353 #page.."
  • 2022-10-17T20:53:00Z It is identified that mail delivery fails because clamav-daemon gets killed by the OOM killer.
  • 2022-10-17T21:02:00Z The "max_threads" setting in the clamav-daemon config is changed from 12 to 2, and subsequently to 1, in an attempt to keep it from being killed.
  • 2022-10-17T21:46:00Z A bash script is executed on mx1001 that removes spam mail matching certain patterns ("mini cooper") from the queue (see the command sketch after this timeline).
  • 2022-10-17T22:02:00Z The same script is executed on otrs1001, significantly reducing the mail queue there.
  • 2022-10-17T22:34:00Z The 'gnt-instance' command is executed to increase the RAM of the VM from 4 GB to 8 GB.
  • 2022-10-17T22:39:00Z The number of mails in the queue (exiqgrep -c) starts to go down.
  • 2022-10-17T22:45:00Z Puppet is re-enabled and run, which reverts the previous changes to max_threads of clamav-daemon; it is using 12 threads again.
  • 2022-10-17T22:51:00Z 'exim4 -qf' is executed on mx1001 to re-deliver queued mails; swapping continues, but there are no more OOMs.
  • 2022-10-17T22:54:00Z Memory frees up after an initial burst of activity.
  • 2022-10-17T23:07:09Z OUTAGE ENDS with "<+jinxer-wm> (MXQueueHigh) resolved: MX host mx2001:9100 has many queued messages: 4623.." when the mail queue is under the threshold again.
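The cleanup and recovery steps above map roughly onto the following commands. This is a minimal sketch rather than the script that was actually run: the body-matching loop, the Ganeti memory value and units, and the instance name are assumptions, so check the relevant man pages before reusing any of it.

    # Find queued messages whose body matches the spam pattern and remove them
    # (exiqgrep -i lists queued message IDs, exim4 -Mvb prints a body, -Mrm removes a message).
    for id in $(sudo exiqgrep -i); do
        if sudo exim4 -Mvb "$id" | grep -qi 'mini cooper'; then
            sudo exim4 -Mrm "$id"
        fi
    done

    # On the Ganeti master: grow the VM's memory and reboot it so the change takes effect
    # (parameter name, units in MiB, and the instance name are assumptions for this cluster).
    sudo gnt-instance modify -B memory=8192 otrs1001
    sudo gnt-instance reboot otrs1001

    # Force a delivery attempt for everything still queued on mx1001, then re-check the count.
    sudo exim4 -qf
    sudo exiqgrep -c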

Detection

On-call SREs got paged by Splunk On-Call (VictorOps).

The incident name was "Critical: [FIRING:1] MXQueueHigh misc (node ops page prometheus sre)"; the incident ID was VictorOps 3094.

Conclusions

We should have more than a single VRTS server.

Spam should not take down the VRTS machine.

What went well?

  • SREs were online and had ideas about what to do.

What went poorly?

  • It took longer than necessary to do the reboot step, out of concern that the server would not come back and because we do not have a failover machine.

Where did we get lucky?

  • There was in fact no problem with the VM coming back from the reboot, despite the long uptime and concerns about the NIC changing names.

Links to relevant documentation

Actionables

Scorecard

Incident Engagement ScoreCard

Answers are yes/no; notes follow where relevant.

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes. Notes: VO attempted to page 6 people; 2 of them were on-call and were reached, and 4 more appear to opt in for 24/7 pages but did not respond. 3 other users responded without being paged.
  • Were pages routed to the correct sub-team(s)? no. Notes: There was no expectation that this would happen; it was during the assigned on-call rotation.
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. yes

Process
  • Was the incident status section actively updated during the incident? no
  • Was the public status page updated? no
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes. Notes: To the best of our knowledge.

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes. Notes: Unless you count the general exim->postfix switch, which might come with rspamd.
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 10