Troubleshooting "exim queue warning" alerts
Metrics dashboard: https://grafana.wikimedia.org/d/000000451/mail
This alert fires when the number of queued emails on an Exim server exceeds the defined threshold, usually about 1000 messages. A few possible causes are:
- A remote domain that receives a lot of our email is down, and we are queueing messages for redelivery.
- A user who receives a lot of our email has a problem with their mail inbox (over quota, account removed, etc.) and we are queueing messages for redelivery.
- Problematic messages are being relayed through our mail system and are being temporarily rejected (rate limiting, spam prevention, etc.), causing us to queue valid mail for redelivery.
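The condition behind the alert can be reproduced by hand. The following is a minimal sketch, not the actual alert check: check_queue_threshold is a hypothetical helper name, and the default of 1000 is an assumption taken from the "about 1000 messages" figure above.

```shell
# check_queue_threshold: hypothetical helper, NOT the real alert check.
# Usage: check_queue_threshold <queued message count> [threshold]
check_queue_threshold() {
    count="$1"
    threshold="${2:-1000}"   # assumed default, from "about 1000" above
    if [ "$count" -gt "$threshold" ]; then
        echo "WARNING: $count messages queued (threshold $threshold)"
        return 1
    fi
    echo "OK: $count messages queued (threshold $threshold)"
}

# On a mail host, exim -bpc prints the number of messages in the queue:
# check_queue_threshold "$(exim -bpc)"
```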
To get a better understanding of why this alert fired, review the size of the queue on the relevant host(s) and check whether the problem is localized to a single domain or affects multiple domains. For example:
    root@mx1001:~# mailq | exiqsumm -c

    Count  Volume  Oldest  Newest  Domain
    -----  ------  ------  ------  ------
      881  5700KB      4d      0m  gmail.com
       21   104KB    126d    126d  wikipedia.org
       16    47KB    259d     11d  mediawiki.com
       14    35KB    259d     39m  wikibooks.com
       13    38KB    259d     70d  wikipedia.in
       11    23KB    259d     18d  wiktionary.com
        9    21KB    259d     13d  wikiversity.com
In this case the vast majority of mail in the queue is for gmail.com. Good, now we know the problem is not affecting multiple domains. Next we might wonder how many users at that domain, in this case gmail.com, are affected. We can get a quick count of deferred messages per recipient for a given domain with something like the following:
    mx1001:~# mailq | grep gmail.com | grep -v D | sort | uniq -c | sort -n
          2 firstname.lastname@example.org
         15 email@example.com
         40 firstname.lastname@example.org
        812 email@example.com
Ok, from this we can see user firstname.lastname@example.org is responsible for the vast majority of queued messages. With this info we can then look for a reason in the Exim logs. This may take some hunting and pecking through the logs, depending on the nature and scope of the problem. Something like the below is a good start and will often give hints:
    # grep -i $problem_domain /var/log/exim4/mainlog | grep -i error
    # for example:
    mx1001:~# grep -i email@example.com /var/log/exim4/mainlog | grep -i error
    # example error:
    H=alt2.gmail-smtp-in.l.google.com [188.8.131.52]: SMTP error from remote mail server after RCPT TO:<firstname.lastname@example.org>: 452-4.2.2 The email account that you tried to reach is over quota. Please direct\n452-4.2.2 the recipient to\n452 4.2.2 https://support.google.com/mail/?p=OverQuotaTemp s27si1505313edm.307 - gsmtp
If errors are present, you should see whether they are 4xx (temporary) or 5xx (permanent) errors, along with a short description of the problem as provided by the recipient's mail system.
In the example above we can see a recipient is over quota, and their mail provider is temporarily rejecting messages with a 452 code, so our mail server is queueing them. If the user receives a lot of mail this can push the check over the alert threshold.
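To see at a glance whether the logged rejections are mostly temporary or permanent, the SMTP reply codes can be tallied. This is a minimal sketch, assuming the log lines contain the "SMTP error from remote mail server" text seen in the example above; smtp_error_codes is a hypothetical helper name, not an Exim tool.

```shell
# smtp_error_codes: hypothetical helper, not part of Exim.
# Reads Exim mainlog lines on stdin and prints "<count> <code>" for each
# 4xx/5xx reply code found after "SMTP error from remote mail server".
smtp_error_codes() {
    awk '
        {
            start = index($0, "SMTP error from remote mail server")
            if (start == 0) next
            rest = substr($0, start)
            # The reply code follows ": ", e.g. ">: 452-4.2.2 ..." or ": 550 ..."
            if (match(rest, /: [45][0-9][0-9][- ]/)) {
                count[substr(rest, RSTART + 2, 3)]++
            }
        }
        END { for (code in count) print count[code], code }
    ' | sort -rn
}

# Example, on the mail host:
# grep -i gmail.com /var/log/exim4/mainlog | smtp_error_codes
```

A tally dominated by 4xx codes points at deferrals that Exim will keep retrying, which is what grows the queue.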
show mail queue
    mailq
    exim -bp
flush mail queue
    runq
    exim -q
force delivery attempt
    exim -qf (non-frozen messages)
    exim -qff (all messages, frozen or not)
deliver just one specific mail from queue
exim -M [queue-id]
(the queue-id is what you see after the size and before the email address)
search for specific mails in the queue
    exiqgrep -f <sender address>
    exiqgrep -r <recipient address>
    man exiqgrep for more options
count number of mails to a specific recipient
exiqgrep -cr <recipient address>
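To count queued mail per recipient for a whole domain rather than a single address, the mailq listing can be parsed directly. This is a minimal sketch under one assumption about the listing format: each message starts with a header line (age, size, message ID, sender), undelivered recipients appear alone on their line, and already delivered recipients are prefixed with "D". queued_recipients_for_domain is a hypothetical helper name.

```shell
# queued_recipients_for_domain: hypothetical helper, not part of Exim.
# Reads "mailq" / "exim -bp" output on stdin and prints "<count> <address>"
# for every undelivered recipient at the given domain, smallest count first.
# Usage: mailq | queued_recipients_for_domain <domain>
queued_recipients_for_domain() {
    domain="$1"
    awk -v d="$domain" '
        # Assumed format: a line with exactly one field containing "@" is an
        # undelivered recipient. Header lines have four or more fields and
        # delivered recipients have two ("D" plus the address).
        NF == 1 && index($1, "@") > 0 {
            n = split($1, parts, "@")
            if (tolower(parts[n]) == tolower(d)) count[$1]++
        }
        END { for (addr in count) print count[addr], addr }
    ' | sort -n
}

# Example, on the mail host:
# mailq | queued_recipients_for_domain gmail.com
```

Matching on the field count rather than filtering with grep -v D avoids dropping addresses that happen to contain a capital D.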
remove emails to a specific recipient
exiqgrep -i -r email@example.com | xargs exim -Mrm
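Because removal cannot be undone, it can help to look at what would be deleted first. This is a minimal sketch wrapping the same exiqgrep and exim -Mrm commands; remove_queued_for and its --yes argument are hypothetical, introduced only for illustration.

```shell
# remove_queued_for: hypothetical helper, not part of Exim.
# Usage: remove_queued_for <recipient address> [--yes]
# Without --yes it only lists the message IDs that would be removed.
remove_queued_for() {
    recipient="$1"
    confirm="${2:-}"
    # exiqgrep -i prints only the message IDs of matching messages
    ids="$(exiqgrep -i -r "$recipient")"
    if [ -z "$ids" ]; then
        echo "no queued messages for $recipient"
        return 0
    fi
    count="$(printf '%s\n' "$ids" | wc -l | tr -d ' ')"
    if [ "$confirm" != "--yes" ]; then
        echo "would remove $count message(s) for $recipient:"
        printf '%s\n' "$ids"
        return 0
    fi
    # Remove the messages one by one
    printf '%s\n' "$ids" | while read -r id; do
        exim -Mrm "$id"
    done
}

# Example, on the mail host:
# remove_queued_for email@example.com          # preview only
# remove_queued_for email@example.com --yes    # actually remove
```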
test address routing
On e.g. mx1001.wikimedia.org
exim -bt <address>