Exim

Troubleshooting "exim queue warning" alerts

Metrics dashboard: https://grafana.wikimedia.org/d/000000451/mail

This alert fires when the number of queued emails on an Exim server exceeds the defined threshold, usually about 1000 messages. A few possible causes are:

A remote domain that receives a lot of our email is down, and we are queueing messages for redelivery.
A user who receives a lot of our email has a problem with their mail inbox (over quota, account removed, etc.) and we are queueing messages for redelivery.
Problematic messages are being relayed through our mail system and are being temporarily rejected (rate limiting, spam prevention, etc.), causing us to queue valid mail for redelivery.

To understand why the alert fired, review the size of the queue on the relevant host(s) and check whether the problem is localized to a single domain or affects multiple domains. For example:

root@mx1001:~# mailq | exiqsumm -c

Count  Volume  Oldest  Newest  Domain
-----  ------  ------  ------  ------

  881  5700KB      4d      0m  gmail.com
   21   104KB    126d    126d  wikipedia.org
   16    47KB    259d     11d  mediawiki.com
   14    35KB    259d     39m  wikibooks.com
   13    38KB    259d     70d  wikipedia.in
   11    23KB    259d     18d  wiktionary.com
    9    21KB    259d     13d  wikiversity.com
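To put a number on how concentrated the queue is, the summary can be post-processed. The following is a minimal sketch, not part of the original runbook: queue_share_by_domain is a hypothetical helper name, and it assumes the five-column exiqsumm layout shown above.

```shell
# Hypothetical helper: reads `exiqsumm` output on stdin and prints each
# domain's share of the queued messages, largest share first.
queue_share_by_domain() {
  awk '
    # Data rows have five fields and start with a numeric message count.
    # Header, separator, and any TOTAL line are skipped.
    NF == 5 && $1 ~ /^[0-9]+$/ && $5 != "TOTAL" {
      count[$5] = $1
      total += $1
    }
    END {
      if (total == 0) exit
      for (d in count)
        printf "%5.1f%%  %6d  %s\n", 100 * count[d] / total, count[d], d
    }
  ' | sort -rn
}

# Usage on a mail host (assumption: run as root, as in the example above):
#   mailq | exiqsumm -c | queue_share_by_domain
```

A single domain holding most of the queue points at a problem on the remote side; an even spread suggests something local.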

In this case the vast majority of mail in the queue is for gmail.com, so the problem is not affecting multiple domains. Next we might wonder how many users at that domain are affected. We can get a quick count of queued messages per recipient at a given domain with something like the following (grep -v D is meant to drop recipients that mailq marks with D as already delivered; note that it also drops any other line containing a capital D):

mx1001:~# mailq | grep gmail.com | grep -v D | sort | uniq -c | sort -n

      2           foobar@gmail.com
     15           darthvader@gmail.com
     40           bazbaz@gmail.com
    812           redacted@gmail.com
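The grep pipeline above is quick but approximate: it can also match sender lines and it drops any address containing a capital D. A stricter count can be sketched with awk. This is an illustration only: pending_recipients_for_domain is a hypothetical helper name, and it assumes the standard mailq (exim -bp) layout, where undelivered recipients appear alone on an indented line and delivered ones are prefixed with D.

```shell
# Hypothetical helper: reads `mailq` (exim -bp) output on stdin and counts
# undelivered recipients at the given domain, one line per address.
pending_recipients_for_domain() {
  awk -v domain="$1" '
    BEGIN { suffix = "@" tolower(domain) }
    # Undelivered recipient lines hold a single field (the address).
    # Message header lines have several fields, and already-delivered
    # recipients are prefixed with "D", which gives two fields.
    NF == 1 {
      addr = tolower($1)
      n = length(addr) - length(suffix)
      if (n > 0 && substr(addr, n + 1) == suffix) count[addr]++
    }
    END { for (a in count) printf "%7d %s\n", count[a], a }
  ' | sort -n
}

# Usage on a mail host:
#   mailq | pending_recipients_for_domain gmail.com
```

The domain is compared as a literal suffix rather than a regular expression, so the dot in gmail.com matches only a dot.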

From this we can see that redacted@gmail.com accounts for the vast majority of queued messages. With this information we can look for a reason in the Exim logs. This may take some digging through the logs, depending on the nature and scope of the problem. Something like the below is a good start and will often give hints:

# grep -i $problem_domain /var/log/exim4/mainlog | grep -i error
# for example:

mx1001:~# grep -i redacted@gmail.com /var/log/exim4/mainlog | grep -i error

# example error:
H=alt2.gmail-smtp-in.l.google.com [209.85.202.27]: SMTP error from remote mail server after RCPT TO:<redacted@gmail.com>: 452-4.2.2 The email account that you tried to reach is over quota. Please direct\n452-4.2.2 the recipient to\n452 4.2.2  https://support.google.com/mail/?p=OverQuotaTemp s27si1505313edm.307 - gsmtp

If errors are present, you should see whether they are 4xx (temporary) or 5xx (permanent) errors, along with a short description of the problem as provided by the recipient's mail system.

In the example above we can see a recipient is over quota, and their mail provider is temporarily rejecting messages with a 452 code, so our mail server is queueing them. If the user receives a lot of mail this can push the check over the alert threshold.
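When there are many such log lines, tallying temporary versus permanent rejections gives a quick picture. The sketch below is an assumption-laden illustration: classify_smtp_errors is a hypothetical helper name, and it relies on the "SMTP error from remote mail server" wording and the ": NNN" reply-code position seen in the example line above.

```shell
# Hypothetical helper: reads Exim mainlog lines on stdin and counts remote
# SMTP errors by class (4xx = temporary, 5xx = permanent).
classify_smtp_errors() {
  awk '
    /SMTP error from remote mail server/ {
      rest = $0
      # Keep only the text after the marker phrase so that earlier
      # fields in the log line cannot be mistaken for a reply code.
      sub(/.*SMTP error from remote mail server/, "", rest)
      # The reply code is the first ": NNN" followed by a space or dash.
      if (match(rest, /: [45][0-9][0-9][ -]/)) {
        code = substr(rest, RSTART + 2, 3)
        if (substr(code, 1, 1) == "4") temp++; else perm++
      }
    }
    END { printf "4xx=%d 5xx=%d\n", temp, perm }
  '
}

# Usage on a mail host:
#   grep -i redacted@gmail.com /var/log/exim4/mainlog | classify_smtp_errors
```

A queue dominated by 4xx responses will keep retrying, while 5xx responses normally produce bounces instead of staying queued.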

show mail queue

mailq
exim -bp

flush mail queue

runq
exim -q

force delivery attempt

exim -qf (non-frozen messages)
exim -qff (all messages, frozen or not)

deliver just one specific mail from queue

exim -M [queue-id]

(the queue-id is the message ID shown in mailq output, after the size and before the sender address)

search for specific mails in the queue

exiqgrep -f <sender address>
exiqgrep -r <recipient address>
man exiqgrep for more options

count number of mails to a specific recipient

exiqgrep -cr <recipient address>

remove emails to a specific recipient

exiqgrep -i -r specific@recipient.org | xargs exim -Mrm
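Because removal cannot be undone, it can help to preview what would be deleted first. The wrapper below is a sketch under stated assumptions: remove_queued_for_recipient and its --delete switch are hypothetical names invented for this example, and only exiqgrep -i -r and exim -Mrm are the real commands from the line above. As far as I know exiqgrep treats the recipient as a pattern, so a short string may match more addresses than intended.

```shell
# Hypothetical wrapper: lists matching message IDs by default and only
# removes them when called with --delete as the second argument.
remove_queued_for_recipient() {
  recipient="$1"
  mode="${2:-}"
  # Collect the IDs of queued messages for this recipient.
  ids=$(exiqgrep -i -r "$recipient")
  if [ -z "$ids" ]; then
    echo "no queued messages match $recipient"
    return 0
  fi
  if [ "$mode" = "--delete" ]; then
    # Same removal as the one-liner above.
    printf '%s\n' "$ids" | xargs exim -Mrm
  else
    # Dry run: report the count and the IDs, change nothing.
    count=$(printf '%s\n' "$ids" | grep -c .)
    echo "would remove $count message(s) matching $recipient:"
    printf '%s\n' "$ids"
  fi
}

# Usage on a mail host:
#   remove_queued_for_recipient specific@recipient.org            # preview
#   remove_queued_for_recipient specific@recipient.org --delete   # remove
```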

test address routing

On a mail host, e.g. mx1001.wikimedia.org:

exim -bt <address>

cheat sheet

http://bradthemad.org/tech/notes/exim_cheatsheet.php