Fundraising/Data and flow/Failmail zoo

From Wikitech

A guide to the species of failmail encountered in our systems

Fail Mail Storm - lots of different jobs failing at once

One reason for this could be a long running mysql query on civi.

Using innotop

To check:

1. Log into civi1002

2. innotop -h frdb1005

3. Q will let you see the queries running

4. Press e and use Thread ID to see EXPLAIN of query

5. Outside of innotop you can kill the query with civicrm_query_killer -q {id of the query}

Figuring out who the user is

There will be a User : {id} line as part of the query. You can figure out who that id is tied to at /user/{id} in the civi url.

If the civi username isn't obvious to figure out who it is, everyone will be on this contact list: https://collab.wikimedia.org/wiki/Fundraising#Contact_List

Using processlist

1. Log onto civi1001

2. drush @wmff civicrm-sql-cli

This gets you into mysql

3.Show processlist;

This will show a list of queries currently running if one has been running for a long time you can kill it.

You can also use full processlist to see more details:

Show full processlist;

There has been at least one incident where the long running query didn't show up on show processlist. Check the tables affected in the deadlock error message, this can give a hint to what could be causing the deadlocks. You can keep checking show processlist too to see if something new shows up.

Turning off the fail mail per payment processor

You can override the failmail setting defaults that are in provider-defaults.yaml in each individual payment processors main.yaml.

Normal Logging

logging:
  log-level: 7

Ignoring Logging

logging:
    log-level: 7

    log-streams:
        # Temporary setting due to ongoing development work
        failmail:
            class: SmashPig\Core\Logging\LogStreams\NullLogStream

Fail Mail (civi1001) run-job: XXX failed with code 1

This phylum of failmail is generated by the process-control script. The subject line indicates which job failed, and with which exit code, but for any other details you will have to log on to civi1001 and read the log indicated in the body of the failmail.

If the nightly Silverpop EXPORT jobs fail, this could lead to bulk email complaints. Someone should log into the Civi box and run-job silverpop_daily (if the build export files fails) or run-job silverpop_emails_upload_files (if that job failed).

Fail Mail (civi1001) run-job: Check forget me processing failed with code 1

A GDPR data erasure job has failed at Acoustic (formerly known as Silverpop). When someone makes a forgetme request from the Civi UI, Civi sends Acoustic a GDPR erasure request for that email address for each of the lists we maintain there. More details at Fundraising/Data_and_Integrated_Processes/Acoustic_Integration#Forgetting.

Acoustic queues the requests and returns job IDs. We store those in Civi as OmnimailJobProgress entities. The omnimail_check_progress process-control job looks at this table and queries Acoustic to see which of the jobs have finished. When an Acoustic job doesn't finish after a certain amount of time, the job sends failmail.

To check on which address triggered the failing job, use the API4 explorer to check on the jobs at <civiurl>/civicrm/api4#/explorer/OmnimailJobProgress/get . The job identifier should indicate the email address, and the retrieval parameters should include the list ID at Acoustic.

Log into Acoustic and search for that email address in the indicated list. Make sure to check the boxes underneath the search fields to search all of the email addresses. If the email address is not in the list, you can just delete the OmnimailJobProgress entity, again using the API4 explorer. See T244811

If the email address is in the list, you can use this process to redo the request on Acoustic's end https://help.goacoustic.com/hc/en-us/articles/360043344233-Access-CCPA-and-GDPR-requests-through-the-Acoustic-Campaign-user-interface

There is a scary pop up that talks about deletion with the entire database selected. This doesn't delete the database, just the emails uploaded in the csv.

Aborting! Previous process (XXX) is still alive. Remove lockfile manually if in error: /tmp/<jobname>.lock

This indicates that a job has run longer than expected. When process-control starts a job and finds that the previous installment of the job is still running (as indicated by the lockfile still being pressent), it aborts the next run and sends this email. This email can be suppressed by adding "allow_overtime: true" to the job's yaml configuration file. When allow_overtime is true, the next job still won't run, but process-control won't send this email.

Error validating Amazon message. Firewall problem?

Our IPN listener receives messages from Amazon Simple Notification Services when donors send an Amazon Pay donation. Each notification contains a URL for a certificate, and part of the Amazon Pay SDK's logic retrieves that cert and uses it to check the signature. This failmail means something went wrong getting the cert.

There is an IP address in this failmail, but it's not the one we're trying to connect to and failing. The IP logged for all Amazon SNS messages is the SOURCE of the notification, not the IP address we try to contact to get the cert. To find the offending IP address, you will have to ssh into thulium / frpig1001 and try to fetch the cert from the URL mentioned in the log.

For example: curl -v -v https://sns.us-east-1.amazonaws.com/SimpleNotificationService-eaea6120e66ea12e88dcd8bcbddca752.pem

Fundraising ops will have to make sure that IP address is whitelisted in IPtables on thulium / frpig1001.

FIXME: this species tends to arrive in herds of thousands, mercilessly trampling inboxes until someone mangles the SmashPig config files to disable failmail

INVALID_MESSAGE Cannot create contribution, civi error!

This is usually due to a DB lock issue. Our queue consumer creates the contact and their contribution within a single transaction involving upwards of a dozen tables. Depending on what else is happening in Civi, this can result in a deadlock and the contact insert can fail. In this case, the queue message is generally shunted to the 'damaged' table. Follow the link at the bottom of the failmail and click 'resend' at the bottom of the damaged message UI to try the insert again.

If the error contains "'blah' is not a valid option for field custom_20", we have a mismatch in the 'name' and 'value' columns of a Direct Mail Appeal option (in the Gift Data group). This generally happens with values that have been created in the UI. You can fix the name to match the value by using the API explorer.

MISSING_PREDECESSOR Message <gateway>-XXXXXX indicates a pending DB entry with order ID YYYY, but none was found

Amazon and Astropay send IPN messages indicating payment completion, but without much donor info. The IPN listeners forward these messages along to the donations queue with a key 'completion_message_id' indicating how to find the rest of the information in the pending DB. The donations queue consumer looks for this row in the pending table, combines the pending info with the IPN message, and deletes the pending row.

Only sometimes, the pending row isn't found. Maybe it was deleted early? Maybe the donation came in from outer space (or Alexa) and there never was a pending row? Maybe the pending queue consumer is turned off? When this happens, the message is kicked to the damaged table with a requeue date. After enough retries, it's left on the damaged table with no requeue date and the queue consumer sends this email.

TODO: instead of the donations queue consumer doing this recombination, it could happen in a job on a new jobs-amazon queue. That job could fall back to an Amazon API call if it doesn't find the info in the pending table (https://phabricator.wikimedia.org/T194317).

Fail Mail: BAD_AUDIT_LINE (paypal-audit)

Usually due to a missing subscr_id. This is a thing that randomly breaks every few months in the audit-file-producing machinery at PayPal, and confounds our process because we use the subscr_id to associate the donation with the parent subscription.

To deal with it:

  1. Email our tech contact at PayPal asking them to re-generate the file (filename should be included
  2. Use the one_off process-control job to delete the file from the /var/spool/audit/paypal/incoming or /var/spool/audit/paypal/completed directory
  3. The next nightly audit download & parse job should get the fixed file from PayPal

TODO: email once per file (or per parser run) with a summary of errors. Once per error is too darn much!

Last seen November 2020

CIVI_REQ_FIELD Positive amount required

If the gateway is PayPal, probably someone is messing with our old account and sending donations below the minimum we should be allowing. Email our tech contact there and ask them to reinstate the block on sub-minimum donations. Long term plan: when we're totally happy with Express Checkout, turn off the old account.

Last seen January 2020

EMAIL_SYSTEM_FAILURE: civi1001 (thank_you)

This one mostly pops up when we try to send an email to an invalid email address. In the case of email addresses that start with a hyphen(-), it turns out that this is valid, e.g. -dashatstart@example.com is a valid email address, but it gets rejected by postfix (see https://serverfault.com/questions/893648/postfix-bad-recipient-address-syntax-email-starting-with-dash-hyphen/893657#893657) when running the default config. More discussion is needed on whether we want to change our setup to support this.

Get recipient data from Silverpop failed with code 1

If the error message in the log says "Request failed: no data found for trigger id XXXYYYY" you can fix this by clearing out a couple of field on an omnimail job progress table:

update civicrm_omnimail_job_progress
set progress_end_timestamp=NULL, retrieval_parameters=NULL
where job='omnimail_omnirecipient_load';

Possibly Extinct

The ones below haven't been seen since the date listed by them.

2019-08-28 Fredge Multiqueue Consumer failed with code 1

This indicates either a DB access problem, or a coding error on our part. Look in the log referenced in the failmail. If it's a DB access issue, check with FR-ops and turn off the fredge queue consumer if necessary.

If it's an issue with one of the messages, like a badly formatted or missing field, determine the source of the message from the source_ fields and fix the offending component.

2018-09-06 INVALID_RECURRING Subscription account does not exist

Currently always PayPal-related. The donation has come in with a subscr_id value that doesn't correspond to any trxn_id in the civicrm_contribution_recur table. For PayPal, we create civicrm_contribution_recur records when we get a specific type of IPN message or audit line, so maybe that failed? Look at earlier damaged messages to see if there's a subscr_start one with the missing trxn_id that you can re-send.