Data Platform/Systems/Dealing with data loss alarms

We monitor webrequest-dataloss for a given webrequest_source and hour using 2 mechanisms:

The percentage of rows not having a timestamp (dt = '-') should not exceed 0.5% (warning) / 2% (error).
The actual number of rows computed using count by hostname should match the expected one, computed using MAX(sequence) - MIN(sequence) by hostname (sequence being an incremental integer assign by varnish-kafka, by host). By match, we mean the difference should be smaller than 1% (warning) / 5%(error).

You can find which mechanism has triggered the alarm message by different means:

Reading the values in the file attached to the alarm-message
checking the wmf_raw.webrequest_sequence_stats_hourly and wmf_raw.webrequest_sequence_stats

Finally, two cases are possible:

The problem comes from too many rows without timestamp: Talk to either Luca or Andrew as a discussion with the Traffic-Team is probably needed (the dt = '-' happens on varnish or varnish-kafka).
The problem comes from a mismatch between the actual number of rows and the expected-one. For this last case, false-positive warning or errors can happen (sequence-numbers either present in previous or next hour). Follow the procedure below to check for false-positive.

Example of trouble-shooting webrequest errors: https://phabricator.wikimedia.org/T208752

Check dataloss False positives

It is possible that a data loss alarm is triggered because data for one hour appears on the previous or next hour. You can verify that with this scrip

if missing less than 2%, too small, don't worry about it (and normally no email is sent)
if missing more than 2% of data, an alert email is sent -- check if data-loss (hostname, sequence) are not a false positive
- Wait until the hour AFTER the one in alert has been refined
- Connect to a server with the analytics kerberos keytab (eg. an-launcher1002)
- Run the following command with webrequest_source, year, month, day and hour updated accordingly to the alert.
- As for the table, if the email was a warning, data got refined, use wmf.webrequest. If the email was an error, data didn't get refined, use wmf_raw.webrequest.

sudo -u analytics kerberos-run-command analytics spark3-sql --master yarn -S \
         --jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar \
         -f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql \
         -d table_name=TABLE            \
         -d webrequest_source=SOURCE    \
         -d year=YEAR                   \
         -d month=MONTH                 \
         -d day=DAY                     \
         -d hour=HOUR

If the output of this query contains rows that have the false_positivefield to false, there is real data-loss. Contact Andrew (ottomata), Luca (elukey) or Joseph (joal) to look more in details.
If the output of this query contains only rows that have false_positive field set to true, and no rows in the hours before or after, then you might be looking at the wrong kind of data loss, it might be that the records are missing a timestamp. You can check this by reading the original alert email and the attached file.

Side Note: When checking for false-positives, it is interesting to understand the data-pattern of rows without dt in hadoop partitions. The rows end up not having a dt because the initial requests to varnish might have timeout and thus the dt , which is added when the request completes it is just not there. Since Gobblin falls back to using current-timestamp if dt is not set, rows that don't have a valid dtcan end up either in hour or hour+1, depending on when it is generated and when Gobblin runs. This pattern explains why we filter no-dt rows in the current-hour but not in the next-hour in the dataloss-script: for current hour, no-dt rows have sequence numbers for either previous-hour or current-hour, while next-hour has no-dt rows for current-hour and next-hour. The script is actually interested in those rows with no-dt whith sequence numbers belonging to current-hour.

Incomplete Records

We define incomplete records as records where dt is set to -. This usually happens when the request times out, but searching webrequest for these, you can find cases when the http_status was 200, it was a cache hit on the front, and is_pageview is true. This should not happen and, indeed, when checking some of those sequence numbers with traffic we did find they had a "Timestamp Resp:" tag in varnishlog. So varnishkafka is somehow just not getting that value. NOTE for future selves: look into how varnishkafka parses this and try to figure out why it sometimes fails to do so when the request appears fine. See another look into this from many years ago, with an action item that we never followed up on: https://phabricator.wikimedia.org/T148412#2779113.

Rerunning a failed webrequest job

When there's a data loss error, the refine task will not trigger. But we might want to execute it anyway to refine at least what's there. To do so, we need to clear and re-run the airflow dag instance that reported loss. When we do this, we have to increase the warning and error thresholds. Here's a step-by-step:

ssh to the Analytics Airflow instance: Data_Engineering/Systems/Airflow/Instances#analytics
Edit the Variable for the DAG you're re-running (right now this is called "refine_webrequest_hourly_text_config" but TODO: change this when we update the Airflow job): [1]
NOTE: Before saving, realize that all DAG instances that start after that will get the new thresholds, so maybe wait until a new hour just starts processing and then save.
Set the new thresholds, for example to 100%: {"start_date":"2023-04-12-06", "data_error_threshold":"100", "data_warning_threshold":"100"}
Clear the Airflow task that checked for errors (no need to generate statistics again, they'll be the same). This will re-trigger the rest of the job to run, and refine to happen.
Set the thresholds back to the default value (either by removing them from the variable or setting them).