Analytics/Systems/Dealing with data loss alarms

From Wikitech

We monitor webrequest dataloss for a given webrequest_source and hour using two mechanisms:

  • The percentage of rows not having a timestamp (dt = '-') should not exceed 0.2% (warning) / 1% (error).
  • The actual number of rows, computed using a count by hostname, should match the expected one, computed as MAX(sequence) - MIN(sequence) by hostname (sequence being an incremental integer assigned by varnish-kafka, per host). By match, we mean the difference should be smaller than 1% (warning) / 5% (error).
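The two checks above can be sketched in Python. This is an illustrative model only, not the production alarm code; the per-host tuple layout and the function name are assumptions for the example.

```python
# Hypothetical sketch of the two dataloss checks, assuming per-host stats
# (actual_count, max_seq, min_seq) like those stored in
# wmf_raw.webrequest_sequence_stats_hourly; field layout is illustrative.

def dataloss_checks(host_stats, null_dt_count, total_count):
    """Return (null_dt_pct, loss_pct) for one webrequest_source/hour."""
    # Check 1: percentage of rows without a timestamp (dt = '-').
    # Warning above 0.2%, error above 1%.
    null_dt_pct = 100.0 * null_dt_count / total_count

    # Check 2: expected rows per host from the varnish-kafka sequence range,
    # compared to the rows actually present.
    # Warning above 1% difference, error above 5%.
    expected = sum(max_seq - min_seq for _, max_seq, min_seq in host_stats)
    actual = sum(actual for actual, _, _ in host_stats)
    loss_pct = 100.0 * (expected - actual) / expected
    return null_dt_pct, loss_pct

# Example: one host with 9800 actual rows against 10000 expected,
# and 5 rows without dt -> ~0.05% null-dt, 2.0% loss (warning level).
pcts = dataloss_checks([(9800, 10000, 0)], null_dt_count=5, total_count=9800)
```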

You can find out which mechanism triggered the alarm message in different ways:

  • Reading the values in the file attached to the alarm message (no preview available; download it and open it with a text editor)
  • Checking the wmf_raw.webrequest_sequence_stats_hourly and wmf_raw.webrequest_sequence_stats tables

Finally, two cases are possible:

  • The problem comes from too many rows without a timestamp: talk to either Luca or Andrew, as a discussion with the Traffic team is probably needed (the dt = '-' happens on varnish or varnish-kafka).
  • The problem comes from a mismatch between the actual number of rows and the expected one. In this case, false-positive warnings or errors can happen (sequence numbers may be present in the previous or next hour). Follow the procedure below to check for false positives.

Example of troubleshooting webrequest errors: https://phabricator.wikimedia.org/T208752

Check dataloss false positives

It is possible that a data loss alarm is triggered because data for one hour appears in the previous or next hour. You can verify that with the script below.

  • If less than 1% is missing, the loss is too small to worry about (normally no email is sent)
  • If more than 1% of the data is missing, an alert email is sent -- check whether the missing (hostname, sequence) pairs are a false positive:
    • Wait until the hour AFTER the one in alert has been refined
    • Connect to stat1004 or stat1007
    • Run the following command with webrequest_source, year, month, day and hour updated according to the alert
sudo -u hdfs spark2-sql --master yarn -S -f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql \
         -d table_name=wmf.webrequest   \
         -d webrequest_source=SOURCE    \
         -d year=YEAR                   \
         -d month=MONTH                 \
         -d day=DAY                     \
         -d hour=HOUR
  • If the output of this query contains rows that have the false_positive field set to false, there is real data loss. Contact Andrew (ottomata), Luca (elukey) or Joseph (joal) to investigate in more detail.

Side note: When checking for false positives, it helps to understand how rows without dt end up distributed across Hadoop partitions. A row ends up without a dt when the initial request to varnish timed out, so the dt, which is added when the request completes, is simply never set. Since Camus uses the current timestamp to partition, rows without a valid dt can end up in either hour or hour+1, depending on when the row is generated and when Camus runs. This pattern explains why the dataloss script filters no-dt rows in the current hour but not in the next hour: the current hour's no-dt rows have sequence numbers belonging to either the previous hour or the current hour, while the next hour's no-dt rows belong to the current hour or the next hour. The script is only interested in the no-dt rows whose sequence numbers belong to the current hour.
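The logic described above can be illustrated with a small Python sketch (this is not the real sparksql script; the function name and data shapes are made up for the example). It keeps only the no-dt rows whose sequence numbers fall inside the current hour's per-host sequence range, i.e. the rows that would explain apparent loss as a false positive:

```python
# Illustrative sketch: classify no-dt rows (seen in hour and hour+1) by
# whether their sequence number falls inside the current hour's per-host
# sequence range. Rows that do belong to the current hour mean the
# "missing" sequences are a false positive, not real loss.

def false_positive_sequences(no_dt_rows, current_hour_range):
    """no_dt_rows: iterable of (hostname, sequence) pairs without dt.
    current_hour_range: {hostname: (min_seq, max_seq)} for the hour in alert.
    Returns the rows whose sequence belongs to the current hour."""
    out = []
    for host, seq in no_dt_rows:
        lo, hi = current_hour_range.get(host, (None, None))
        if lo is not None and lo <= seq <= hi:
            out.append((host, seq))
    return out

# Sequence 105 is inside cp1001's current-hour range (100..200), so that
# row explains part of the apparent loss; sequence 90 belongs elsewhere.
rows = false_positive_sequences([("cp1001", 105), ("cp1001", 90)],
                                {"cp1001": (100, 200)})
# rows == [("cp1001", 105)]
```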

Rerunning a failed webrequest job

When there's a data loss error, the refine job will not trigger. But we might want to execute it anyway to refine at least what's there. To do so, we need to re-run the webrequest-load workflow with high enough warning/error thresholds. However, it is not possible to re-run the job from Hue's coordinator view, because it won't let us change the job properties, unless we execute the job with our user (not hdfs), which will fail. So, we need to re-run from the oozie command line. Here's a step-by-step:

  1. Edit your local copy of the properties file refinery/oozie/webrequest/load/bundle.properties, and replace the property oozie.coord.application.path with oozie.coord.application.path = ${coordinator_file}. This will tell oozie that it should run a coordinator, instead of a bundle.
  2. Scp the modified properties file to your home directory on an-coord1001.eqiad.wmnet.
  3. Ssh into an-coord1001.eqiad.wmnet, and run this command to launch a temporary coordinator that will just execute one hourly workflow for your rerun:
sudo -u hdfs oozie job --oozie $OOZIE_URL \
    -Dstart_time=<START_TIME> \
    -Dstop_time=<STOP_TIME> \
    -Dwebrequest_source=<WEBREQUEST_SOURCE> \
    -Derror_incomplete_data_threshold=100 \
    -Dwarning_incomplete_data_threshold=100 \
    -Derror_data_loss_threshold=100 \
    -Dwarning_data_loss_threshold=100 \
    -submit -config <PATH_TO_PROPERTIES_FILE>

Where <START_TIME> is the exact hour you want to rerun, in YYYY-MM-DDTHH:00Z format, e.g. 2019-01-01T00:00Z. <STOP_TIME> should be the same as <START_TIME> but with 59 in the minute slot, e.g. 2019-01-01T00:59Z. <WEBREQUEST_SOURCE> should be either text or upload, depending on what you want to rerun. And <PATH_TO_PROPERTIES_FILE> should be the path to your modified, scp'ed copy of the properties file.
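The START_TIME/STOP_TIME formatting above can be sketched as a small helper (a convenience illustration, not part of refinery):

```python
from datetime import datetime

def oozie_rerun_times(year, month, day, hour):
    """Build the start/stop strings for a one-hour oozie rerun window,
    in the YYYY-MM-DDTHH:00Z / YYYY-MM-DDTHH:59Z format described above."""
    start = datetime(year, month, day, hour)
    start_time = start.strftime("%Y-%m-%dT%H:00Z")
    stop_time = start.strftime("%Y-%m-%dT%H:59Z")
    return start_time, stop_time

# oozie_rerun_times(2019, 1, 1, 0)
# -> ("2019-01-01T00:00Z", "2019-01-01T00:59Z")
```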