Analytics/Data Lake/Data Issues/2023-01-08 Webrequest Data Loss

Date issue detected: 2023-01-08
Date resolved: 2023-01-09
Start of issue: 2023-01-08
Status: Resolved
Phabricator Ticket(s): https://phabricator.wikimedia.org/T328354, https://phabricator.wikimedia.org/T326658

Summary

On the 8th of January 2023, some caching nodes failed to send part of their webrequest traffic, resulting in data loss between 08:00 and 14:00. We estimate that this caused a loss of about 2.83% of webrequest-text data, leading to underreporting of pageviews and of all other webrequest-derived data.

Description

The caching nodes in the Eqsin datacenter failed to collect some traffic data between 08:00 and 14:00 on the 8th of January 2023. The webrequest-text traffic data was affected during this time period. The table below details the estimated data loss for each affected hour.

Time              Estimated data loss
2023-01-08 08:00  2.84%
2023-01-08 09:00  2.66%
2023-01-08 10:00  2.55%
2023-01-08 11:00  2.916%
2023-01-08 12:00  3.09%
2023-01-08 13:00  3.08%
2023-01-08 14:00  2.717%

In summary, on 2023-01-08 between 08:00 and 14:00, we lost an average of 2.83% of webrequest-text data (from the eqsin datacenter only).
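The average can be cross-checked from the hourly estimates. The following is a minimal sketch, assuming the reported figure is a plain unweighted mean of the seven hourly percentages in the table; the variable and function names are illustrative and not part of any refinery tooling.

```shell
# Hourly loss estimates (percent) copied from the table in this report.
losses="2.84 2.66 2.55 2.916 3.09 3.08 2.717"

# Assumption: the summary figure is an unweighted mean of the hourly values.
# Prints the mean to three decimal places.
average_loss() {
    echo "$losses" | awk '{ for (i = 1; i <= NF; i++) sum += $i; printf "%.3f\n", sum / NF }'
}

average_loss
```

The result is in line with the roughly 2.83% quoted in this report. A mean weighted by hourly request volume could differ slightly, but the request counts are not given here.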

Recommendations

Since we are unable to recover the lost data, we recommend going ahead and processing the data that was received during the affected time period. We also recommend excluding the affected time period (2023-01-08 08:00 through 2023-01-08 14:00) from the analysis report.

Root Cause

The Eqsin datacenter experienced some network issues on the 8th of January 2023.

Affected Datasets

  • Web Requests (webrequest)
  • Pageviews (pageview_hourly, pageview_actor)
  • Projectview hourly
  • Mediacounts
  • Unique Devices
  • Browser general
  • MediaWiki API request
  • Mobile apps sessions and uniques metrics
  • Interlanguage navigation

Followup Steps

No follow-up steps are planned, since the root cause is beyond the data engineering team’s control.

Reproducing the Issues

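Data loss can be verified by running this script, replacing the uppercase placeholders (TABLE, SOURCE, YEAR, MONTH, DAY, HOUR) with the values for the hour being checked: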

sudo -u analytics kerberos-run-command analytics spark2-sql --master yarn -S \
         --jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar \
         -f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql \
         -d table_name=TABLE            \
         -d webrequest_source=SOURCE    \
         -d year=YEAR                   \
         -d month=MONTH                 \
         -d day=DAY                     \
         -d hour=HOUR

If the output of this query contains rows with the false_positive field set to false, there is real data loss.
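Checking the whole affected window means repeating the command once per hour. The sketch below only prints the seven commands (hours 8 through 14 of 2023-01-08) so they can be reviewed before being run. The table name is left as the TABLE placeholder because this report does not state it, and the source value "text" is an assumption based on the webrequest-text data named above. The function and variable names are illustrative.

```shell
# Placeholders: TABLE is not specified in this report and must be supplied.
# "text" is an assumption derived from the webrequest-text data named above.
table_name="${TABLE_NAME:-TABLE}"
webrequest_source="${WEBREQUEST_SOURCE:-text}"

# Print (do not execute) one check command per affected hour.
print_check_commands() {
    for hour in 8 9 10 11 12 13 14; do
        echo "sudo -u analytics kerberos-run-command analytics spark2-sql --master yarn -S" \
             "--jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar" \
             "-f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql" \
             "-d table_name=$table_name -d webrequest_source=$webrequest_source" \
             "-d year=2023 -d month=1 -d day=8 -d hour=$hour"
    done
}

print_check_commands
```

Printing the commands first keeps the sketch safe to run anywhere. Each printed line can then be executed on a host where the analytics user and Kerberos credentials are available.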