Data Platform/Data Lake/Data Issues/2023-01-08 Webrequest Data Loss
Date issue detected: | 2023-01-08 |
Date resolved: | 2023-01-09 |
Start of issue: | 2023-01-08 |
Status: | Resolved |
Phabricator Ticket(s): | https://phabricator.wikimedia.org/T328354 |
Summary
On the 8th of January 2023, some caching nodes failed to send part of their webrequest traffic, resulting in data loss between 08:00 and 14:00. We estimate that this caused a 2.83% loss of webrequest-text data, leading to underreporting of pageviews and of all other webrequest-derived data.
Description
The caching nodes in the Eqsin datacenter failed to collect some traffic data between 08:00 and 14:00 on the 8th of January 2023. Webrequest-text traffic data was affected during this time period. The table below details the estimated data loss for each affected hour.
Time | Estimated Data loss |
2023-01-08 08:00 | 2.84% |
2023-01-08 09:00 | 2.66% |
2023-01-08 10:00 | 2.55% |
2023-01-08 11:00 | 2.916% |
2023-01-08 12:00 | 3.09% |
2023-01-08 13:00 | 3.08% |
2023-01-08 14:00 | 2.717% |
In summary, on 2023-01-08 between 08:00 and 14:00, we lost an average of 2.83% of webrequest-text data (from the Eqsin datacenter only).
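The average can be cross-checked from the table with a short shell sketch. It assumes a simple unweighted mean of the seven hourly estimates; the report does not state how its figure was derived, and `mean_loss` is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# Unweighted mean of the hourly loss estimates listed in the table.
# Assumption: a plain arithmetic mean; hours are not weighted by request volume.
mean_loss() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.3f\n", sum / NR }'
}

mean_loss 2.84 2.66 2.55 2.916 3.09 3.08 2.717
```

The result is roughly 2.836, in line with the reported 2.83%.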
Recommendations
Since we are unable to recover the lost data, we recommend going ahead and processing the data received during the affected time period. We also recommend excluding data for the affected time period (2023-01-08 08:00 through 2023-01-08 14:00) from the analysis report.
Root Cause
The Eqsin datacenter experienced some network issues on the 8th of January 2023.
Affected Datasets
- Web Requests (webrequest)
- Pageviews (pageview_hourly, pageview_actor)
- Projectview hourly
- Mediacounts
- Unique Devices
- Browser general
- MediaWiki API requests
- Mobile apps sessions and uniques metrics
- Interlanguage navigation
Followup Steps
No followup steps since the root cause is beyond the data engineering team’s control.
Reproducing the Issues
Data loss can be verified using this script:
sudo -u analytics kerberos-run-command analytics spark2-sql --master yarn -S \
--jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar \
-f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql \
-d table_name=TABLE \
-d webrequest_source=SOURCE \
-d year=YEAR \
-d month=MONTH \
-d day=DAY \
-d hour=HOUR
If the output of this query contains rows where the false_positive field is false, there is real data loss.
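To check every affected hour, the command above can be repeated once per hour. The following is a minimal dry-run sketch that only prints the commands rather than executing them. `TABLE` remains a placeholder as in the original command. The value `text` for `webrequest_source` and the inclusive hour range 8 to 14 are assumptions, and `print_check_commands` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Dry run: print one data-loss check command per hour instead of running it.
# Assumptions: webrequest_source=text and hours 8..14 inclusive; TABLE is a placeholder.
print_check_commands() {
  local table="$1" source="$2" year="$3" month="$4" day="$5"
  local first_hour="$6" last_hour="$7" hour
  for hour in $(seq "$first_hour" "$last_hour"); do
    echo "sudo -u analytics kerberos-run-command analytics spark2-sql --master yarn -S" \
         "--jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar" \
         "-f /srv/deployment/analytics/refinery/hive/webrequest/check_dataloss_false_positives.sparksql" \
         "-d table_name=${table} -d webrequest_source=${source}" \
         "-d year=${year} -d month=${month} -d day=${day} -d hour=${hour}"
  done
}

print_check_commands TABLE text 2023 1 8 8 14
```

Printing the commands first makes it possible to review the parameters before running anything against the cluster.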