2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic
Inflated unique device numbers in unique_devices_per_project_family_daily/monthly
Status | Closed
Severity | High
Business data steward | Omari Sefu
Technical data steward | Andreas Hoelzl
Incident coordinator | Omari Sefu, Andreas Hoelzl
Incident response team | Joseph Allemandou (lead data engineer), Antoine Quhen, Hamid Ghani (lead analyst), Maya Kampurath
Date detected | Sep 20, 2024
Date resolved | Nov 1, 2024; Jan 27, 2025 (2nd stage backfill)
Start of issue | June 2024 (start of current calculation); July 2024 (observed as a significant anomaly due to the change in traffic patterns)
Phabricator ticket | T373630, T374655 (not public)
Summary
In early September, the Research & Decision Science team detected inflated year-over-year growth of unique devices caused by bot traffic on page redirects (see monthly Movement Metrics) impacting the July and August 2024 timeframe.
The unique_devices_per_project_family_daily|monthly tables in the data lake include web traffic from is_redirect_to_pageview web requests when computing unique device counts. However, our automata labeling is not applied to web traffic flagged as is_redirect_to_pageview.
In October, a suspicious increase in unique device counts for Singapore was detected and escalated as T377257 (“automated traffic detection to be applied at the project family level”). The problem also affected other countries and regions; in the US, for example, we estimate that pageviews were overcounted by approximately 4%.
Unique device counts were affected between July 2024 and November 2024. Numbers that were reported during those time periods were inflated.
The corrections and backfills of the data were completed at the end of January 2025. The backfills corrected data between August and November 2024. We were unable to backfill the July 2024 data, so those numbers remain inflated.
Impact assessment
Metric | Inflation due to inappropriate handling of bot traffic on page redirects | Inflation due to classifying bots on a per-domain basis rather than across projects
Estimated impact to global unique devices | ~11% | ~4%
Estimated impact to US unique devices | 14% | Estimated 1% overcount of monthly unique devices between August and November; the majority of the impact is localized to Singapore
Estimated impact to Singapore unique devices | 90% (7M→750k) |
Estimated impact to global pageviews | None (redirects are not in pageviews) | ~3% (traffic will be recategorized as automated)
Estimated impact to US pageviews | None | ~4%
Time periods of the issue | Most issues occurred in July and August, with some in September (October was affected, but not as strongly) | Most issues occurred in September and October, with some in August and some in November
Fix in place? | Since end of October / start of November | Since beginning of December
The project family fix estimates already include the redirect fix.
*July and August were mostly impacted by the redirect issue. September and October were mostly impacted by the per-domain classification.
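The Singapore figure can be sanity-checked from the two counts quoted in the table. The following is a minimal sketch, assuming that "7M→750k" means roughly 7,000,000 unique devices before the correction and 750,000 after it; the function name is illustrative and not part of any pipeline.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original count removed by the correction."""
    return (before - after) / before

# Assumed reading of "7M→750k" from the impact table.
singapore_before = 7_000_000
singapore_after = 750_000

singapore_reduction = relative_reduction(singapore_before, singapore_after)
print(f"Singapore reduction: {singapore_reduction:.1%}")
```

Under that reading the reduction comes out just under 90%, which is consistent with the rounded figure in the table.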
History
- Early September 2024: The RDS team detected inflated unique device metrics attributed to bot traffic on page redirects.
- September 20, 2024: An incident report was filed, prioritizing the issue under the title "2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic."
- October 1, 2024: The retention window for webrequest data was extended from 3 to 6 months to allow for an extended backfilling period, preserving data starting from July 2024.
- October 10, 2024: During further investigation, an additional system bug affecting webrequest data processing was identified and resolved: 2024-10-10 Webrequest Data Loss - Clobbered Hadoop Temporary Directory. Backfilling efforts commenced but did not include fixes for redirected bot traffic.
- October 15, 2024: A surge in unique device counts in Singapore was detected, indicating further misclassification of automated traffic. The issue was escalated under task T377257 for automated traffic detection at the project family level. Similar impacts were observed in other regions, with the U.S. experiencing an estimated 5% overcount.
- October 31, 2024: The DPE DE team deployed a fix for automated traffic detection, ensuring accurate data reporting from November 1, 2024, onward.
- December 4, 2024: A final fix for automated traffic detection at the project family level was implemented, with data backfilled to December 1, 2024, for consistency.
Recommendations
We recommended applying our automata heuristics to is_redirect_to_pageview web traffic. This requires:
- Extending the retention window to preserve affected data.
- Conducting impact analysis on historical data to determine if the proposed solution resolves the unique device spikes we currently observe.
- Applying the fix to the entire pipeline.
- Annotating the data anomaly, or re-running the pipelines for the affected data that is still within the extended retention window.
Description
By investigating the pageviews and unique devices pipelines, we observed the following, which helps explain the rise in unique devices:
- We identified significant spikes in unique device counts on certain days in July and August 2024, unlike any other month in the past year. These increases were observed exclusively in the "unique devices by project family" table and not in the "unique devices by domain" table, and they predominantly involved 'fresh sessions' (i.e., sessions where cookies were enabled but no cookie was found). Monthly unique device metrics use the unique devices by project family table. Notably, these spikes were also absent from the pageview_hourly data.
- Upon reviewing the logic behind both tables, we found that the "unique devices by project family" table includes web requests flagged as either is_pageview or is_redirect_to_pageview (redirects counted as pseudo-pageviews for tracking purposes), whereas the "unique devices by domain" table only accounts for is_pageview requests.
- Further analysis of the fresh sessions unique to the project family table, which consisted solely of redirects, revealed millions of unique devices linked to a small number of users. These users had unidentified device types and exhibited similar user_agent strings (for instance, certain actor_signatures were associated with up to 500k unique devices in a single day). These requests consistently targeted the same Wikipedia pages and were resolved with a 301 status code.
- These actors were not flagged as automated traffic, as our detection heuristics are applied exclusively to pageviews, not redirects. This oversight explains the disproportionate increase in unique devices during July and August, despite no corresponding rise in actual pageviews.
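The observations above can be illustrated with a toy sketch that contrasts the two tables' request filters and groups redirect-only fresh sessions by actor. Only is_pageview, is_redirect_to_pageview, and actor_signature come from the description; the sample records, the fresh_session field, and the threshold are hypothetical and chosen purely for illustration. This is not the production query.

```python
from collections import Counter

# Toy web requests; all values are invented for illustration.
requests = [
    {"actor_signature": "a1", "is_pageview": True,
     "is_redirect_to_pageview": False, "fresh_session": True},
    {"actor_signature": "a2", "is_pageview": True,
     "is_redirect_to_pageview": False, "fresh_session": False},
] + [
    # One actor issuing many cookie-less redirect requests.
    {"actor_signature": "bot", "is_pageview": False,
     "is_redirect_to_pageview": True, "fresh_session": True}
    for _ in range(5)
]

def by_domain_filter(r):
    # "Unique devices by domain" only considers pageviews.
    return r["is_pageview"]

def by_family_filter(r):
    # "Unique devices by project family" also considers redirects to pageviews.
    return r["is_pageview"] or r["is_redirect_to_pageview"]

def fresh_sessions(reqs, keep):
    """Requests that pass the table's filter and carry no cookie yet."""
    return [r for r in reqs if keep(r) and r["fresh_session"]]

domain_fresh = fresh_sessions(requests, by_domain_filter)
family_fresh = fresh_sessions(requests, by_family_filter)

# Fresh sessions that appear only in the project family table are redirects.
redirect_only = [r for r in family_fresh if not r["is_pageview"]]
per_actor = Counter(r["actor_signature"] for r in redirect_only)

SUSPICIOUS_THRESHOLD = 3  # hypothetical cut-off for this toy data
suspects = [a for a, n in per_actor.items() if n >= SUSPICIOUS_THRESHOLD]

print(len(domain_fresh), len(family_fresh), suspects)
```

In this toy data the project family filter sees six fresh sessions where the domain filter sees one, and the whole difference traces back to a single actor signature, mirroring the pattern described above at a much smaller scale.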
Root Cause
Automata heuristics are not being applied to requests that are flagged as is_redirect_to_pageview.
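A minimal sketch of the root cause and the shape of the fix, under stated assumptions: the volume-based rule below is a hypothetical stand-in for the real automata heuristics, and the agent_type label, threshold, and function names are assumptions made for illustration. The include_redirects switch shows how leaving redirects out of labeling lets a redirect-only actor pass as user traffic.

```python
from collections import Counter

AUTOMATED_THRESHOLD = 3  # hypothetical; the real heuristics are more involved

def label_agent_type(requests, include_redirects):
    """Annotate each request with an agent_type of "user" or "automated".

    include_redirects=False mimics the root cause: only pageviews are
    eligible for labeling, so redirect-only actors are never flagged.
    """
    def eligible(r):
        return r["is_pageview"] or (
            include_redirects and r["is_redirect_to_pageview"]
        )

    volume = Counter(r["actor_signature"] for r in requests if eligible(r))
    labeled = []
    for r in requests:
        automated = (
            eligible(r) and volume[r["actor_signature"]] >= AUTOMATED_THRESHOLD
        )
        labeled.append({**r, "agent_type": "automated" if automated else "user"})
    return labeled

def count_family_requests(labeled):
    """Count user-labeled requests that feed the project family table."""
    return sum(
        1 for r in labeled
        if (r["is_pageview"] or r["is_redirect_to_pageview"])
        and r["agent_type"] == "user"
    )

# Toy data: one reader pageview and five redirects from a single actor.
requests = (
    [{"actor_signature": "reader", "is_pageview": True,
      "is_redirect_to_pageview": False}]
    + [{"actor_signature": "bot", "is_pageview": False,
        "is_redirect_to_pageview": True}] * 5
)

before = count_family_requests(label_agent_type(requests, include_redirects=False))
after = count_family_requests(label_agent_type(requests, include_redirects=True))
print(before, after)
```

With redirects excluded from labeling, all six toy requests count as user traffic; once redirects are eligible, only the reader's pageview remains.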
Affected Datasets and Services
unique_devices_per_project_family_daily
unique_devices_per_project_family_monthly
Reproducing the Issues
Charts, queries, and samples of actor signatures exhibiting the behavior above are logged in this spreadsheet.
Resolution & Decision:
In addition to correcting the data going forward, we decided to backfill historical data with corrections where possible. This decision was reached through a review conducted by key stakeholders, including Kate Zimmerman, Olja Dimitrijevic, Omari Sefu, and Andreas Hoelzl.
Key Considerations:
- The first fix addressing bot traffic on redirects was deemed most critical.
- Not implementing this fix would compromise the Brazil data center analysis and reports needed for regulatory requirements.
- Not implementing a fix for historical data would impact unique device analysis over an extended period of time. It is common for analysis to consider year-over-year changes. Users of the data will observe incorrect data trends on a frequent basis 1-2 years out from the date of the data issue, and less frequently in the years following.
- Impact on HDFS Disk Space: Storing historical data for the backfill meant that disk space was nearing critically low levels (12.5%; 10% is considered critical). We needed a solution that would free up disk space before reaching critically low levels.
- Regulatory Compliance: Ensure accurate unique device reporting before regulatory deadlines.
- Engineering effort and resourcing for other projects.