Data Platform/Data Lake/Data Issues/2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic

From Wikitech

Inflated unique device numbers in unique_devices_per_project_family_daily/monthly

Status Closed
Severity High
Business data steward Omari Sefu
Technical data steward Andreas Hoelzl
Incident coordinator Omari Sefu, Andreas Hoelzl
Incident response team Joseph Allemandou (lead data engineer), Antoine Quhen, Hamid Ghani (lead analyst), Maya Kampurath
Date detected Sep 20, 2024
Date resolved Nov 1, 2024; Jan 27, 2025 (2nd stage backfill)
Start of issue June 2024 - start of current calculation

July 2024 - observed as a significant anomaly due to the change in traffic patterns

Phabricator ticket T373630, T374655 (not public)  

Summary

In early September, the Research & Decision Science team detected inflated year-over-year growth of unique devices caused by bot traffic on page redirects (see monthly Movement Metrics) impacting the July and August 2024 timeframe.

The unique_devices_per_project_family_daily|monthly tables in the data lake include web requests flagged as is_redirect_to_pageview, in addition to is_pageview requests, when computing unique device counts. However, our automata labeling was not applied to traffic flagged as is_redirect_to_pageview.

In October, a suspicious increase in unique device counts for Singapore was detected and escalated: T377257 (“automated traffic detection to be applied at the project family level”). However, the problem also affected other countries and regions; for example, in the US we estimate that pageviews were overcounted by approximately 4%.

Unique device counts were affected between July 2024 and November 2024. Numbers that were reported during those time periods were inflated.

The corrections and backfills of the data were completed at the end of January 2025. The backfills corrected data between August and November 2024. We were unable to backfill the July 2024 data, so those numbers remain inflated.

Impact assessment

| | Inflation due to inappropriate handling of bot traffic on page redirects | Inflation due to classifying bots on a per-domain basis rather than across projects |
| Estimated impact to global unique devices | ~11% | ~4% |
| Estimated impact to US unique devices | 14% | 1% overcount (between Aug and Nov, we estimate a 1% overcount of monthly unique devices in the US related to this issue; the majority of impacts are localized within Singapore) |
| Estimated impact to Singapore unique devices | | 90% (7M→750k) |
| Estimated impact to global pageviews | None (redirects are not in pageviews) | ~3% (traffic will be recategorized as automated) |
| Estimated impact to US pageviews | None | ~4% |
| Time periods of the issue | Majority of issues occurred in July and August, some in September (October was affected, but not as strongly) | Majority of issues occurred in September and October; some issues were in August and some in November |
| Fix in place? | Since end of October / start of November | Since beginning of December |

The project family fix estimates already include the redirect fix.

*July and August were mostly impacted by the redirect issue. September and October were mostly impacted by the per-domain classification.
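
As a quick sanity check on the Singapore figure above, the drop from roughly 7M to 750k unique devices can be worked out directly. This is only an arithmetic sketch on the rounded values quoted in this report; the variable names are illustrative.

```python
# Rounded values quoted for Singapore in the impact assessment.
before_fix = 7_000_000  # inflated unique devices
after_fix = 750_000     # unique devices after automated traffic is reclassified

# Share of the originally reported count that the correction removes.
reduction = (before_fix - after_fix) / before_fix
print(f"{reduction:.1%}")  # prints 89.3%
```

The result is consistent with the approximately 90% impact reported for Singapore.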

History

  • Early September 2024: The RDS team detected inflated unique device metrics attributed to bot traffic on page redirects.
  • September 20, 2024: An incident report was filed, prioritizing the issue under the title "2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic."
  • October 1, 2024: The retention window for webrequest data was extended from 3 to 6 months to allow for an extended backfilling period, preserving data starting from July 2024.
  • October 10, 2024: During further investigation, an additional system bug affecting webrequest data processing was identified and resolved: 2024-10-10 Webrequest Data Loss - Clobbered Hadoop Temporary Directory. Backfilling efforts commenced but did not include fixes for redirected bot traffic.
  • October 15, 2024: A surge in unique device counts in Singapore was detected, indicating further misclassification of automated traffic. The issue was escalated under task T377257 for automated traffic detection at the project family level. Similar impacts were observed in other regions, with the U.S. experiencing an estimated 5% overcount.
  • October 31, 2024: The DPE DE team deployed a fix for automated traffic detection, ensuring accurate data reporting from November 1, 2024, onward.
  • December 4, 2024: A final fix for automated traffic detection at the project family level was implemented, with data backfilled to December 1, 2024, for consistency.

Recommendations

We recommended applying our automata heuristics to is_redirect_to_pageview web traffic. This requires:

  1. Extending the retention window to preserve affected data.
  2. Conducting impact analysis on historical data to determine if the proposed solution resolves the unique device spikes we currently observe.
  3. Applying the fix to the entire pipeline.
  4. Annotating the data anomaly, or re-running the pipelines for the affected data that is still within the extended retention window.

Description

While investigating the pageview and unique devices pipelines, we observed the following, which helps explain the rise in unique devices:

  • We identified significant spikes in unique device counts on certain days in July and August 2024, unlike any other month in the past year. These increases were observed exclusively in the "unique devices by project family" table and not in the "unique devices by domain" table, and they predominantly involved 'fresh sessions' (i.e., sessions where cookies were enabled but no cookie was found). Monthly unique device metrics use the unique devices by project family table. Notably, these spikes were also absent from the pageview_hourly data.
  • Upon reviewing the logic behind both tables, we found that the "unique devices by project family" table includes web requests flagged as either is_pageview or is_redirect_to_pageview (redirects counted as pseudo-pageviews for tracking purposes), whereas the "unique devices by domain" table only accounts for is_pageview requests.
  • Further analysis of the fresh sessions unique to the project family table, which consisted solely of redirects, revealed millions of unique devices linked to a small number of users. These users had unidentified device types and exhibited similar user_agent strings (for instance, certain actor_signatures were associated with up to 500k unique devices in a single day). These requests consistently targeted the same Wikipedia pages and were resolved with a 301 status code.
  • These actors were not flagged as automated traffic, as our detection heuristics are applied exclusively to pageviews, not redirects. This oversight explains the disproportionate increase in unique devices during July and August, despite no corresponding rise in actual pageviews.
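
The difference between the two tables can be illustrated with a toy model. The sketch below is not the production pipeline: the `Request` class, the helper names, and the sample data are hypothetical, and counting each fresh session as one device is a deliberate simplification. It only shows how redirect-only traffic from a single actor can inflate the project family count while leaving the per-domain count unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    actor_signature: str           # hypothetical stand-in for the actor identifier
    is_pageview: bool
    is_redirect_to_pageview: bool
    is_fresh_session: bool         # cookies enabled, but no cookie found


def counted_by_domain(r: Request) -> bool:
    # "unique devices by domain" only accounts for pageviews.
    return r.is_pageview


def counted_by_project_family(r: Request) -> bool:
    # "unique devices by project family" also accepts redirects to pageviews.
    return r.is_pageview or r.is_redirect_to_pageview


def fresh_session_count(requests, selector) -> int:
    # Simplification: every selected fresh session counts as one new device.
    return sum(1 for r in requests if selector(r) and r.is_fresh_session)


# Toy data: three human pageviews plus 1000 redirect-only requests from one actor.
requests = [Request(f"human-{i}", True, False, True) for i in range(3)]
requests += [Request("bot-1", False, True, True) for _ in range(1000)]

print(fresh_session_count(requests, counted_by_domain))          # prints 3
print(fresh_session_count(requests, counted_by_project_family))  # prints 1003
```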

Root Cause

Automata heuristics are not being applied to requests that are flagged as is_redirect_to_pageview.
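
A minimal sketch of what closing this gap could look like is shown below. The actual automata heuristics are not described in this report, so the per-actor threshold, the function names, and the dictionary layout are placeholder assumptions. The one point taken from the report is that redirects are evaluated alongside pageviews before flagged actors are excluded.

```python
from collections import Counter

# Hypothetical threshold; the real automata heuristics are not specified here.
MAX_FRESH_SESSIONS_PER_ACTOR = 100


def is_counted(r) -> bool:
    # Requests eligible for the project family metric.
    return r["is_pageview"] or r["is_redirect_to_pageview"]


def automated_actors(requests):
    """Return actor signatures flagged by the placeholder heuristic."""
    fresh = Counter(
        r["actor_signature"]
        for r in requests
        # Key point: redirects are inspected together with pageviews.
        if is_counted(r) and r["is_fresh_session"]
    )
    return {actor for actor, n in fresh.items() if n > MAX_FRESH_SESSIONS_PER_ACTOR}


def family_fresh_sessions(requests) -> int:
    """Count fresh sessions for the project family metric, excluding flagged actors."""
    flagged = automated_actors(requests)
    return sum(
        1
        for r in requests
        if is_counted(r)
        and r["is_fresh_session"]
        and r["actor_signature"] not in flagged
    )


def make(actor, pageview, redirect):
    # Helper for building toy requests; all of them are fresh sessions.
    return {
        "actor_signature": actor,
        "is_pageview": pageview,
        "is_redirect_to_pageview": redirect,
        "is_fresh_session": True,
    }


sample = [make(f"human-{i}", True, False) for i in range(3)]
sample += [make("bot-1", False, True) for _ in range(1000)]

print(family_fresh_sessions(sample))  # prints 3
```

With the labeling restricted to pageviews, the redirect-only actor would never be flagged; evaluating both request types lets the same exclusion step remove it from the count.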

Affected Datasets and Services

unique_devices_per_project_family_daily

unique_devices_per_project_family_monthly

Reproducing the Issues

Charts, queries, and samples of actor signatures exhibiting the behavior above are logged in this spreadsheet.

Resolution & Decision:

In addition to correcting the data going forward, we decided to backfill historical data with corrections where possible. This decision was reached through a review conducted by key stakeholders, including Kate Zimmerman, Olja Dimitrijevic, Omari Sefu, and Andreas Hoelzl.

Key Considerations:

  • The first fix addressing bot traffic on redirects was deemed most critical.
  • Not implementing this fix would compromise the Brazil data center analysis and reports needed for regulatory requirements.
  • Not implementing a fix for historical data would impact unique device analysis over an extended period of time. It is common for analysis to consider year-over-year changes. Users of the data will observe incorrect data trends on a frequent basis 1-2 years out from the date of the data issue, and less frequently in the years following.
  • Impact on HDFS Disk Space: Storing historical data for the backfill meant that disk space was nearing critically low levels (12.5%; 10% is considered critical). We needed a solution that would free up disk space before reaching critically low levels.
  • Regulatory Compliance: Ensure accurate unique device reporting before regulatory deadlines.
  • Engineering effort and resourcing for other projects.