Analytics/Data Lake/Data Issues/2021-02-09 Unique Devices By Family Overcount

From Wikitech
Date issue detected: 2022-08-24
Date resolved: 2022-09-27
Start date: 2021-02-09
End date: 2022-06-30
Status: Resolved
Phabricator Ticket(s): https://phabricator.wikimedia.org/T316572

https://phabricator.wikimedia.org/T305841 https://phabricator.wikimedia.org/T319213

Summary

Unique devices by family metrics has been overcounted by approx ~5% globally between Feb 9, 2021 and June 30, 2022

Description

During the August 2022 migration of the unique devices data pipeline from Oozie/Hive to Airflow/Spark (T305841), due to a difference in the resulting datasets, we established that some of the unique devices metrics were overcounted since Feb, 2021. This overcount data issue:

  • affects only the unique_devices_per_project_family data sets
  • affects only ~50% of the data (reference)
  • is greater in smaller buckets like “Wikiversity in Taiwan” (overcounting ~50%), but still significant in bigger buckets like “Wikipedia in the US” (overcounting ~6%)
  • amounts to approximately ~5% overcount across all wikis

It is important to understand that separately from the above, the traffic data loss that extended from June 2021 until January 2022 affected unique devices, in addition to pageviews. There isn’t a good way to adjust for that data.

Stats for the overcount are available in this spreadsheet.

Root Cause

A Hive 2.5.3 bug (HIVE-21778) affected the computation of unique devices (code). The bug was introduced with the Hadoop cluster upgrade from CDH to BigTop (T244499) on Feb 2nd 2021.

The overcount is due to counting as unique devices requests that are TLS redirects (http→https), which don’t have the x-analytics header set. The x-analytics header not being present triggered the bug in Hive which failed in filtering out those requests.

Logic for setting the X-analytics header:

https://github.com/wikimedia/puppet/blob/eac5e68b7d2b9952f9f18de46f82b23caa0cfb1b/modules/varnish/templates/analytics.inc.vcl.erb#L253

Hive issues:

https://issues.apache.org/jira/browse/HIVE-21778

https://dba.stackexchange.com/questions/271571/testing-a-hive-array-for-is-null-says-not-null

Affected Datasets

Hive

  • wmf.unique_devices_per_project_family_daily
  • wmf.unique_devices_per_project_family_monthly

Druid

  • unique_devices_per_project_family_daily
  • unique_devices_per_project_family_monthly

Cassandra

  • local_group_default_T_unique_devices

Unique_devices_per_domain_daily/monthly datasets in Hive and Druid have not been affected.

There are no other datasets with code managed under source control affected by the Hive bug (not necessarily related to unique_devices). However there may be other hive code and corresponding datasets that are not managed in the code repository that could be affected and analysts are advised to do their own due diligence.

Affected services

Some services consumed the data sets above during the period it was affected.

Only requests with project=’all-wikipedia-projects’

(or any other project family like ‘all-wiktionary-projects’) were affected.

  • Wikistats2 unique devices https://stats.wikimedia.org/

Only views with project=’all-wikipedia-projects’

(or any other project family like ‘all-wiktionary-projects’) were affected.

  • A number of charts in Superset

Only those using Hive’s or Druid’s unique_devices_per_project_family_(daily|monthly) were affected.

Remediation

While this issue was detected in 2022-08, data for 2022-07 and 2022-08 which was already affected was deleted and backfilled with the correct data (https://phabricator.wikimedia.org/T319213). This correction was applied on the following Hive datasets at the end of Sep 2022 :

  • wmf.unique_devices_per_project_family_daily
  • wmf.unique_devices_per_project_family_monthly

Reproducing the bug

This is a reproduction of the Hive bug using HQL in a Hive interactive command line. Note that the field x_analytics_map of the wmf.pageview_actor table is of type map<string,string>.

 -- We first count the number of records of a given table for a particular day.
 SELECT count(*) FROM wmf.pageview_actor WHERE year=2022 AND month=8 AND day=14;
 -- returns 878866438
 
 -- We repeat the query, this time counting only records with the field x_analytics_map == NULL.
 SELECT count(*) FROM wmf.pageview_actor WHERE year=2022 AND month=8 AND day=14 AND x_analytics_map IS NULL;
 -- returns  23486331
 
 -- We repeat, counting now the records with the field x_analytics_map != NULL.
 SELECT count(*) FROM wmf.pageview_actor WHERE year=2022 AND month=8 AND day=14 AND x_analytics_map IS NOT NULL;
 -- returns 878866438
 
 -- Note that Hive finds 23486331 records with x_analytics_map == NULL,
 -- but then it says all records (878866438) have x_analytics_map != NULL.
 
 -- Other ways of using the IS NULL / IS NOT NULL operators seem to fail as well.
 SELECT x_analytics_map IS NULL, count(*) FROM wmf.pageview_actor WHERE year=2022 AND month=8 AND day=14 GROUP BY x_analytics_map IS NULL;
 -- returns false 878866438
 SELECT x_analytics_map IS NOT NULL, count(*) as freq FROM wmf.pageview_actor WHERE year=2022 AND month=8 AND day=14 GROUP BY x_analytics_map IS NOT NULL;
 -- returns true  878866438
 -- None of those queries indicates the presence of 23486331 rows with x_analytics_map == NULL.


https://github.com/wikimedia/analytics-refinery/blob/master/oozie/unique_devices/per_project_family/daily/unique_devices_per_project_family_daily.hql#L35