Analytics/Data Lake/Traffic/Unique Devices

From Wikitech

How is this data computed

We compute this data using the Last-Access cookie. For details see Analytics/Data Lake/Traffic/Unique Devices/Last access solution and m:Research:Unique Devices.

Tables schema

As of 2017-07, there are 4 'unique devices' tables available in the wmf database on Hive:

  • unique_devices_per_domain_daily stores unique device counts per domain (e.g. en.m.wikipedia.org), split by country, per day
  • unique_devices_per_domain_monthly stores unique device counts per domain, split by country, per month
  • unique_devices_per_project_family_daily stores unique device counts per project family (e.g. Wikipedia), split by country, per day
  • unique_devices_per_project_family_monthly stores unique device counts per project family, split by country, per month
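
To check which of these tables are present and which columns they expose, standard Hive metadata statements can be used. This is a minimal sketch, assuming a Hive client with read access to the wmf database; the wildcard pattern is only illustrative.

```sql
-- List the tables in the wmf database whose names start with "unique_devices"
-- (Hive uses * as the wildcard in SHOW TABLES patterns).
SHOW TABLES IN wmf LIKE 'unique_devices*';

-- Show the column names and types of one of the tables.
DESCRIBE wmf.unique_devices_per_domain_daily;
```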

unique_devices_per_domain_daily / unique_devices_per_domain_monthly

Column                 Type    Description
domain                 string  Lowercased domain accessed (e.g. en.wikipedia.org)
country                string  Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code           string  2-letter country code
uniques_underestimate  int     Underestimate of unique devices, based on the Last-Access cookie and the nocookies header: unique devices that came to a given host at least twice
uniques_offset         int     Unique devices offset, computed as 1-action sessions without cookies
uniques_estimate       int     Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year                   int     Unpadded year of requests
month                  int     Unpadded month of requests
day                    int     Unpadded day of requests (only in the unique_devices_..._daily tables)

unique_devices_per_project_family_daily / unique_devices_per_project_family_monthly

Column                 Type    Description
project_family         string  Lowercased project family accessed (e.g. wikipedia or wikivoyage)
country                string  Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code           string  2-letter country code
uniques_underestimate  int     Underestimate of unique devices, based on the Last-Access global cookie and the nocookies header: unique devices that came to a given project family at least twice
uniques_offset         int     Unique devices offset, computed as 1-action sessions without cookies
uniques_estimate       int     Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year                   int     Unpadded year of requests
month                  int     Unpadded month of requests
day                    int     Unpadded day of requests (only in the unique_devices_per_project_family_daily table)
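
To illustrate how the three count columns relate, the query below lists them per country for one domain and one day, next to a recomputed sum. It is only a sketch: the date and domain are arbitrary examples, and recomputed_estimate is a hypothetical alias introduced here, not a column of the table. Going by the column descriptions, recomputed_estimate would be expected to match uniques_estimate on every row.

```sql
-- Per-country breakdown for one domain and day (example values).
SELECT
  country,
  country_code,
  uniques_underestimate,
  uniques_offset,
  uniques_estimate,
  -- Hypothetical alias: underestimate plus offset, for comparison
  uniques_underestimate + uniques_offset AS recomputed_estimate
FROM wmf.unique_devices_per_domain_daily
WHERE year = 2017 AND month = 4 AND day = 1
  AND domain = 'en.wikipedia.org'
ORDER BY uniques_estimate DESC
LIMIT 10;
```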

Sample queries to get total uniques for a given domain or project family for a day

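-- Total unique devices for one domain on one day, summed over all countries
SELECT
  SUM(uniques_estimate)
FROM wmf.unique_devices_per_domain_daily
WHERE year=2015 AND month=12 AND day=24
  AND domain = 'es.wikipedia.org';

-- Total unique devices for one project family on one day, summed over all countries
SELECT
  SUM(uniques_estimate)
FROM wmf.unique_devices_per_project_family_daily
WHERE year=2017 AND month=4 AND day=1
  AND project_family = 'wikipedia';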
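
The monthly tables can be queried the same way. Per the schema, they have no day column, so the filter uses only year and month. The following is a hedged sketch with arbitrary example values. The aliases total_uniques and uniques are hypothetical names introduced here.

```sql
-- Total unique devices for one project family in one month (example values).
SELECT
  SUM(uniques_estimate) AS total_uniques
FROM wmf.unique_devices_per_project_family_monthly
WHERE year = 2017 AND month = 4
  AND project_family = 'wikipedia';

-- Ten countries with the most unique devices for one domain in one month.
SELECT
  country_code,
  SUM(uniques_estimate) AS uniques
FROM wmf.unique_devices_per_domain_monthly
WHERE year = 2017 AND month = 4
  AND domain = 'es.wikipedia.org'
GROUP BY country_code
ORDER BY uniques DESC
LIMIT 10;
```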

Data Quality

The Last-Access-based uniques metric has proven to be highly variable for small projects.

Please read Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#Data_Quality_Analysis.

Changes and Known Problems with Dataset

  • 2016-02-19: Monthly per-domain data is available as of January 2016.

Date from                                  Date until  Task          Details
2021-02-09                                 2022-06-30  task T316572  Unique devices by family metrics were overcounted by approximately 5% globally. For more details, read Analytics/Data Lake/Data Issues/2021-02-09 Unique Devices By Family Overcount
2020-06-24 (daily) / 2020-06-01 (monthly)              task T250744  Quality improvement through removal of automated traffic. See Analytics/Data Lake/Traffic/Unique Devices/Automated traffic correction
2018-05-30                                 2018-06-03  task T199517  Unique devices for Wikidata increased by 170% in June.
start                                      2017-05-18  task T165661  Per-domain unique devices computation excluded countries that did not have either underestimates or offsets until 2017-05-18.
start                                      2017-06-11  task T167005  Per-domain unique devices computation was undercounting fresh sessions (offset) by about 10% until 2017-06-11.
2016-11-04                                 2017-02-14  task T165560  Artificial spike in the unique devices offset on Wikidata from November 2016 to February 2017, likely related to the varnish4 rollout.

See also