Analytics/Data Lake/Traffic/Unique Devices


How this data is computed

We compute this data using the Last-Access cookie. For details see Analytics/Unique_Devices/Last_access_solution and m:Research:Unique Devices.

Tables schema

As of 2017-07, there are 4 'unique devices' tables available in the wmf database on Hive:

  • unique_devices_per_domain_daily stores unique device counts per domain (e.g. en.m.wikipedia.org) split by country per day
  • unique_devices_per_domain_monthly stores unique device counts per domain split by country per month
  • unique_devices_per_project_family_daily stores unique device counts per project family (e.g. Wikipedia) split by country per day
  • unique_devices_per_project_family_monthly stores unique device counts per project family split by country per month

unique_devices_per_domain_daily / unique_devices_per_domain_monthly

Column                 Type    Description
domain                 string  Lowercased domain accessed (en.wikipedia.org, for instance)
country                string  Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code           string  2-letter country code
uniques_underestimate  int     Underestimate of unique devices based on the Last-Access cookie and the nocookies header. Unique devices that came to a given host at least twice.
uniques_offset         int     Unique devices offset, computed as 1-action sessions without cookies.
uniques_estimate       int     Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year                   int     Unpadded year of requests
month                  int     Unpadded month of requests
day                    int     Unpadded day of requests (only for the unique_devices_..._daily tables)

unique_devices_per_project_family_daily / unique_devices_per_project_family_monthly

Column                 Type    Description
project_family         string  Lowercased project family accessed (wikipedia or wikivoyage, for instance)
country                string  Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code           string  2-letter country code
uniques_underestimate  int     Underestimate of unique devices based on the Last-Access global cookie and the nocookies header. Unique devices that came to a given project family at least twice.
uniques_offset         int     Unique devices offset, computed as 1-action sessions without cookies.
uniques_estimate       int     Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year                   int     Unpadded year of requests
month                  int     Unpadded month of requests
day                    int     Unpadded day of requests (only for the unique_devices_per_project_family_daily table)
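
As a rough illustration of how the three count columns relate, the following sketch lists the per-country breakdown for one domain and day and recomputes the estimate from its two components. The domain and date are arbitrary examples taken from the sample queries on this page, and the recomputed_estimate alias is made up for this illustration.

```sql
-- Per-country breakdown for one domain and one day.
-- Per the schema above, uniques_estimate is described as
-- uniques_underestimate plus uniques_offset, so the last two
-- columns are expected to match on every row.
SELECT
  country_code,
  country,
  uniques_underestimate,
  uniques_offset,
  uniques_estimate,
  uniques_underestimate + uniques_offset AS recomputed_estimate  -- illustrative alias
FROM wmf.unique_devices_per_domain_daily
WHERE year = 2015 AND month = 12 AND day = 24
  AND domain = 'es.wikipedia.org'
ORDER BY uniques_estimate DESC
LIMIT 10;
```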

Sample queries to get total uniques for a given host or project_family for a day

-- Total unique devices for one domain on one day
SELECT
  SUM(uniques_estimate)
FROM wmf.unique_devices_per_domain_daily
WHERE year=2015 AND month=12 AND day=24
  AND domain = 'es.wikipedia.org';

-- Total unique devices for one project family on one day
SELECT
  SUM(uniques_estimate)
FROM wmf.unique_devices_per_project_family_daily
WHERE year=2017 AND month=4 AND day=1
  AND project_family = 'wikipedia';
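
The monthly tables can be queried in the same way, minus the day partition. This is a minimal sketch, assuming the chosen month falls within the period the table covers; the date and the monthly_uniques alias are arbitrary choices for illustration.

```sql
-- Total unique devices for one project family over one month.
-- The monthly tables have no day column, so only year and month are filtered.
SELECT
  SUM(uniques_estimate) AS monthly_uniques  -- illustrative alias
FROM wmf.unique_devices_per_project_family_monthly
WHERE year = 2017 AND month = 4
  AND project_family = 'wikipedia';
```

Summing the daily table over a month is not a substitute for this query: a device that visits on several days is counted once per day in the daily table, so the sum of daily values overstates the monthly count.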

Data Quality

The Last-Access based uniques metric has proven to be highly variable for small projects.

Please read Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#Data_Quality_Analysis.
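
One rough way to gauge this variability for a given domain is to compare the spread of its daily totals with their mean. The query below is only a sketch and is not the method used in the linked analysis; the domain, the month, and all aliases are arbitrary choices for illustration.

```sql
-- Day-to-day spread of the daily estimate for one domain over one month.
SELECT
  AVG(daily_uniques)                             AS mean_daily_uniques,
  STDDEV_POP(daily_uniques)                      AS stddev_daily_uniques,
  STDDEV_POP(daily_uniques) / AVG(daily_uniques) AS coefficient_of_variation
FROM (
  -- One row per day: total estimate across all countries.
  SELECT
    day,
    SUM(uniques_estimate) AS daily_uniques
  FROM wmf.unique_devices_per_domain_daily
  WHERE year = 2017 AND month = 4
    AND domain = 'es.wikivoyage.org'  -- arbitrary example of a smaller project
  GROUP BY day
) daily_totals;
```

A higher coefficient of variation means noisier daily counts relative to the mean, which suggests treating single-day values for that domain with more caution.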

Changes and Known Problems with Dataset

  • 2016-02-19: Monthly per-domain data is available as of January 2016.

Date from   Date until  Task          Details
start       2017-05-18  Task T165661  Per-domain unique devices computation excluded countries that didn't have either underestimates or offset until 2017-05-18.
start       2017-06-11  Task T167005  Per-domain unique devices computation was under-counting fresh sessions (offset) by about 10% until 2017-06-11.
2016-11-04  2017-02-14  Task T165560  Artificial spike in offset of unique devices from November to February on Wikidata, likely related to the Varnish 4 rollout.
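
To see how an issue such as the offset spike shows up in the data, the offset can be compared with the underestimate month by month around the affected period. This is a sketch under assumptions: it takes 'www.wikidata.org' as the affected domain, which the table above does not state explicitly, and the aliases are illustrative.

```sql
-- Monthly offset vs. underestimate for one domain, October 2016 to March 2017.
SELECT
  year,
  month,
  SUM(uniques_underestimate) AS underestimate_total,
  SUM(uniques_offset)        AS offset_total,
  SUM(uniques_estimate)      AS estimate_total
FROM wmf.unique_devices_per_domain_monthly
WHERE domain = 'www.wikidata.org'  -- assumed domain for "Wikidata"
  AND ((year = 2016 AND month >= 10) OR (year = 2017 AND month <= 3))
GROUP BY year, month
ORDER BY year, month;
```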

See also