Analytics/Unique Devices/Last access solution


Objective

The WMF Analytics team counts unique devices per project per day and month in a way that does not uniquely identify, fingerprint or otherwise track users. The outcome is reports on the number of Unique Devices per project for a given month or day. This is achieved by setting cookies with a Last-Access day on clients and counting sightings of browsers with an old cookie or no cookie at all.

Deliverable

A report in the following format:

Bucket                                              2015-03        2015-04        ...
en.wikipedia                                        200,000,000    210,000,000    ...
es.wikipedia                                        20,000,000     21,000,000     ...
en.wikisource                                       2,000,000      2,100,000      ...
es.wikisource                                       200,000        210,000        ...
...                                                 ...            ...            ...
overall total (not deduplicated across projects)    500,000,000    510,000,000    ...

Caveats

  • To report uniques per project (wiki), we set a WMF-Last-Access cookie per project.
  • The reported overall total is a sum of unique devices to each domain and includes duplicates (the same browser on the same computer visits multiple wikis). We do not think it is possible to de-duplicate with a Last-Access approach because we do not have a common ending for all our domains (like *.wikipedia.org). For example, cookies for *.wikipedia.org and *.wikidata.org cannot be shared. To count uniquely across domains we would need another domain (central.wikipedia.org) and a set of redirects among our domains to this centralized place to set cookies.

Bots

We need to filter bots out of our report because the cookie system overcounts them: a bot might not accept cookies and would therefore count as a distinct device on every request. This is easier said than done, but we use requests tagged with 'nocookies' as a means to estimate the percentage of our traffic that comes from bots not tagged as such. See https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution/BotResearch

Privacy

  • Users can delete or refuse cookies
  • We are not able to identify users from the data passed in the cookie. The cookie contains only a year, month and day.
  • We comply with Wikimedia's Privacy Policy

Technicalities

To produce the above report we need the following cookies. Each cookie stores the last access time per project.

WMF-Last-Access:

  • <<language>>.m.<<project>>.org: mobile site uniques for <<project>> and <<language>>
  • <<language>>.<<project>>.org: desktop site uniques for <<project>> and <<language>>


How will we be counting: Plain English

Unique devices are computed by adding two numbers: one derived from the WMF-Last-Access cookie, and an offset.

A very high level explanation of how this works can be found here: https://blog.wikimedia.org/2016/03/30/unique-devices-dataset/. For a more technical explanation you can keep on reading.

Using value of WMF-Last-Access cookie

Inside Varnish we set the cookies and alter the X-Analytics header (https://wikitech.wikimedia.org/wiki/X-Analytics). There are two possible cases per cookie:

1) A request comes in and the user does not have a WMF-Last-Access cookie. We issue one whose value is the last access date (day, month and year) and whose expiry time is in the future (any expiry time over a month will work). The cookie value is, for example, "14-Dec-2015".

2) A request comes in and the user already has a WMF-Last-Access cookie. We re-issue the cookie with a future expiration date and copy the old date into the X-Analytics header. In our example, one day has gone by between requests, so the value of the cookie is reset to "15-Dec-2015" and we store the following in the X-Analytics hash:

X-analytics["WMF-Last-Access"] = "14-Dec-2015"
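The two cases above can be sketched in Python. This is only an illustration of the logic, not the Varnish code that runs in production; the function name `handle_request`, the dictionary shapes and the 32-day expiry are assumptions made for the example.

```python
from datetime import date, timedelta

COOKIE_NAME = "WMF-Last-Access"
DATE_FORMAT = "%d-%b-%Y"  # renders as e.g. "14-Dec-2015" in the default C locale


def handle_request(cookies, today, ttl_days=32):
    """Return (set_cookie, x_analytics) for a single incoming request.

    cookies  -- dict of cookie name -> value sent by the client (hypothetical shape)
    today    -- date on which the request is received
    ttl_days -- assumed expiry; the text only requires "over a month"
    """
    x_analytics = {}

    # Case 2: the client already has the cookie, so the old date is
    # copied into X-Analytics before the cookie is overwritten.
    previous = cookies.get(COOKIE_NAME)
    if previous is not None:
        x_analytics[COOKIE_NAME] = previous

    # Cases 1 and 2: (re-)issue the cookie with today's date.
    set_cookie = {
        "name": COOKIE_NAME,
        "value": today.strftime(DATE_FORMAT),
        "expires": today + timedelta(days=ttl_days),
    }
    return set_cookie, x_analytics
```

For instance, a first visit on 14 December 2015 yields a cookie with value "14-Dec-2015" and an empty X-Analytics entry; a visit the next day with that cookie yields a new cookie "15-Dec-2015" and `x_analytics["WMF-Last-Access"] == "14-Dec-2015"`.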

To count unique devices in the cluster, we select from the webrequest table all requests for, say, January that do not have a January date in X-Analytics["WMF-Last-Access"] (this includes requests without any date at all). All of those are January uniques, because they are requests that came in during January without a January date in the WMF-Last-Access cookie.

The same logic applies to daily counts: to count uniques for December 15th we select all requests for December 15th whose X-Analytics["WMF-Last-Access"] value is a date older than December 15th. The request in our example above would therefore be counted. Those are the uniques for December 15th.
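As a rough illustration of this selection rule, the sketch below applies it to a list of in-memory records instead of the webrequest table. The record fields `day` and `last_access` are hypothetical names chosen for the example, and the sketch leaves out any filtering of bot or cookieless traffic.

```python
from datetime import date, datetime

DATE_FORMAT = "%d-%b-%Y"  # same format as the cookie value, e.g. "14-Dec-2015"


def parse_last_access(value):
    """Parse the X-Analytics WMF-Last-Access value; None means no date was sent."""
    return datetime.strptime(value, DATE_FORMAT).date() if value else None


def is_daily_unique(request_day, last_access):
    """A request counts for its day if it carries no date or an older date."""
    seen = parse_last_access(last_access)
    return seen is None or seen < request_day


def is_monthly_unique(request_day, last_access):
    """A request counts for its month if it carries no date from that same month."""
    seen = parse_last_access(last_access)
    return seen is None or (seen.year, seen.month) != (request_day.year, request_day.month)


def count_uniques(requests, is_unique):
    """Count the requests that satisfy the given uniqueness rule."""
    return sum(1 for r in requests if is_unique(r["day"], r.get("last_access")))
```

Calling `count_uniques(requests, is_daily_unique)` on the requests of one day gives the cookie-based daily figure; passing `is_monthly_unique` over a month of requests gives the monthly one.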

Note that this method of counting assumes that requests come from real users that accept cookies: we assume that if we set a cookie we will be able to retrieve it in a subsequent request. This is true only for browser clients that accept cookies. Although we only look at traffic tagged as "user" in the cluster when counting, we have to be aware of bots that are not reported as such. To discount those requests we only count requests that have nocookie=0, meaning that those requests came to us with 'some' cookie set. This method of counting, by definition, underreports users, as it does not count users with a fresh session or users browsing without cookies.


Nocookie Offset

Per the X-Analytics documentation, every request that comes in without any cookies whatsoever is tagged with nocookie=1. These requests come from bots, from users browsing with cookies off, or from users in an "incognito" mode and thus a fresh browser session. We did some research in this regard and it turns out that nocookie=1 is a cheap proxy to rule out a good part of what might be bot traffic, see Analytics/Unique_clients/Last_access_solution/BotResearch.

When possible, we want to make sure that we count devices that might be coming to Wikipedia with a fresh session without any cookies at all. Thus, we also count as uniques those requests with nocookie=1 whose signature appears only once in a day or month. The signature is calculated as a hash of (ip, user_agent, accept_language) per project. The idea behind this reasoning is that, if you are a real user and you did not refresh your browser session during the day, there is only one request you could make without cookies: the first one. Subsequent requests will send the WMF-Last-Access cookie.

This methodology has two caveats:

  • It underreports fresh sessions on mobile, due to NAT-ing of IP addresses that are shared among many users (see [1] and [2]).
  • It overreports a device whose IP changes frequently and/or whose cookies are deleted frequently, as it will appear as two different fresh sessions to our logic. This is a less prevalent occurrence than the underreporting on mobile described above.

We add this offset to the numbers that result from looking at WMF-Last-Access cookie.
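A minimal sketch of the offset computation follows. The source only says that the signature is a hash of (ip, user_agent, accept_language) per project; the choice of SHA-256, the field separator and the record field names below are assumptions made for the example.

```python
import hashlib
from collections import Counter


def signature(project, ip, user_agent, accept_language):
    """Hash the fields that identify a cookieless request within a project.

    SHA-256 and the unit-separator delimiter are illustrative choices only.
    """
    raw = "\x1f".join([project, ip, user_agent, accept_language])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def nocookie_offset(requests):
    """Per project, count signatures seen exactly once among nocookie=1 requests."""
    counts = Counter(
        (r["project"], signature(r["project"], r["ip"], r["user_agent"], r["accept_language"]))
        for r in requests
        if r["nocookie"] == 1
    )
    offsets = Counter()
    for (project, _sig), seen in counts.items():
        if seen == 1:  # a repeated signature is treated as likely bot traffic
            offsets[project] += 1
    return dict(offsets)
```

The reported number for a project and period would then be the cookie-based count plus `nocookie_offset(requests)[project]`, with `requests` restricted to that day or month.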

How big of a percentage does the offset represent from the total?

For projects with more than 100,000 uniques, the offset represents between 5% and 60% of the daily total, with high variability depending on the project. The offset represents a smaller percentage on mobile domains, as we know that fresh sessions are underreported on mobile. The offset also represents a higher percentage of the monthly numbers, as fresh sessions are likely to be more numerous as we expand the time period over which we count them.

Figure: Offsets percentage daily.png
Figure: Offsets percentage monthly.png

Data Quality Analysis

We recommend that, if you are using the unique devices numbers, you only consider projects with at least 1000 (one thousand) daily uniques, because projects with fewer than 1000 uniques show too much random variation for the data to be actionable. In other words, the data is too noisy if the number of uniques is <1000.

How did we determine this 1000 number

We ran an analysis on our daily dataset to measure randomness, taking into account the weekly rhythm of our traffic. We took 6 weeks of data, of which one week is taken as a reference to compute the variation of the five other weeks, day by day and per domain.

Example: on Dec 23, a Friday, we compute for es.m.wikipedia how much the number of unique devices (say, 100) differs from the number of unique devices on the Friday of the reference week (say, 110). Since this difference is 10, our variation is 10%, or 0.1. We compute the standard deviation of all variations over the 6-week period per domain. Small standard deviations are good, as they mean that variation is acceptable; standard deviations >1 mean that data varied more than 100% from the reference week. We consider that to be too large and a sign that the data is too "noisy".
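The computation in this example can be sketched as below. Several details are assumptions: the variation is taken as the absolute difference divided by the observed day's count (which reproduces the 0.1 of the example; dividing by the reference value is another possible reading), and the population standard deviation is used because the source does not say which variant was applied.

```python
from statistics import pstdev


def variation(observed, reference):
    """Relative difference between a day's count and the reference week's count.

    Assumption: the denominator is the observed value, so that
    observed=100 and reference=110 give 0.1 as in the text.
    """
    return abs(observed - reference) / observed


def domain_noise(weeks, reference_index=0):
    """Standard deviation of day-by-day variations for one domain.

    weeks -- list of weeks, each a list of 7 daily unique-device counts
             in the same weekday order (hypothetical input shape)
    """
    reference = weeks[reference_index]
    variations = [
        variation(week[day], reference[day])
        for i, week in enumerate(weeks)
        if i != reference_index
        for day in range(7)
    ]
    return pstdev(variations)
```

Under the threshold described in the text, a domain whose `domain_noise` exceeds 1 would be considered too noisy.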

We also computed, for each project, the median value of unique devices over the 6 weeks of data.

This data is plotted below, using log scales.

Figure: Daily uniques devices quality analysis.png

This chart shows that projects with fewer unique devices show more variation.

Future work?

Deploy last-access cookie on (e.g.) *.wikipedia.org to count "per project" across all languages: phab:T138027.

More docs

  • 15 min tech talk (starting at minute 19): [3] and slides: [4]
  • The actual queries used to calculate the daily and monthly numbers from the webrequest table