Analytics/Data Lake/Traffic/Unique Devices/Last access solution


Objective

The WMF Analytics team counts unique devices per domain and per project family, daily and monthly, in a way that does not uniquely identify, fingerprint or otherwise track users. The outcome is a set of reports on the number of unique devices per domain or project family for a given month or day. This is achieved by setting cookies with a last-access day on clients and counting sightings of browsers with an old cookie or no cookie at all.

Deliverable

Reports in the following format:

Per-domain, monthly:

Domain              2016-03      2016-04      ...
en.wikipedia.org    200,000,000  210,000,000  ...
es.m.wikipedia.org  20,000,000   21,000,000   ...
en.wikisource.org   2,000,000    2,100,000    ...
es.wikisource.org   200,000      210,000      ...
...                 ...          ...          ...

Per-domain, daily:

Domain              2016-03-01   2016-03-02   ...
en.wikipedia.org    200,000,000  210,000,000  ...
es.m.wikipedia.org  20,000,000   21,000,000   ...
en.wikisource.org   2,000,000    2,100,000    ...
es.wikisource.org   200,000      210,000      ...
...                 ...          ...          ...

Project family, monthly:

Host                2017-03      2017-04      ...
wikipedia.org       200,000,000  210,000,000  ...
wikisource.org      2,000,000    2,100,000    ...
wikivoyage.org      200,000      210,000      ...
...                 ...          ...          ...

Project family, daily:

Host                2017-03-01   2017-03-02   ...
wikipedia           200,000,000  210,000,000  ...
wikisource          2,000,000    2,100,000    ...
wikivoyage          200,000      210,000      ...
...                 ...          ...          ...

Caveats

Bots

We need to filter bots out of our reports because the cookie system overcounts them. A bot may not accept cookies, so each of its requests would be counted as a distinct device.

This is easier said than done, but we use requests tagged with 'nocookies' as a means to identify the percentage of our traffic that comes from bots not tagged as such.

https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution/BotResearch

Redirects

Redirects (HTTP response codes 301, 302 and 307) were originally filtered out of the unique devices computation. This is still the case for per-domain unique devices, but redirects have to be included in the per-project-family computation. See the Technicalities section for more details.

Privacy

  • Users can delete or refuse cookies
  • We are not able to identify users from the data passed in the cookie. The cookie contains only a year, month and day.
  • We comply with Wikimedia's Privacy Policy

Technicalities

To produce the reports above we need the following cookies. Each cookie stores the last access date per project.

  • WMF-Last-Access:
      • <<language>>.m.<<project>>.org: mobile site uniques for <<project>> and <<language>>
      • <<language>>.<<project>>.org: desktop site uniques for <<project>> and <<language>>
  • WMF-Last-Access-Global:
      • *.<<project>>.org: uniques for <<project>>

How will we be counting: Plain English

Unique devices are computed by adding two numbers: one derived from the WMF-Last-Access or WMF-Last-Access-Global cookie, and an offset. (In the rest of this section, we will use WMF-Last-Access, knowing that the same mechanism applies to WMF-Last-Access-Global.)

A very high level explanation of how this works can be found on our blog: https://blog.wikimedia.org/2016/03/30/unique-devices-dataset/.

For a more technical explanation you can keep on reading.

Using value of WMF-Last-Access cookie

Inside Varnish we set the cookies and alter the X-Analytics header. Two possible cases per cookie:

1) Request comes in, user does not have a WMF-Last-Access cookie: we issue one whose value is the last access date (day, month and year), with a future expiry time (any expiry time over a month will work). The cookie value is, for example, "14-Dec-2015".

2) Request comes in, user already has a WMF-Last-Access cookie: we re-issue the cookie with a future expiration date and record the old date in the X-Analytics header. In our prior example, one day has gone by between requests, so the value of the cookie is reset to "15-Dec-2015" and we store the following in the X-Analytics hash:

X-analytics["WMF-Last-Access"] = "14-Dec-2015"
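The two cases above can be sketched in Python as follows. This is only an illustration of the logic, not the actual Varnish configuration: the function name handle_request, the 32-day COOKIE_LIFETIME and the returned tuple are assumptions made for the example. The date format relies on English month abbreviations, which is what Python produces in its default locale.

```python
from datetime import date, timedelta

DATE_FORMAT = "%d-%b-%Y"              # renders as e.g. "14-Dec-2015"
COOKIE_LIFETIME = timedelta(days=32)  # assumption: anything over a month works


def handle_request(cookie_value, today):
    """Return (new_cookie_value, expires, x_analytics) for one request.

    cookie_value is the incoming WMF-Last-Access value, or None if absent.
    """
    x_analytics = {}
    if cookie_value is not None:
        # Case 2: expose the previous access date for later counting.
        x_analytics["WMF-Last-Access"] = cookie_value
    # Cases 1 and 2: (re)issue the cookie stamped with today's date.
    new_cookie_value = today.strftime(DATE_FORMAT)
    return new_cookie_value, today + COOKIE_LIFETIME, x_analytics


# First visit on 14 December, second visit one day later.
first = handle_request(None, date(2015, 12, 14))
second = handle_request(first[0], date(2015, 12, 15))
print(first[0], first[2])
print(second[0], second[2])
```

The second call carries the old date in the X-Analytics entry while the cookie itself moves forward to the current day.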

In order to count unique devices in the cluster, we get from the webrequest table all requests for, say, January that do not have a January date set on X-Analytics["WMF-Last-Access"] (this includes requests without any date at all). All of those are January uniques, because they are requests that came in during January without a January date in the WMF-Last-Access cookie.

The same logic applies to daily counts: to count uniques on December 15th we get all requests for December 15th whose X-Analytics["WMF-Last-Access"] value is a date older than December 15th. The request in our example above will therefore be counted. Those are uniques for December 15th.
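As a rough sketch of the monthly and daily rules just described, the following Python treats each request as a (request day, WMF-Last-Access value) pair. The real computation runs against the webrequest table; the function names and the tuple layout here are hypothetical, and treating a missing date as a daily unique is an assumption that follows the "same logic" wording.

```python
from datetime import date, datetime

DATE_FORMAT = "%d-%b-%Y"


def parse_last_access(value):
    """Parse an X-Analytics WMF-Last-Access value; None means no date."""
    return datetime.strptime(value, DATE_FORMAT).date() if value else None


def is_daily_unique(request_day, last_access):
    # Unique for the day if the stored date is older, or missing (assumption).
    last = parse_last_access(last_access)
    return last is None or last < request_day


def is_monthly_unique(request_day, last_access):
    # Unique for the month if the stored date is not in the request's month.
    last = parse_last_access(last_access)
    if last is None:
        return True
    return (last.year, last.month) != (request_day.year, request_day.month)


def count_uniques(requests, is_unique):
    return sum(1 for day, last_access in requests if is_unique(day, last_access))


requests = [
    (date(2015, 12, 15), "14-Dec-2015"),  # older day, same month
    (date(2015, 12, 15), "15-Dec-2015"),  # already seen today
    (date(2015, 12, 15), "28-Nov-2015"),  # older day, older month
    (date(2015, 12, 15), None),           # no date at all
]
daily = count_uniques(requests, is_daily_unique)
monthly = count_uniques(requests, is_monthly_unique)
print(daily, monthly)
```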

Note that this method of counting assumes that requests come from real users that accept cookies, so we are assuming that if we set a cookie we are going to be able to retrieve it in a subsequent request. This is true only in the case of browser clients that accept cookies. Although while counting we are only looking at traffic tagged as "user" in the cluster, we have to be aware of bots that are not reported as such. In order to discount those requests, we only count requests that have nocookie=0, meaning that those requests came to us with 'some' cookie set. This method of counting, by definition, underreports users as we will not be counting users with a fresh session or users browsing without cookies.


Nocookie Offset

Per the X-Analytics documentation, every request that comes in without any cookies whatsoever is tagged with nocookie=1. These requests come from bots, from users browsing with cookies off, or from users using an "incognito" mode and thus a fresh browser session. We did some research on this, and it turns out that nocookie=1 is a cheap proxy to rule out a bunch of what might be bot traffic; see Analytics/Unique_clients/Last_access_solution/BotResearch.

When possible, we want to make sure that we count devices that might be coming to Wikipedia with a fresh session without cookies at all. Thus, we also count as uniques the requests with nocookie=1 whose signature appears only once in a day or month. The signature is calculated with a hash of (ip, user_agent, accept_language) per project. The reasoning is that if you are a real user and you did not refresh your browser session during the day, there is only one request you could make without cookies: the first one. Subsequent requests will send the WMF-Last-Access cookie.

This methodology has two caveats:

  • It underreports fresh sessions on mobile, due to NAT-ing of IP addresses that are shared among many users (see [1] and [2]).
  • It overreports a device whose IP changes frequently and/or whose cookies are deleted frequently, as it will appear as two different fresh sessions to our logic. This is less prevalent than the mobile underreporting described above.

We add this offset to the numbers that result from looking at WMF-Last-Access cookie.
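The offset described in this section can be sketched as below. Several details are assumptions made for illustration only: SHA-256 as the hash, the dictionary field names, and counting signature occurrences among nocookie=1 requests only (the text can also be read as counting across all requests).

```python
import hashlib
from collections import Counter


def signature(request):
    """Hash of (ip, user_agent, accept_language), kept separate per project."""
    fields = (request["ip"], request["user_agent"],
              request["accept_language"], request["project"])
    return hashlib.sha256("\x1f".join(fields).encode("utf-8")).hexdigest()


def nocookie_offset(requests):
    """Count nocookie=1 signatures seen exactly once in the period."""
    seen = Counter(signature(r) for r in requests if r["nocookie"] == 1)
    return sum(1 for n in seen.values() if n == 1)


def total_uniques(cookie_based_uniques, requests):
    # The offset is added to the number derived from WMF-Last-Access.
    return cookie_based_uniques + nocookie_offset(requests)


def make_request(ip, nocookie):
    # Hypothetical helper to build sample requests.
    return {"ip": ip, "user_agent": "UA", "accept_language": "en",
            "project": "en.wikipedia.org", "nocookie": nocookie}


sample = [
    make_request("10.0.0.1", 1),  # single cookieless request: counted
    make_request("10.0.0.2", 1),  # repeated cookieless requests: likely a bot
    make_request("10.0.0.2", 1),
    make_request("10.0.0.3", 0),  # has cookies: handled by the cookie count
]
offset = nocookie_offset(sample)
print(offset, total_uniques(5, sample))
```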

How big of a percentage does the offset represent from the total?

For projects with more than 100,000 uniques, the offset represents between 5% and 60% of the total if counted daily; variability is high depending on the project. The offset represents a smaller percentage on mobile domains, as we know numbers are underreported for fresh sessions on mobile. The offset also represents a higher percentage of monthly numbers, as fresh sessions are likely to be more numerous as we expand the time period over which we count them.

Offsets percentage daily.png
Offsets percentage monthly.png


One other thing to keep in mind about the offset is that a high offset doesn't mean bad quality data. In fact, for project families with a mostly fact-checking usage pattern, having (many) more offsets than cookie-based unique devices is expected. Wiktionary is a good example: every now and then you go there to check a spelling or whether a word exists, but following internal links is not as usual a pattern on Wiktionaries as it is on Wikipedias.

The redirect issue on unique device counts for project families

When a mobile device with a fresh session (no cookies) visits the desktop version of one of our projects, for example www.wikidata.org, it gets redirected with a 302 to the mobile version of the website, here m.wikidata.org. In this transaction two cookies are set:

1. Cookie on global domain (*.wikidata.org) when server responds with 302
2. Cookie set on m.wikidata.org on the 200 response from the Wikidata mobile site.

Our per-domain computation filters 301/302 requests (as those are not pageviews). That works well in the per-domain case, as the cookie is set on the 200 response. But it doesn't work for global domain, as the cookie is being set "earlier".

In our example, if we filter out redirects for the project-family computation (counting devices on *.wikidata.org) and we only count the 200 responses (pageviews), we would be missing fresh sessions that exhibit the behaviour described above. In order to solve that issue, in June 2017 we updated the filter for project-family unique devices computation to accept redirects that lead to a pageview (phab:T167005).
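The effect of the filter change can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the field names (status, leads_to_pageview, last_access_global) are hypothetical, the nocookie offset is ignored, and any kept request without a WMF-Last-Access-Global date from today counts as a sighting.

```python
from datetime import date

REDIRECT_STATUSES = {301, 302, 307}


def counts_as_sighting(request, day):
    last = request["last_access_global"]
    return last is None or last < day


def family_uniques(requests, day, include_redirects):
    """Count project-family sightings, optionally keeping redirects."""
    total = 0
    for r in requests:
        is_page = r["status"] == 200
        is_kept_redirect = (include_redirects
                            and r["status"] in REDIRECT_STATUSES
                            and r["leads_to_pageview"])
        if (is_page or is_kept_redirect) and counts_as_sighting(r, day):
            total += 1
    return total


day = date(2017, 6, 1)
# Fresh mobile session: the 302 sets the global cookie, so the following
# 200 response already carries today's date.
fresh_mobile_session = [
    {"host": "www.wikidata.org", "status": 302,
     "leads_to_pageview": True, "last_access_global": None},
    {"host": "m.wikidata.org", "status": 200,
     "leads_to_pageview": False, "last_access_global": day},
]
without_redirects = family_uniques(fresh_mobile_session, day, False)
with_redirects = family_uniques(fresh_mobile_session, day, True)
print(without_redirects, with_redirects)
```

With redirects filtered out the device is missed; with redirects that lead to a pageview kept, it is counted once.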

Data Quality Analysis

Unique devices per-domain

That is, unique devices for en.m.wikimedia.org (mobile site) or en.wikimedia.org (desktop site).

We recommend that if you are using the unique devices number per domain, you consider only domains having at least 1000 (one thousand) unique devices daily. Domains with fewer than 1000 unique devices show too much random variation for the data to be actionable. In other words, data is too noisy if the number of unique devices is less than 1000 daily.
  • Depending on the project family you are looking at (Wikipedia or Wiktionary, for instance), it is worth keeping an eye on uniques_offset. For projects where you can follow links (like Wikipedia), uniques_offset is less important, but for projects mostly used for fact checking (Wiktionary, for instance), uniques_offset represents a much larger portion of the total uniques.
  • We also recommend being very aware of the variability issues when looking at uniques per country (only for WMF employees, or people under NDA), as the variability per host per country is higher than the variability per host.

How we determined this 1000 number for the per-domain uniques

We ran an analysis on our daily dataset trying to measure randomness, taking into account the weekly rhythm of our traffic. We took 9 weeks of data: one week is taken as the reference, and for each of the other eight weeks we compute the variation day by day, per domain.

Example: on Dec 23, a Friday, we compute for es.m.wikipedia how much the number of unique devices we have (say, 100) differs from the number of unique devices on the Friday of the reference week (say, 110). If this difference is 10, our variation is 10%, or 0.1. Since we have 8 weeks of data plus one reference week, we have a series of 8 points per day of the week. Given this series, we compute the standard deviation of all variations over the 8-week period, per domain. Small standard deviations are good, as they mean that variation is acceptable; standard deviations >1 mean that data varied more than 100% from the reference week. We consider that to be too large and a sign that the data is too "noisy".
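A minimal sketch of this calculation for a single domain is shown below. The text does not pin down every detail, so the following are assumptions: the variation is a signed difference divided by the reference week's count, all eight weeks of daily variations are pooled into one series, and the sample standard deviation (statistics.stdev) is used.

```python
from statistics import stdev


def variation(observed, reference):
    """Relative difference from the reference week's value for the same weekday."""
    return (observed - reference) / reference


def variation_stddev(reference_week, other_weeks):
    """Standard deviation of all daily variations across the other weeks."""
    variations = [
        variation(week[d], reference_week[d])
        for week in other_weeks
        for d in range(len(reference_week))
    ]
    return stdev(variations)


def is_too_noisy(std, threshold=1.0):
    # A standard deviation above 1 is treated as too noisy, per the text.
    return std > threshold


# Made-up daily counts for one domain: a reference week and eight other weeks.
reference_week = [100, 120, 110, 105, 115, 90, 80]
other_weeks = [[v + 5 * (i % 3) for v in reference_week] for i in range(8)]
std = variation_stddev(reference_week, other_weeks)
print(round(std, 4), is_too_noisy(std))
```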

We also computed, for each project, the median value of unique devices over the 9 weeks of data.

This data is plotted below, using log scales.

Uniques per domain-variation analysis.png

In this chart the red line represents the mean of the unique_devices estimate. Projects are plotted from higher to lower: the left side of the chart has projects like es.wikipedia.org, which have millions of users and a variation of less than 0.1 (10%). The right side of the chart plots projects with a small number of uniques. The blue line represents the variation, and it can easily be seen that it is a lot higher for those projects, sometimes as high as 100% (a standard deviation of 1 in this case).

Unique devices per project-family

Daily per-project-family unique devices (on *.wikimedia.org, for instance) display less variation: in the plot below we can see that standard deviations are a lot lower. In the case of the wikipedia.org domain the variability is about 2%. Note that these calculations aggregate results for a project family (example: *.wikipedia.org). If you are splitting results further (per country, for example) you should take into account that variability will be higher.

Uniques project wide-variation analysis.png

Numbers

Variation calculations are done with data from March and April 2017, week of reference being from March 1st to March 7th.

Remember that the standard deviation is computed on a variation, meaning a value of 0.02 is actually a variation of 2% of the total number of uniques.

Host                 Standard deviation  Median       Total uniques estimate
wikipedia            0.023682            139541265.0  6892468332
wiktionary           0.044993            2065876.0    101815953
wikimedia            0.511323            737828.0     37875366
wikibooks            0.059730            486944.0     23910646
wikiquote            0.064770            371341.0     18325283
wikisource           0.051465            304530.5     15110349
wikiversity          0.092480            85188.0      4190639
wikidata             0.051855            67684.5      3361208
wikivoyage           0.067449            58717.0      2926678
wikimediafoundation  0.145124            40250.0      2067248
mediawiki            0.115052            28805.5      1325261
wikinews             0.275252            17114.0      873688
