Analytics/Data Lake/Traffic/Unique Devices/Last access solution/Validation

This article describes the work done to validate the unique-device numbers generated using the last-access cookie. Throughout this article, "users" means "devices".

Why validate?

The WMF-Last-Access cookie comes from the X-Analytics request header into our refined webrequest logs. The value of the cookie is either null or a date in dd-MMM-yyyy format, like 30-Jun-2015. The number of uniques for a day can be estimated by counting all requests where the cookie is null or set to a date earlier than the day being calculated. Given that the cookie expires after 31 days, the nulls come from users who have never visited the site before or whose cookie has expired, and the older dates come from people who are visiting the site for the first time that day. In both cases, the response sets the cookie to the current date, so requests from that user for the rest of the day can be ignored. For example, a request on 10-Jun-2015 carrying the cookie value 05-Jun-2015 counts as a unique, the response resets the cookie to 10-Jun-2015, and that user's subsequent requests that day are not counted.

However, if a bot does not register as a spider, or a user refuses cookies, the value sent will always be null, and each of their requests gets counted as a unique. We attempted to figure out approximately what percentage our uniques count was off by, so that we could offset our numbers by that much.

Terms

Overcount: the estimate of uniques we have, bloated by the high number of requests with a null cookie value coming from bots and from users who do not accept cookies; hence the name. Overcount = uniques estimated through the last-access cookie method.

Repeats: to get a lower bound on uniques, we also compute a number that ignores the null counts and only counts requests with an earlier date set. This is the number of users who have visited the site repeatedly within an arbitrary 31-day window; hence the name. Even though this number is of no real significance on its own, it establishes the lower end of the interval in which our true uniques must lie.

Other Caveats

The cookie is deployed per domain, so numbers across domains cannot be added up. For example, to estimate uniques for enwiki as a whole (desktop and mobile), the numbers for en.wikipedia.org and en.m.wikipedia.org cannot simply be summed, since the same device may appear under both. Consequently, most of the numbers in this report are calculated for a single domain, most commonly en.wikipedia.org.

SQL to count uniques

This is the basic HiveQL to count daily and monthly overcounts and repeats. Users with access to stat1002 can run these queries with the hive command-line client (for example, by saving a query to a file and running hive -f <file>) or through Hue.

Daily overcounts and repeats grouped by uri_host, for June 10 2015

USE wmf;

-- Daily overcount and repeats per domain. A request counts toward overcount
-- when the WMF-Last-Access cookie is missing (new device, expired cookie,
-- bot, or cookies refused) or carries a date earlier than the request's own
-- date (first visit of the day). Repeats counts only the latter case.
SELECT uri_host,
       SUM(IF(x_analytics_map['WMF-Last-Access'] IS NULL
              OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                  unix_timestamp(to_date(dt), 'yyyy-MM-dd')),
              1, 0)) AS overcount,
       SUM(IF(x_analytics_map['WMF-Last-Access'] IS NOT NULL
              AND (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                   unix_timestamp(to_date(dt), 'yyyy-MM-dd')),
              1, 0)) AS repeats
FROM webrequest
WHERE x_analytics_map IS NOT NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND year = 2015
  AND month = 6
  AND day = 10
GROUP BY uri_host
ORDER BY overcount DESC LIMIT 5000;
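
The per-domain sample numbers later on this page were computed for a single host. A minimal variant of the query above, restricted to en.wikipedia.org instead of grouping, would look as follows (illustrative; not necessarily the exact query that was run):

USE wmf;
-- Daily overcount and repeats for one domain (sketch).
SELECT SUM(IF(x_analytics_map['WMF-Last-Access'] IS NULL
              OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                  unix_timestamp(to_date(dt), 'yyyy-MM-dd')),
              1, 0)) AS overcount,
       SUM(IF(x_analytics_map['WMF-Last-Access'] IS NOT NULL
              AND (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                   unix_timestamp(to_date(dt), 'yyyy-MM-dd')),
              1, 0)) AS repeats
FROM webrequest
WHERE x_analytics_map IS NOT NULL
  AND uri_host = 'en.wikipedia.org'
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND year = 2015 AND month = 6 AND day = 10;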

Monthly overcounts and repeats grouped by uri_host, for June 2015

USE wmf;

-- Monthly overcount and repeats per domain. Parsing to_date(dt) with the
-- pattern 'yyyy-MM' consumes only the year and month, yielding the timestamp
-- of the first of the month; a cookie that is missing or dated before the
-- first of the month marks the device's first visit in the month.
SELECT uri_host,
       SUM(IF(x_analytics_map['WMF-Last-Access'] IS NULL
              OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                  unix_timestamp(to_date(dt), 'yyyy-MM')),
              1, 0)) AS overcount,
       SUM(IF(x_analytics_map['WMF-Last-Access'] IS NOT NULL
              AND (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                   unix_timestamp(to_date(dt), 'yyyy-MM')),
              1, 0)) AS repeats
FROM webrequest
WHERE x_analytics_map IS NOT NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND year = 2015
  AND month = 6
GROUP BY uri_host
ORDER BY overcount DESC LIMIT 5000;

Sample numbers for en.wikipedia.org from May - September 2015

month  day  overcount  repeats
5      1    92605826   13771272
5      2    78458424   10768860
5      3    79083954   12577781
...
5      31   46871121   16132685
6      1    58419722   22250586
6      2    60721513   22543029
...
6      30   45369844   18148775
7      1    44480024   17798867
7      2    46065819   17589004
...
7      30   48625679   17770151
7      31   48195861   17082829
8      1    38402477   13236340
8      2    36832098   14130095
...
8      30   37230296   13854064
8      31   46552884   19112208
9      1    48486743   19500402
...
9      9    61030474   19925116
9      10   53026323   19815455

[Graph: overcount and repeats trend from May to September 2015 (en.wikipedia.org)]

This graph plots the uniques from May 1 to September 10, 2015. The huge spikes in August show how bot activity can have an unpredictable impact on these numbers.

Monthly numbers from May-August 2015 for en.wikipedia.org

month  overcount      repeats
5      2,866,072,382  110,839,956
6      973,394,414    95,424,659
7      969,566,054    85,946,392
8      927,450,746    93,108,562

Validation approaches

A few approaches were tried to see if we could find a percentage range that our overcount numbers are off by.

Fingerprinting for a short time

The first idea was to count uniques under the assumption that the combination of an IP address and a user agent identifies a user. For a short range of dates, we calculated overcount, fingerprinted, and repeats numbers for en.wikipedia.org to see what we could learn.

Since the cookie was only deployed on April 28th, the high overcount numbers on the first few days can be attributed to many users arriving without the cookie. The numbers seem to stabilize over the last few days, and all three lines follow the same trend. However, the overcount numbers are roughly 2.3 times the fingerprinted numbers. Fingerprinting itself has multiple drawbacks, and for daily numbers we initially assumed fingerprinted counts would be about 80% accurate. With the overcount 2.3 times higher than the fingerprinted count, it remains unclear which of the two the true unique count is closer to. Further, this approach cannot be extended to mobile, because many users are assigned the same public IP, and it cannot be used to calculate monthly uniques either. If the overcount and fingerprint numbers had been close (within 20% or so), we might have been able to observe a longer period and come to some conclusion, but the numbers we got did not help us much.
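
For illustration, here is a minimal HiveQL sketch of the fingerprinted count, assuming a device is approximated by the (client_ip, user_agent) pair in wmf.webrequest; this is a reconstruction under those assumptions, not necessarily the exact query that was run:

USE wmf;
-- Daily "fingerprinted uniques": distinct IP plus user-agent combinations.
SELECT COUNT(DISTINCT concat(client_ip, '|', user_agent)) AS fingerprinted_uniques
FROM webrequest
WHERE agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND uri_host = 'en.wikipedia.org'
  AND year = 2015 AND month = 6 AND day = 10;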

Comparing numbers with mobile UUID-based uniques

We have existing uniques numbers for the mobile apps (Android and iOS), based on the wmfuuid value set in X-Analytics. We checked whether our last-access-based uniques for the mobile apps were similar. The drawback of this approach is that the apps are not browsers and do not necessarily handle cookies the same way. However, we did find that on Android the numbers were quite close, off by only 5-6%. The iOS numbers were inconsistent and the comparison inconclusive. But the Android trend tells us that our counting mechanism is valid, and in a bot-free universe it would give us uniques.
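
A minimal sketch of the UUID-based count we compared against, assuming the app install ID is exposed as the wmfuuid key of x_analytics_map (the exact production query may differ):

USE wmf;
-- Daily app uniques by install ID (sketch). A per-platform split (Android
-- vs. iOS) would add a user_agent filter, e.g. LIKE '%Android%'
-- (illustrative, not verified).
SELECT COUNT(DISTINCT x_analytics_map['wmfuuid']) AS uuid_uniques
FROM webrequest
WHERE x_analytics_map['wmfuuid'] IS NOT NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND year = 2015 AND month = 6 AND day = 10;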

Bot filtering

Regex-based user-agent bot filtering

The first approach was to filter out commonly known bots by regex matching on the user-agent string. This is the regular expression we used:

".*([Bb]ot|[Ss]pider|WordPress|AppEngine|AppleDictionaryService|Python-urllib|python-requests|Google-HTTP-Java-Client|[Ff]acebook|[Yy]ahoo|RockPeaks).*|Java/1\\..*|curl.*|PHP/.*|-|"
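
Applied in HiveQL, the filter amounts to one extra predicate on the daily query. A sketch, not the exact production query: the '^(...)$' anchoring is added here to emulate full-string matching, since RLIKE otherwise matches substrings (and the regex's trailing empty alternative would then match every user agent).

USE wmf;
-- Daily overcount for en.wikipedia.org, excluding user agents that match
-- the bot regex over the whole string.
SELECT SUM(IF(x_analytics_map['WMF-Last-Access'] IS NULL
              OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                  unix_timestamp(to_date(dt), 'yyyy-MM-dd')),
              1, 0)) AS overcount_excluding_bots
FROM webrequest
WHERE x_analytics_map IS NOT NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND webrequest_source IN ('mobile', 'text')
  AND uri_host = 'en.wikipedia.org'
  AND NOT user_agent RLIKE '^(.*([Bb]ot|[Ss]pider|WordPress|AppEngine|AppleDictionaryService|Python-urllib|python-requests|Google-HTTP-Java-Client|[Ff]acebook|[Yy]ahoo|RockPeaks).*|Java/1\\..*|curl.*|PHP/.*|-|)$'
  AND year = 2015 AND month = 7 AND day = 1;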

This was done on en.wikipedia.org for July 2015, and here's a snapshot of the results:

day       overcount  repeats   overcount_excluding_bots  repeat_excluding_bots
7/1/2015  44480024   17798867  43176159                  17796810
7/2/2015  46065819   17589004  44641079                  17586246
7/3/2015  39400385   14773861  38011278                  14771951

The regex-based filtering reduced the overcounted uniques by an average of 4.5% over all days in July. This is good, but not sufficient to narrow down our estimates.

How many requests makes a bot

Another short test: assuming an IP/user-agent combination identifies a user, if a user issued a very large number of requests, say n, all with the last-access cookie set to NULL, those requests were probably coming from a bot. To estimate n, we plotted the distribution of request counts to the enwiki desktop site and took the second-highest peak of the distribution, 450 requests, as the cutoff. This test is harder to run, as it involves building a blacklist of all users with over 450 requests and no cookie value set, and excluding them while counting uniques. We tried this for three days, with the following results:

day       overcount_with_bots  repeat_with_bots  overcount_excluding_bots2  repeat_excluding_bots2  % (overcount excluding bots / overcount with bots)
7/1/2015  44480024             17798867          27942921                   16380726                62.8212813
7/2/2015  46065819             17589004          26410269                   15777171                57.33159547
7/3/2015  39400385             14773861          23167075                   13789624                58.79910818

This difference is much more significant and, on the surface, promising. However, the assumption cannot be extended across wikis, especially smaller ones. It also cannot be extended to mobile devices, because 3G IP assignment means many users share the same IP, nor can it be used to estimate monthly uniques.
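
To make the procedure concrete, here is a minimal HiveQL sketch of the two-step count, assuming the 450-request cutoff and the (client_ip, user_agent) fingerprint; this is a reconstruction under those assumptions, not the exact queries that were run:

USE wmf;
-- Step 1: blacklist of IP/user-agent pairs with over 450 cookie-less
-- requests on the day. Step 2: recount uniques excluding those pairs,
-- via a left anti-join.
WITH blacklist AS (
  SELECT client_ip, user_agent
  FROM webrequest
  WHERE x_analytics_map['WMF-Last-Access'] IS NULL
    AND agent_type = 'user'
    AND is_pageview = TRUE
    AND webrequest_source IN ('mobile', 'text')
    AND uri_host = 'en.wikipedia.org'
    AND year = 2015 AND month = 7 AND day = 1
  GROUP BY client_ip, user_agent
  HAVING COUNT(*) > 450
)
SELECT SUM(IF(w.x_analytics_map['WMF-Last-Access'] IS NULL
              OR (unix_timestamp(w.x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                  unix_timestamp(to_date(w.dt), 'yyyy-MM-dd')),
              1, 0)) AS overcount_excluding_bots2,
       SUM(IF(w.x_analytics_map['WMF-Last-Access'] IS NOT NULL
              AND (unix_timestamp(w.x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <
                   unix_timestamp(to_date(w.dt), 'yyyy-MM-dd')),
              1, 0)) AS repeat_excluding_bots2
FROM webrequest w
LEFT JOIN blacklist b
  ON w.client_ip = b.client_ip AND w.user_agent = b.user_agent
WHERE b.client_ip IS NULL
  AND w.agent_type = 'user'
  AND w.is_pageview = TRUE
  AND w.webrequest_source IN ('mobile', 'text')
  AND w.uri_host = 'en.wikipedia.org'
  AND w.year = 2015 AND w.month = 7 AND w.day = 1;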

Conclusion

At the moment, we have concluded that the variability in the number of bot requests is far too high to determine a percentage range by which the last-access-cookie-based uniques can be offset. We would have to pursue serious statistics-based bot detection on our request stream, or an alternative solution such as setting this cookie via JavaScript. Otherwise, it does not seem possible to estimate uniques using this method.