Analytics/Data Lake/Traffic/Pageviews/Bots Research

From Wikitech

Goal: Estimating undetected Bot traffic and users without cookies/incognito mode in pageview data

Our pageview filtering only tags requests as bots when they identify as such in the user agent. We know this method misses a good deal of other (perhaps malicious) bot traffic. We tag traffic that arrives without cookies with nocookie=1. The objective of this brief research is to estimate how much of this nocookie traffic could correspond to bots, and how much could correspond to users landing on our site with a fresh session, without cookies or with cookies disabled.

Results

Bots

In the hour of pageview data we looked at (for all projects), at least 1.7% of desktop pageviews marked as coming from "users" actually come from automated traffic.

For the hour of traffic we looked at (all projects), "user nocookie traffic" is about 15% of all pageviews. Of this 15% of nocookie traffic, counting only "self-identified" robot traffic with user agents such as 'any', at least 11% of user pageviews labeled with nocookie=1 come from bots. This represents about 1.7% of total pageviews for the hour.
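As a minimal arithmetic check of the hourly figures above (using the rounded shares from the text):

```python
nocookie_share = 0.15         # nocookie pageviews as a share of all pageviews
bot_share_of_nocookie = 0.11  # self-identified bots within the nocookie slice

# 11% of the 15% nocookie slice is roughly the 1.7% of total pageviews quoted
bot_share_of_total = nocookie_share * bot_share_of_nocookie
print(f"{bot_share_of_total:.2%}")  # 1.65%
```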

Requests with no cookies

We assume that any unique signature with 1 request in an hour is a true user. This leaves us with the likelihood that 6.7% of our pageviews come from users without any cookies (about 43% of traffic tagged as nocookie), because they are either browsing with cookies disabled, using incognito mode, or they have not visited Wikipedia for a while and our cookies have expired. Note that (other than for users browsing with cookies disabled) only the first pageview from a user can be tagged nocookie=1; subsequent pageviews will have cookies set.

.. and the rest?...

Well, for user agents like "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36" making 8000+ requests an hour, we need to do a bit more research to see in which bucket they land. These requests amount to about 55% of our nocookie traffic, which is quite significant.

Methodology

We take 1 hour of requests marked as pageviews on desktop for all projects, excluding pageviews tagged as coming from bots but including pageviews marked with nocookie=1. We bucket the data by computing a unique signature (hash(ip, user_agent, accept_language)) and count the number of requests per unique signature in that hour.
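The bucketing step can be sketched in Python. The `signature` helper and the sample request tuples here are hypothetical stand-ins for Hive's hash() over (ip, user_agent, accept_language) applied to real webrequest rows:

```python
from collections import Counter

def signature(ip, user_agent, accept_language):
    # Stand-in for Hive's hash(ip, user_agent, accept_language)
    return hash((ip, user_agent, accept_language))

# Hypothetical requests seen within one hour: (ip, user_agent, accept_language)
requests = [
    ("10.0.0.1", "any", "en"),
    ("10.0.0.1", "any", "en"),
    ("10.0.0.2", "Mozilla/5.0 ...", "fr-fr"),
]

# Number of pageview requests per unique signature in the hour
per_signature = Counter(signature(*r) for r in requests)
for sig, count in per_signature.most_common():
    print(sig, count)
```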

We look more deeply at unique signatures that have more than 100 requests per hour to find automated traffic.

We compare this traffic with overall traffic for all projects in that hour to get an idea of how large the nocookie traffic is as a percentage of the cookie traffic.

We compare our hourly numbers with daily numbers as a quality check; they are very similar.

Details

We have many signatures with more than 100 pageviews per hour with nocookie=1. Below are the top requesters for the hour; the first number is the number of pageviews for that user agent. As can be seen, some of these are obviously bots, but for some it is hard to say. Note the huge number of pageviews.

73542 any
51348 testAgent
44940 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/40.0
28129 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0
17283 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; fr-fr) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10
17050 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13
14247 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
12920 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36
9918 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13


For some user agents that look like bots we have calculated the percentage of requests the bot look-alike is responsible for in an hour, both for nocookie traffic and for overall traffic labeled as "user". That is, in the given hour, the bot with user agent "any" below is responsible for 0.92% of overall pageviews labeled as user, and for 4.57% of what we label as "user" traffic with nocookie=1. Remember our data includes all projects.


Unique signatures with large numbers of requests in a given hour; only the UA of each signature is printed.
% total pageviews   % nocookie=1 pageviews   Requests   User Agent
0.92% 4.57% 73542 any
0.64% 3.19% 51348 testAgent
0.20% 0.99% 15916 Blackboard Safeassign
0.12% 0.60% 9579 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.11% 0.56% 8977 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.11% 0.55% 8781 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.10% 0.52% 8318 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.09% 0.47% 7574 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.09% 0.44% 7123 Mozilla
0.09% 0.43% 6856 Weberknecht
0.07% 0.33% 5311 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.06% 0.31% 4994 Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
0.06% 0.29% 4657 Mail.ru/2.0
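The two percentage columns can be reproduced from the raw counts. A sketch with hypothetical hourly totals (the real totals are not listed on this page, so the outputs will not match the table exactly):

```python
# Hypothetical hourly totals; the real values are not given on this page.
total_user_pageviews = 8_000_000
nocookie_pageviews = 1_600_000

def table_row(count):
    # Percentages as in the table above: share of all "user" pageviews,
    # and share of the nocookie=1 subset, for one signature's request count
    return (100 * count / total_user_pageviews,
            100 * count / nocookie_pageviews)

pct_total, pct_nocookie = table_row(73542)  # the "any" user agent
print(f"{pct_total:.2f}% {pct_nocookie:.2f}%")
```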

Code

Compute signatures and number of pageviews per signature for traffic for all projects marked with nocookie=1

use wmf;

-- One hour of desktop pageviews labeled "user" with no cookies, all projects
with uniques as (
    select hash(ip, user_agent, accept_language) as id, ip, user_agent
    from webrequest
    where access_method="desktop"
      and x_analytics_map['nocookies']=1
      and is_pageview=1
      and agent_type="user"
      and year=2015 and month=12 and day=01 and hour=01
)

select id, count(*) as request_count, ip, user_agent
from uniques
group by id, ip, user_agent;

Calculate how many signatures only appear once within the hour

If the file format is tab-separated like: <signature> <request_count> <ip> <user_agent>

-2132421387	1	xx.x.x.xxx	 Mozilla/5.0 (Windows NT 6.1; rv:41.0) Gecko/20100101 Firefox/41.0
more data_nocookie.txt | awk '{print $2}' | egrep '^1$' | wc -l

Sort data numerically according to second column

sort -nk2 data.txt > sorted_data.txt

Remove 'obvious' good browsers. This gives us a pool of user agents that can be considered "possible bots" based on the user agent alone: a rough estimate, but a valid one for a lower bound.

more  sorted_data.txt | awk -F"\t" '{print $2" "$4}' | egrep -v 'Apple|Chrome|Windows NT|Firefox|MSIE' > possible_bots.txt

Sum the pageviews of possible bots that have more than 100 pageviews

awk '$1 > 100 { s += $1 } END { print s }' possible_bots.txt
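The shell post-processing steps above (counting single-request signatures, filtering obvious browsers, summing the heavy hitters) can be sketched as a single Python pass over the same tab-separated file; the `summarize` helper and the sample lines are illustrative:

```python
import re

# Assumes the tab-separated format shown above:
# <signature> \t <request_count> \t <ip> \t <user_agent>
OBVIOUS_BROWSERS = re.compile(r"Apple|Chrome|Windows NT|Firefox|MSIE")

def summarize(lines):
    singletons = 0          # signatures with exactly 1 request (assumed true users)
    possible_bot_views = 0  # pageviews from bot-looking UAs with > 100 requests
    for line in lines:
        _sig, count, _ip, user_agent = line.rstrip("\n").split("\t", 3)
        count = int(count)
        if count == 1:
            singletons += 1
        elif count > 100 and not OBVIOUS_BROWSERS.search(user_agent):
            possible_bot_views += count
    return singletons, possible_bot_views

# Illustrative lines in the format shown above
sample = [
    "-2132421387\t1\txx.x.x.xxx\tMozilla/5.0 (Windows NT 6.1; rv:41.0) Gecko/20100101 Firefox/41.0",
    "123\t73542\txx.x.x.xxx\tany",
]
print(summarize(sample))  # (1, 73542)
```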

Daily numbers

Daily and hourly estimates are quite similar: no-cookie requests are 15% of total requests; of those, 35% have signatures that appear only once, pointing to users without cookies. This represents 5.4% of total pageviews, so at least that many are users without cookies. For user agents that are very likely bots we have about 10% of nocookie traffic, which represents 1.5% of total pageviews. So at least 1.5% of total pageviews is bot traffic. Just as in the hourly data, about 50% of nocookie traffic falls clearly into neither category (clear bots or clear users).
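The daily arithmetic can be cross-checked the same way, using the rounded shares from the text (the product of 15% and 35% comes out slightly below the 5.4% quoted, presumably because the quoted shares are rounded):

```python
# Rounded daily shares from the text above.
nocookie_share = 0.15    # nocookie requests as a share of all requests
singleton_share = 0.35   # share of nocookie signatures appearing only once
likely_bot_share = 0.10  # share of nocookie traffic with bot-like user agents

print(f"likely users without cookies: {nocookie_share * singleton_share:.2%}")
print(f"likely bots: {nocookie_share * likely_bot_share:.2%}")
```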

Worklog

  • 2015-12-04. Changed dataset to December 1st, as we have deployed recent updates to the bot regex that affect this data. Our regex now catches quite a few more bots via user agent, and thus the percentage of nocookie traffic that *seems* to come from users but is actually bots has decreased.

  • 2015-12-04. Calculated the daily numbers on the same date as we sourced the hourly data.