Analytics/Data Lake/Traffic/BotDetection

From Wikitech

Why do we need more sophisticated bot detection?

Wikipedia's content is read by humans and also by automated agents: scripts with varying levels of sophistication. These automated scripts (normally called 'bots') can be as complex as the Google crawler, which "reads" Wikipedia and indexes all of it, or as simple as a small script someone wrote to scrape a couple of pages of interest. We also see bot vandalism: scripts that, for example, try to push pages of a sexual nature onto the most-read list of a smaller wiki. Until recently we only labeled as bot traffic those requests in which bots self-identify as such via the user agent (one of the HTTP headers), for example by including the word 'bot'; this traffic we labeled as 'spider'. That left us with a significant amount of traffic that, while "automated" in nature, was not marked as such. The biggest problem with mislabeling this traffic is not the overall effect on the pageview metric but rather the effect on any kind of top-pageview list. For example, in a recent event the United States Senate page appeared (in the midst of COVID-19) as the most visited page on English Wikipedia. For a couple of years now the community has been applying filtering rules to any "Top X" list that we compile[1], and we have incorporated some of these filters into our automated-traffic detection.

We feel it is important to mention that we are not aiming to remove every single instance of bot traffic, but rather to remove the most egregious cases of automated traffic that does not self-identify as such.

From our analysis (detailed below), overall user pageviews drop by about 5% when we mark 'automated' traffic as such. This number gets as high as 8% in some instances but mostly hovers around 5.5%. The effect is not equally distributed across sites.

Effect of automated-traffic detection on Pageview metric - March 2020

The automated-traffic detection method changes the agent_type field of some pageviews from user to automated. Therefore the number of user pageviews will decrease when the mechanism is deployed.

Effect on most-viewed pages lists (top)

The bot detection changes are most visible in any list that compiles top pageviews (top pageviews per project, or per project and country). There are two reasons why the bot detection code impacts these lists significantly: 'bot vandalism' and 'bot spam'. 'Bot vandals' are bots whose only goal seems to be adding obscene, sexual or political content to the top pageview list for a given project. We had a recent instance of bot vandalism on Hungarian Wikipedia where a significant percentage of the pages on the top pageview list were just bogus titles. We use data such as that from this event to verify the accuracy of our removal; in this particular incident it was very effective. There have been other incidents with a smaller pageview volume, where it is harder to detect that the traffic might be automated.

The second most common effect we see on top pageview lists is 'bot spam'. Some bots request a given page over and over with unknown intent; in some instances it seems the bot is trying to manipulate one of Wikipedia's top pageview lists to gain popularity for a topic (see, for example, a recent case on German Wikipedia). Normally the bot spam is temporary and the page soon disappears from the top pageview list. However, some pages have seen sustained automated traffic for years; the Darth Vader page on the English Wikipedia top list is a good example.

Global Impact - All Wikimedia projects

Over the month of March 2020, for all projects, the number of pageviews by agent_type is as follows:

agent_type   sum(view_count)   %
user         16,202,727,427    71.55%
spider        5,188,363,262    22.91%
automated     1,253,707,204     5.54%
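The percentage column is simply each agent_type's share of the monthly total; a quick sanity check of the figures above:

```python
# View counts by agent_type for March 2020, from the table above
counts = {"user": 16202727427, "spider": 5188363262, "automated": 1253707204}
total = sum(counts.values())
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(shares)  # {'user': 71.55, 'spider': 22.91, 'automated': 5.54}
```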

Of the 5.54% of traffic labelled as automated, 84% is desktop, 16% is mobile web, and less than half a percent is from the mobile app. This is consistent with the heuristics the community has been using to discard automated traffic when manually curating top pageview lists: for a few years now, pages with a very high percentage of desktop-only traffic have no longer been present on those lists.

The graphs below show that the volume of traffic labeled as automated is stable over time, including when broken down by access method.


Totals

Pageviews for all wikimedia projects by agent-type in March 2020
Pageviews for all wikimedia projects by agent-type (ratios) in March 2020


By access-method

Desktop pageviews for all wikimedia projects by agent-type in March 2020
Mobile-web Pageviews for all wikimedia projects by agent-type in March 2020
Mobile-app pageviews for all wikimedia projects by agent-type in March 2020

Impact per project

The impact per project depends on the size of the project.

  • On big projects the volume of traffic flagged as automated is relatively regular, similar to the global picture (see graphs below).
  • On smaller projects, however, the amount of so-called automated traffic is a lot less regular, with spikes or plateau periods. It is very interesting to notice that removing the automated traffic from the user bucket makes the new user traffic not only somewhat smaller but also, in many cases, a lot more stable (a better split between signal and noise).

The number associated with each project is its rank in terms of number of pageviews for the month of March 2020, user, spider and automated included.


Top 10 projects by number of views

en.wikipedia (1)

Pageviews for en.wikipedia project by agent-type in March 2020
Pageviews for en.wikipedia project by agent-type (ratio) in March 2020


es.wikipedia (2)

Pageviews for es.wikipedia project by agent-type in March 2020
Pageviews for es.wikipedia project by agent-type (ratio) in March 2020


ja.wikipedia (3)

Pageviews for ja.wikipedia project by agent-type in March 2020
Pageviews for ja.wikipedia project by agent-type (ratio) in March 2020


de.wikipedia (4)

Pageviews for de.wikipedia project by agent-type in March 2020
Pageviews for de.wikipedia project by agent-type (ratio) in March 2020


commons.wikimedia (5)

Pageviews for commons.wikimedia project by agent-type in March 2020
Pageviews for commons.wikimedia project by agent-type (ratio) in March 2020


ru.wikipedia (6)

Pageviews for ru.wikipedia project by agent-type in March 2020
Pageviews for ru.wikipedia project by agent-type (ratio) in March 2020


fr.wikipedia (7)

Pageviews for fr.wikipedia project by agent-type in March 2020
Pageviews for fr.wikipedia project by agent-type (ratio) in March 2020


it.wikipedia (8)

Pageviews for it.wikipedia project by agent-type in March 2020
Pageviews for it.wikipedia project by agent-type (ratio) in March 2020


zh.wikipedia (9)

Pageviews for zh.wikipedia project by agent-type in March 2020
Pageviews for zh.wikipedia project by agent-type (ratio) in March 2020


pt.wikipedia (10)

Pageviews for pt.wikipedia project by agent-type in March 2020
Pageviews for pt.wikipedia project by agent-type (ratio) in March 2020


5 random smaller projects

gl.wikipedia (80)

Pageviews for gl.wikipedia project by agent-type in March 2020
Pageviews for gl.wikipedia project by agent-type (ratio) in March 2020


en.wikivoyage (81)

Pageviews for en.wikivoyage project by agent-type in March 2020
Pageviews for en.wikivoyage project by agent-type (ratio) in March 2020


wikisource (158)

Pageviews for wikisource project by agent-type in March 2020
Pageviews for wikisource project by agent-type (ratio) in March 2020


de.wikiquote (259)

Pageviews for de.wikiquote project by agent-type in March 2020
Pageviews for de.wikiquote project by agent-type (ratio) in March 2020


el.wikiquote (473)

Pageviews for el.wikiquote project by agent-type in March 2020
Pageviews for el.wikiquote project by agent-type (ratio) in March 2020

Code

Because the detection of automated traffic happens after pageviews are refined, the 'automated' marker on the agent_type column is not present on the webrequest table; rather, the 'automated' marker is present on the pageview_hourly table. So records initially marked with agent_type='user' on webrequest might be marked at a later time with agent_type='automated'. In order to know whether a particular set of requests from a given actor in webrequest is marked as 'automated', it is necessary to calculate the actor signature. Predictions per actor signature are calculated daily. Sample code to calculate actor signatures looks as follows:

```
-- Example 1: I am wondering about this IP and whether its traffic might be of automated nature
use wmf;
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION get_actor_signature AS 'org.wikimedia.analytics.refinery.hive.GetActorSignatureUDF';
with questionable_requests as (
    select
        distinct get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature
    from webrequest
    where
        year=2020 and month=4 and day=20 and hour=1
        and is_pageview=1
        and pageview_info['project']='es.wikipedia'
        and agent_type='user'
        and ip='some'  -- placeholder: the IP you are investigating
    limit 100
)
select
    label,
    label_reason,  -- this field explains why the label was applied
    AL.actor_signature
from questionable_requests QR join predictions.actor_label_hourly AL
    on (AL.actor_signature = QR.actor_signature)
where
    year=2020 and month=4 and day=20 and hour=1;

-- Example 2: for an hour of webrequest, select just the traffic that is not labeled as automated
use wmf;
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION get_actor_signature AS 'org.wikimedia.analytics.refinery.hive.GetActorSignatureUDF';
with user_traffic as (
    select actor_signature
    from predictions.actor_label_hourly
    where year=2020 and month=4 and day=20 and hour=1
        and label='user'
)
select
    pageview_info['page_title'],
    get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature
from webrequest W
    join user_traffic UT on (UT.actor_signature = get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map))
where
    year=2020 and month=4 and day=20 and hour=1
    and is_pageview=1
    and pageview_info['project']='es.wikipedia'
    and agent_type='user';

```
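For intuition, an actor signature is essentially a stable hash over the request fields that identify an 'actor'. The Python sketch below illustrates the idea only: the field list mirrors the UDF's arguments, but the concatenation and hashing scheme here are assumptions, not the actual GetActorSignatureUDF implementation.

```python
import hashlib

def actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map):
    """Illustrative only: combine the identifying request fields and hash them.
    The real GetActorSignatureUDF may normalize and combine fields differently."""
    # Serialize the x_analytics map deterministically so equal maps hash equally
    xa = ";".join(f"{k}={v}" for k, v in sorted((x_analytics_map or {}).items()))
    raw = "|".join([ip, user_agent, accept_language, uri_host, uri_query, xa])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

Requests that share the same identifying fields map to the same signature, which is what allows the hourly labels in predictions.actor_label_hourly to be joined back to webrequest rows.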

  1. https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions