Analytics/Data Lake/Traffic/Unique Devices/Automated traffic correction

From Wikitech

Since June 2020[1], the unique-devices metric computation takes advantage of the automated-traffic label that was recently added to pageview data (see Analytics/Data_Lake/Traffic/BotDetection). Better detection of bot spam and bot vandalism leads to better-quality metrics, as those "bad" or "spammy" pageviews are removed from our data. In addition, the metric is now computed on top of an intermediate dataset, Analytics/Data Lake/Traffic/Pageview_actor, which makes computations a lot faster as we no longer need to scan the ever-growing webrequest table.

  1. Starting at the beginning of the month for monthly data, and on the 24th for daily data.

Impact summary

Big Picture

TL;DR: Differences are not notable.

The graphs below present daily and monthly numbers, per domain and per project family, split by country and by domain/project family. For monthly data, the last 12 months are shown, with the last month computed with the new version. For daily data, the last 30 days are shown, with the last 13 days computed with the new version.

The only graph showing a notable difference is the per-domain monthly split by country, which presents a visible drop for the last month. Note that the relative drop is actually bigger for the month before, and that over a longer time frame (the last 3 years) this drop is not significant.

Per domain

Monthly


Daily

Per project family

Monthly


Daily

Detailed analysis

This section compares, for 2020-06-01, unique-devices computed with the original version (no traffic flagged as automated, original actor-signature) against unique-devices computed with the new data and algorithm (automated traffic removed, better actor-signature).

Note: unique-devices computation is expensive, so the analysis was done for daily data only, not monthly data.

Per domain

The original unique-devices version generated values for 80401 domain/country pairs, and the new one for 80392 (the 9 removed pairs had very small values, less than 10 unique-devices). Among the pairs found by both versions, 1535 have differing unique-devices values (1.9 percent). Of those 1535 pairs, 1001 have values of at least 1000 unique-devices (values below 1000 have proven too variable to be of strong interest), and 182 have values bigger than 1000 and show a variation of more than 0.5% between the original and new computation (182 / 80392 = 0.2%).
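The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compare_pairs`, the input shape (a mapping from `(domain, country)` to a unique-devices value), and the toy numbers are hypothetical, and "at least 1000" is interpreted here as both the original and the new value reaching the threshold, which the analysis does not specify.

```python
def compare_pairs(original, new, min_uniques=1000, min_rel_change=0.005):
    """Summarise differences between two {(domain, country): uniques} mappings.

    Hypothetical helper: thresholds mirror the ones quoted in the text
    (1000 unique-devices, 0.5% relative change).
    """
    common = original.keys() & new.keys()
    # Pairs present in both versions whose values differ.
    differing = [k for k in common if original[k] != new[k]]
    # Assumption: a pair is "large" when both values reach the threshold.
    large = [k for k in differing if min(original[k], new[k]) >= min_uniques]
    # Relative change is measured against the original value.
    notable = [
        k for k in large
        if abs(new[k] - original[k]) / original[k] > min_rel_change
    ]
    return {
        "removed": len(original.keys() - new.keys()),
        "added": len(new.keys() - original.keys()),
        "common": len(common),
        "differing": len(differing),
        "large": len(large),
        "notable": len(notable),
    }


# Toy data, not real measurements.
original = {
    ("a.example", "US"): 10000,
    ("a.example", "FR"): 2000,
    ("b.example", "US"): 500,
    ("c.example", "DE"): 5,
}
new = {
    ("a.example", "US"): 9800,
    ("a.example", "FR"): 1999,
    ("b.example", "US"): 480,
}
summary = compare_pairs(original, new)
print(summary)

# Ratios quoted in the text, recomputed from the reported counts.
share_differing = round(100 * 1535 / 80392, 1)
share_notable = round(100 * 182 / 80392, 1)
print(share_differing, share_notable)
```

The last two lines recompute the percentages from the counts reported for the per-domain dataset.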

The only change observed from the original to the new version is the expected one: a decrease of unique-devices for certain domain/country pairs, linked to the removal of automated traffic.

Interesting details:

  • The domain/country pair showing the biggest absolute change is en.wikipedia.org / US, with a loss of 167712 unique-devices. This big absolute value is actually a small proportion of the total number of unique-devices for that domain/country pair: 1.73 percent.
  • wikidata.org is a domain losing a fair number of unique-devices across countries (expected given the nature of the data).

Per project family

The original version generated values for 2443 project-family/country pairs, and the new one for 2481. The ratio of pairs showing a significant difference is a lot higher for the per-project-family dataset: 1981 pairs show a difference (81%), among which 539 have values bigger than 1000 and an absolute difference bigger than 0.5% (539 / 2443 = 22%).

Interestingly, the changes in this dataset are more complex than those in the per-domain one. In addition to the drop of unique-devices for certain pairs due to the removal of automated traffic (~10% of the significant differences), the vast majority of differences are an increase of unique-devices for project-family/country pairs. The increase is due to the change of actor-signature, the new version providing a more accurate distinction between actors (see next section for details).

Interesting details:

  • The project-family/country pair showing the biggest absolute change is wikipedia / US, with a decrease of 236312 unique-devices due to automated traffic removal. This big absolute value is actually a small proportion of the total number of unique-devices for that project-family/country pair: 0.77 percent.


Technical details

Automated traffic removal

The unique devices metric is reported with two different numbers: "uniques_offset" and "uniques_underestimate". For details, see the docs at Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#How_we_are_counting:_Plain_English. The removal of traffic labelled as "automated" has an impact on the uniques_underestimate part of the unique-devices value: actors that made a large number of requests in a short time period are flagged as "automated" and are not included in the computation of the metric.
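As a rough illustration of the idea of excluding high-rate actors before counting, here is a minimal Python sketch. It is not the actual labelling logic (which is described on the BotDetection page) nor the real metric computation: the function names, the `(actor_signature, minute)` input shape, and the `max_per_minute` threshold are all hypothetical.

```python
from collections import Counter


def flag_automated(requests, max_per_minute=100):
    """Return the actors exceeding a (hypothetical) per-minute request threshold.

    `requests` is an iterable of (actor_signature, minute_bucket) tuples,
    one tuple per request.
    """
    per_actor_minute = Counter(requests)
    return {
        actor
        for (actor, _minute), n in per_actor_minute.items()
        if n > max_per_minute
    }


def count_actors(requests, excluded=frozenset()):
    """Count distinct actors, skipping the excluded ones."""
    return len({actor for actor, _minute in requests if actor not in excluded})


# Toy data: actor-b sends 500 requests within the same minute.
requests = (
    [("actor-a", 0)] * 3
    + [("actor-b", 0)] * 500
    + [("actor-c", 1)] * 2
)
flagged = flag_automated(requests)
before = count_actors(requests)
after = count_actors(requests, flagged)
print(flagged, before, after)
```

In this toy example the count of distinct actors drops by one once the flagged actor is excluded, which is the direction of change seen in the per-domain analysis.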

Actor signature change

There are two main aspects to the change:

  • Better identification of a Wikipedia-app (relatively small impact, the number of apps being a lot smaller than the number of other devices)
  • Change of the hashing function used to identify an "actor", from a simple hash (string to integer) to an md5 hash (string to string). This is the part of the change that has the most impact. It allows better discrimination between actors by significantly reducing collisions. When the uniques_underestimate value is relatively small (less than 100k), collisions are less frequent and the quality gain is not as big. But when the project space is bigger, the gain is quite significant.
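To give a feel for why collisions matter more as the number of actors grows, the sketch below estimates how many distinct actors would be merged by collisions under a uniform-hash model. The 32-bit width for the integer hash is an assumption (the text only says "string to integer"), and the function name and the example signature string are hypothetical.

```python
import hashlib
import math


def expected_undercount(n_actors, hash_bits):
    """Expected number of actors lost to collisions when `n_actors` distinct
    signatures are hashed uniformly into 2**hash_bits possible values."""
    m = 2.0 ** hash_bits
    # Expected number of distinct hash values: m * (1 - exp(-n / m)).
    # expm1 keeps precision when n / m is tiny.
    expected_distinct = -m * math.expm1(-n_actors / m)
    return n_actors - expected_distinct


# Assumed 32-bit integer hash versus the 128-bit md5 digest.
small_32 = expected_undercount(100_000, 32)
large_32 = expected_undercount(100_000_000, 32)
large_md5 = expected_undercount(100_000_000, 128)
print(small_32, large_32, large_md5)

# md5 maps a string to a string: a 32-character hexadecimal digest.
# The signature content here is purely illustrative.
digest = hashlib.md5("example-ip|example-user-agent".encode("utf-8")).hexdigest()
print(digest)
```

Under these assumptions, about one actor is lost at 100k actors, while at 100 million actors the loss is above one million (roughly 1 percent), and with a 128-bit digest it is negligible at both scales. This is consistent with the observation that the gain is small below 100k and larger for bigger project spaces.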