Analytics/Data Lake/Traffic/Pageviews/Redirects

From Wikitech

This page discusses the following question: What happens when a request goes through redirects like

/wiki/Something Notable
/wiki/Something%20Notable
/wiki/Something_Notable

or

/index.php?title=Something Notable
/index.php?title=Something%20Notable
/index.php?title=Something_Notable

And how do we handle pageview identification and counting on those requests?

Types of redirects

The examples above show two kinds of redirects, but there are others. The possible cases are:

Direct correct request

This is not a redirect, but it serves as a baseline for comparison with the other examples. The browser sends a request for, say, Something_Notable and Varnish responds with a 200. The cluster recognizes this as a pageview.

URI encodings performed by the browser

These are applied before the request is sent. For example: Something Notable becomes Something%20Notable, and "Awesome" becomes %22Awesome%22. They have no effect on the pageview computation, because both representations are supported by the PageviewDefinition UDFs, which ultimately normalize them.
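As a rough illustration of why the encoded and unencoded forms end up equivalent, here is a minimal Python sketch of percent-decoding. The helper name decode_title is hypothetical and this is not the actual PageviewDefinition UDF code; it only assumes that normalization includes standard percent-decoding.

```python
from urllib.parse import unquote


def decode_title(raw_title: str) -> str:
    """Hypothetical helper: undo browser percent-encoding of a page title."""
    return unquote(raw_title)


# The encoded and unencoded forms from the examples decode to the same string.
space_variants = {decode_title("Something Notable"), decode_title("Something%20Notable")}
quoted = decode_title("%22Awesome%22")
```

Under this assumption both spellings of the title collapse to a single value, so the encoding step cannot produce two distinct pageview entries.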

Capitalization of the first letter

Whenever a request is sent with a lower-case first letter, the response is a 301 whose target is the article with a capitalized first letter. The browser then sends a second request, this time to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so only the second request, to the correct page, is counted as a pageview.

Conversion of spaces

Conversion of spaces (%20) to underscores works the same way as first-letter capitalization. Whenever a request is sent with spaces (%20) between words, the response is a 301 whose target is the article with underscores instead of spaces (%20). The browser then sends a second request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so only the second request, to the correct page, is counted as a pageview.

Other spellings covered by a redirect page

Any other spellings (alternate spellings, misspellings, abbreviations, translations, capitalizations, plural vs. singular, etc.) for which there is a page in the corresponding project that acts as a hard redirect (its contents start with #REDIRECT [[<target>]]) are handled by Varnish or on the server side and return a 200 response with the contents of the target page, plus a small redirect note like "(Redirected from ...)". However, Varnish generates a log entry with the redirect URL (before conversion). This is the only potentially problematic scenario, because the cluster computes a pageview for the redirect page even though the contents shown to the user are those of the target page. Nevertheless, it computes only 1 pageview; there are no duplicates.

Alternate spellings NOT covered by a redirect page

If no page exists that covers the requested spelling, Varnish or the server returns a 404, so no pageview is computed for that request.
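The cases listed above can be summarized in a small Python sketch. It makes a strong simplifying assumption: a log line counts as a pageview only when its status is 200, which is just the part of the PageviewDefinition discussed on this page, not the full definition. The LogLine type, the function names, and the titles Smth_Notable and Somethign_Notabel are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class LogLine(NamedTuple):
    uri_path: str
    http_status: int


def is_pageview(line: LogLine) -> bool:
    # Simplifying assumption: only a 200 response counts; 301 and 404 do not.
    return line.http_status == 200


def count_pageviews(lines: Iterable[LogLine]) -> Dict[str, int]:
    """Count pageviews per requested path, as logged (before any redirect is resolved)."""
    counts: Dict[str, int] = {}
    for line in lines:
        if is_pageview(line):
            counts[line.uri_path] = counts.get(line.uri_path, 0) + 1
    return counts


log = [
    LogLine("/wiki/Something_Notable", 200),    # direct correct request
    LogLine("/wiki/something_Notable", 301),    # lower-case first letter
    LogLine("/wiki/Something_Notable", 200),    # browser follows the 301
    LogLine("/wiki/Something%20Notable", 301),  # spaces instead of underscores
    LogLine("/wiki/Something_Notable", 200),    # browser follows the 301
    LogLine("/wiki/Smth_Notable", 200),         # hypothetical redirect page, target contents served
    LogLine("/wiki/Somethign_Notabel", 404),    # hypothetical spelling with no page
]
counts = count_pageviews(log)
```

In this sketch the four user visits that reached content yield exactly four pageviews, with no duplicates, but one of them is attributed to the redirect page rather than to its target.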

Potential problems

Per article analyses

The only redirect scenario that can be confusing (or possibly wrong) is that of alternate spellings covered by a redirect page. These redirects do not alter global counts or per-project counts, but they do alter per-article analyses. For example, in the per-article endpoint of the Pageview API, the page "Barack_Obama":

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_Obama/daily/2016010100/2016010100

returns 26166 pageviews, whereas its redirect page "Barack_obama" (note the lower-case 'o'):

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_obama/daily/2016010100/2016010100

returns 30415 pageviews for the same period. All the users who generated these 30415 pageviews actually read the contents of Barack_Obama (capital 'O'), but we are counting them under Barack_obama (lower-case 'o'). The research paper mentioned by Aaron:

https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf

suggests that 55% of the articles in the main namespace are redirects to other pages, so this is surely not a small proportion of pageviews or articles.
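To illustrate the effect on per-article numbers, the following Python sketch folds the counts of redirect pages into their targets, using the two figures quoted above. The function fold_redirects and the redirect_targets mapping are hypothetical; where such a mapping would come from is not specified on this page.

```python
from typing import Dict


def fold_redirects(counts: Dict[str, int], redirect_targets: Dict[str, str]) -> Dict[str, int]:
    """Attribute each redirect page's pageviews to the page it redirects to."""
    folded: Dict[str, int] = {}
    for title, views in counts.items():
        # Titles without an entry in the mapping are treated as regular pages.
        target = redirect_targets.get(title, title)
        folded[target] = folded.get(target, 0) + views
    return folded


# Per-article counts quoted above for 2016-01-01.
counts = {"Barack_Obama": 26166, "Barack_obama": 30415}
# Hypothetical redirect mapping (redirect page -> target page).
redirect_targets = {"Barack_obama": "Barack_Obama"}

folded = fold_redirects(counts, redirect_targets)
```

With the redirect folded in, Barack_Obama would account for 56581 pageviews, more than double the figure reported for the target title alone.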

Possible solutions?

X-Analytics

Add a redirectedTo field to the X-Analytics header that holds the target URL of the redirect. Note: if the request to the redirect page has ?redirect=no, the redirectedTo field should be left blank. The PageviewDefinition would then take the page title from the redirectedTo field whenever it is not empty.
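A minimal Python sketch of how this proposal might look on the consuming side follows. It assumes the X-Analytics header is a semicolon-separated list of key=value pairs and that redirectedTo would hold a path such as /wiki/Barack_Obama; the field does not exist today, and both function names are hypothetical.

```python
from typing import Dict


def parse_x_analytics(header: str) -> Dict[str, str]:
    """Parse an X-Analytics value, assumed to look like 'key1=value1;key2=value2'."""
    fields: Dict[str, str] = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def pageview_title(uri_path: str, x_analytics: str) -> str:
    """Prefer the proposed redirectedTo field; fall back to the requested path."""
    redirected_to = parse_x_analytics(x_analytics).get("redirectedTo", "")
    path = redirected_to or uri_path
    prefix = "/wiki/"
    return path[len(prefix):] if path.startswith(prefix) else path


# Redirect followed: the pageview is attributed to the target page.
followed = pageview_title("/wiki/Barack_obama", "ns=0;redirectedTo=/wiki/Barack_Obama")
# Field left blank (e.g. ?redirect=no): the redirect page itself is counted.
not_followed = pageview_title("/wiki/Barack_obama", "ns=0;redirectedTo=")
```

Falling back to the requested path when the field is blank or absent keeps the current behaviour for every request that is not a followed redirect.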