Data Platform/Data Lake/Traffic/Pageviews/Redirects
This page discusses the following question: What happens when a request goes through redirects like
/wiki/Something Notable → /wiki/Something%20Notable → /wiki/Something_Notable
or
/index.php?title=Something Notable → /index.php?title=Something%20Notable → /index.php?title=Something_Notable
And how do we handle pageview identification and counting on those requests?
Types of redirects
The example above shows two kinds of redirects, but there are others. Here is a list of possible redirects (the pageview definition currently treats 301s, 302s, and 307s as valid redirects):
Direct correct request
Strictly speaking this is not a redirect, but it serves as a baseline for comparison with the other examples. The browser sends a request for, say, Something_Notable and the caches respond with a 200. The Cluster recognizes this as a Pageview.
In some cases the user's cache has the latest version, so our caches will respond with a 304 response (no change).
URI encodings performed by the browser
Our caching layer tries to establish a canonical URL by automatically decoding and replacing some characters. Full details are available at URL_path_normalization. For example, spaces are transformed to underscores: if a request for Something Notable comes in, the cache layer responds with a 301 that redirects to Something_Notable (same with "Awesome" to %22Awesome%22).
The webrequest records sent to Kafka by the caches include both the 301 and the ultimate 200 responses. The 200s will be identified as pageviews and the 301s will show up as "redirect to pageview" records. See isRedirectToPageview in the pageview definition and how it's used by the job that refines webrequest records.
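The split between pageviews and "redirect to pageview" records can be illustrated with a minimal sketch. This is not the actual PageviewDefinition code; the `WebRequest` class, the `classify` function, and the `/wiki/` path check are hypothetical simplifications, and treating 304 like 200 is an assumption based on the description above.

```python
from dataclasses import dataclass

# Status codes taken from the text above; the real definition checks much more
# (project, content type, user agent, and so on).
PAGEVIEW_STATUSES = {"200", "304"}
REDIRECT_STATUSES = {"301", "302", "307"}


@dataclass
class WebRequest:
    """Hypothetical, heavily reduced webrequest record."""
    uri_path: str
    http_status: str


def looks_like_page_path(uri_path: str) -> bool:
    # Assumption: only /wiki/<title> paths are considered in this sketch.
    return uri_path.startswith("/wiki/") and len(uri_path) > len("/wiki/")


def classify(record: WebRequest) -> str:
    """Return 'pageview', 'redirect_to_pageview', or 'other'."""
    if not looks_like_page_path(record.uri_path):
        return "other"
    if record.http_status in PAGEVIEW_STATUSES:
        return "pageview"
    if record.http_status in REDIRECT_STATUSES:
        return "redirect_to_pageview"
    return "other"


# The two records produced by a request for "Something Notable":
records = [
    WebRequest("/wiki/Something Notable", "301"),
    WebRequest("/wiki/Something_Notable", "200"),
]
labels = [classify(r) for r in records]
```

Only the second record is labelled as a pageview, so the redirect does not inflate the count.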
Capitalization of the first letter
For the same reasons as above, and again handled by the caches and URI normalization, a request sent with a lower-case first letter gets a 301 whose target is the title with a capitalized first letter. The browser then sends a second request to the new target, which ends in a 200 if that page exists, or in a 404 otherwise. The PageviewDefinition identifies the 301 requests as "redirect to pageview" records, and will only count the second request to the correct page as a pageview.
Conversion of spaces
Same as above; see "URI encodings performed by the browser".
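The two normalizations described above (spaces to underscores, capitalized first letter) can be sketched as follows. This is only an illustration under simplifying assumptions: the real rules are those documented at URL_path_normalization, the function names `canonical_title` and `respond` are hypothetical, paths are assumed to be already percent-decoded, and the page is assumed to exist.

```python
from typing import Optional, Tuple


def canonical_title(title: str) -> str:
    """Apply the two rules from the text: spaces -> underscores, upper-case first letter."""
    title = title.replace(" ", "_")
    return title[:1].upper() + title[1:]


def respond(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_target) for a decoded /wiki/<title> path.

    Assumption: the page exists, so a canonical path gets a 200.
    """
    prefix = "/wiki/"
    title = path[len(prefix):]
    canonical = canonical_title(title)
    if canonical != title:
        # Non-canonical form: answer with a 301 to the canonical URL.
        return 301, prefix + canonical
    return 200, None


print(respond("/wiki/something Notable"))  # (301, '/wiki/Something_Notable')
print(respond("/wiki/Something_Notable"))  # (200, None)
```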
Other titles covered by a redirect page
Other forms of a title, such as alternate spellings, misspellings, abbreviations, translations, capitalizations, and plural-vs-singular variants, can have a corresponding page in the project that acts as a hard redirect (its contents start with #REDIRECT [[<target>]]). Requests to these titles are handled by MediaWiki as follows (see MW code).
MediaWiki will find the target of the redirect page, resolving redirect chains and generally trying to do the right thing, and ultimately render the text of the intended target of the redirect, with a small prefix note like "(Redirected from ...)". This will be cached in Varnish as the correct 200 response under the title of the redirect page. Therefore the webrequest record generated will be a 200 with the redirect title, and all downstream pageview datasets will use that.
To account for this, a user of the pageviews tool can select "include redirects". This finds all hard redirects to the page being displayed and adds their view counts to that page's count.
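The "include redirects" idea amounts to summing the counts recorded under the target title and under every redirect title that points to it. The sketch below is not the pageviews tool's code; the function name, the data structures, and all counts are made up for illustration.

```python
from typing import Dict, List


def views_including_redirects(
    target: str,
    views_by_title: Dict[str, int],
    redirects_to: Dict[str, List[str]],
) -> int:
    """Sum views of the target page and of all hard redirects pointing to it."""
    titles = [target] + redirects_to.get(target, [])
    # Titles with no recorded views contribute 0.
    return sum(views_by_title.get(title, 0) for title in titles)


# Illustrative data only: counts and redirect titles are invented.
views_by_title = {"Something_Notable": 100, "Something_notable": 40, "Smth_Notable": 2}
redirects_to = {"Something_Notable": ["Something_notable", "Smth_Notable"]}

total = views_including_redirects("Something_Notable", views_by_title, redirects_to)
print(total)  # 142
```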
Alternate spellings NOT covered by a redirect page
If no page exists that covers the requested spelling, Varnish or the server returns a 404, so no pageview will be counted for that request.
Potential problems
Confusion about client-side instrumentation and the correct title to use
When an instrument sends back a full or hashed location.pathname or location.search, given all the complexity above, the question is: what would the uri_path or uri_query be in the corresponding webrequest record? The answer is that, in all cases, it seems the caches will record a 200 or 304 status response with the same title that would be available to the browser from the location object. Client-side JavaScript will update the URI to reflect this.
However, in some cases the client cannot run JavaScript. It seems that in those cases the title would not match the corresponding webrequest (but we would obviously not see JS client-side instrumentation from those clients either). It may be possible to dig further into this by instrumenting redirect responses with beacon pixels and looking for this case specifically.
Per article analyses
The only redirect scenarios that can be confusing (or may be wrong) are the alternate spellings covered by a redirect page. They do not alter global counts, or counts per project, but they alter per article analyses. For example, in the per-article endpoint of the Pageview API, the page "Barack_Obama":
returns 26166 pageviews, whereas its redirect page "Barack_obama" (note the lower-case 'o'):
returns 30415 pageviews for the same period. All the users who generated these 30415 pageviews actually read the contents of Barack_Obama with capital 'O', but we're counting them under Barack_obama (lower-case 'o'). The research paper mentioned by Aaron:
https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
suggests that 55% of the articles in the main namespace are redirects to other pages, so this is surely not a small proportion of pageviews or articles.
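To compare a page with one of its redirect titles, both can be queried through the per-article endpoint of the Pageview API. The sketch below only builds the request URLs; the project (en.wikipedia), the access and agent filters, and the date range are placeholders chosen for illustration, not the parameters behind the counts quoted above.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(
    article: str,
    start: str,
    end: str,
    project: str = "en.wikipedia",   # placeholder project
    access: str = "all-access",
    agent: str = "user",
    granularity: str = "daily",
) -> str:
    """Build a per-article Pageview API URL; start/end are YYYYMMDD strings."""
    # The title is percent-encoded so characters such as "/" stay inside one path segment.
    parts = [BASE, project, access, agent, quote(article, safe=""), granularity, start, end]
    return "/".join(parts)


# Placeholder dates; one URL for the page and one for its redirect title.
for title in ("Barack_Obama", "Barack_obama"):
    print(per_article_url(title, "20160101", "20160131"))
```

The two responses are reported separately, which is exactly why a per-article analysis has to decide whether to merge them.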
Possible solutions?
X-Analytics
Add a redirectedTo field to the X-Analytics header that holds the target URL of the redirect, and let the PageviewDefinition take the page title from the redirectedTo field whenever it is not empty. Note: if the request to the redirect page has ?redirect=no, the redirectedTo field should be left blank.
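A minimal sketch of how this proposal could work on the consuming side, assuming the X-Analytics header keeps its semicolon-separated key=value form. The redirectedTo field does not exist today; it is the proposal above. For simplicity the sketch assumes the field carries the target title rather than a full URL, and the function names are hypothetical.

```python
from typing import Dict


def parse_x_analytics(header: str) -> Dict[str, str]:
    """Parse 'key1=value1;key2=value2' into a dict, skipping malformed pairs."""
    fields = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def title_for_pageview(uri_title: str, x_analytics: str) -> str:
    """Prefer the proposed redirectedTo field; fall back to the requested title."""
    redirected_to = parse_x_analytics(x_analytics).get("redirectedTo", "")
    # Empty or missing (e.g. ?redirect=no): keep the title from the URI.
    return redirected_to or uri_title


print(title_for_pageview("Barack_obama", "ns=0;redirectedTo=Barack_Obama"))  # Barack_Obama
print(title_for_pageview("Barack_obama", "ns=0"))                            # Barack_obama
```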