This page discusses the following question: What happens when a request goes through redirects like
/wiki/Something Notable → /wiki/Something%20Notable → /wiki/Something_Notable
/index.php?title=Something Notable → /index.php?title=Something%20Notable → /index.php?title=Something_Notable
And how do we handle pageview identification and counting on those requests?
- 1 Types of redirects
- 2 Potential problems
- 3 Possible solutions?
Types of redirects
The example above shows two kinds of redirects, but there are others. Here is a list of the possible cases:
Direct correct request
This is not a redirect, but it serves as a baseline to compare against the other examples. The browser sends a request for, say, Something_Notable and Varnish responds with a 200. The cluster recognizes this as a pageview.
URI encodings performed by the browser
These are made before the request is sent. For example, Something Notable becomes Something%20Notable, and "Awesome" becomes %22Awesome%22. They have no effect on the pageview computation, because the PageviewDefinition UDFs support both representations and ultimately normalize them.
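As a minimal sketch of why both representations end up counted under the same title, the following Python snippet decodes percent-encoded titles. It is an illustration only, not the actual PageviewDefinition UDF code; the function name `decode_title` is hypothetical.

```python
from urllib.parse import unquote


def decode_title(raw_title: str) -> str:
    """Decode percent-encoded characters so both spellings compare equal.

    Hypothetical helper for illustration; not part of PageviewDefinition.
    """
    return unquote(raw_title)


# The encoded and unencoded forms decode to the same string.
print(decode_title("Something%20Notable") == decode_title("Something Notable"))
print(decode_title("%22Awesome%22"))
```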
Capitalization of the first letter
Whenever a request is sent with a lower-case first letter, the response is a 301 whose target is the article with a capitalized first letter. The browser then sends a second request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it will only count the second request to the correct page as a pageview.
Conversion of spaces
Conversion of spaces (%20) to underscores is handled the same way as first-letter capitalization. Whenever a request is sent with spaces (%20) between words, the response is a 301 whose target is the article with underscores instead of spaces (%20). The browser then sends a second request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it will only count the second request to the correct page as a pageview.
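The two 301 cases above (first-letter capitalization and space conversion) can be sketched as a single canonicalization step. This is a simplified model, not the server's actual logic. It assumes the title has already been URI-decoded, that the page exists, and that both fixes are applied in one 301 rather than two. The names `canonical_title` and `respond` are hypothetical.

```python
from typing import Optional, Tuple


def canonical_title(title: str) -> str:
    """Replace spaces with underscores and capitalize the first letter."""
    title = title.replace(" ", "_")
    return title[:1].upper() + title[1:]


def respond(title: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_target) for a request to an existing page.

    Assumption: a non-canonical title yields a single 301 to the canonical
    form; a canonical title yields a 200.
    """
    target = canonical_title(title)
    if target != title:
        return 301, target
    return 200, None


print(respond("something Notable"))
print(respond("Something_Notable"))
```

Only the second request, the one that receives the 200, would be counted as a pageview.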
Other spellings covered by a redirect page
Any other spellings (alternate spellings, misspellings, abbreviations, translations, capitalizations, plural-vs-singular, etc.) for which the corresponding project has a page that acts as a hard redirect (its contents start with #REDIRECT [[<target>]]) are handled by Varnish or on the server side. The response is a 200 with the contents of the target page and a small redirect note like "(Redirected from ...)". However, Varnish logs the request with the redirect URL (before conversion). This is the only potentially problematic scenario, because the cluster will compute a pageview for the redirect page even though the contents shown to the user are those of the target page. Nevertheless, it will compute only 1 pageview; there will be no duplicates.
Alternate spellings NOT covered by a redirect page
If no page exists that covers the requested spelling, Varnish or the server returns a 404, so no pageview will be computed for that request.
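The scenarios listed above can be summarized with a small simulation of a request log. This is a deliberately simplified sketch: it treats only the HTTP status as the pageview criterion, whereas the real PageviewDefinition applies further criteria that are omitted here. The titles `Smth_Notable` (an existing redirect page) and `Sometihng_Notable` (no page) are made up for the example.

```python
from collections import Counter


def is_pageview(status: int) -> bool:
    """Simplified criterion: only 200 responses count (301 and 404 do not)."""
    return status == 200


def count_pageviews(log):
    """Count pageviews per logged title from (title, status) pairs."""
    return Counter(title for title, status in log if is_pageview(status))


log = [
    ("Something_Notable", 200),   # direct correct request
    ("something_Notable", 301),   # lower-case first letter
    ("Something_Notable", 200),   # follow-up request after the 301
    ("Something Notable", 301),   # spaces instead of underscores
    ("Something_Notable", 200),   # follow-up request after the 301
    ("Smth_Notable", 200),        # redirect page: logged under its own title
    ("Sometihng_Notable", 404),   # spelling not covered by any page
]

print(dict(count_pageviews(log)))
```

Each user visit produces exactly one counted pageview, but the visit through the redirect page is attributed to the redirect title rather than to the target.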
Potential problems
Per-article analyses
The only redirect scenario that can be confusing (or may be wrong) is the alternate spellings covered by a redirect page. These do not alter global counts or per-project counts, but they do alter per-article analyses. For example, in the per-article endpoint of the Pageview API, the page "Barack_Obama":
returns 26166 pageviews, whereas its redirect page "Barack_obama" (note the lower-case 'o'):
returns 30415 pageviews for the same period. All the users who generated these 30415 pageviews actually read the contents of Barack_Obama (capital 'O'), but we are counting them under Barack_obama (lower-case 'o'). The research paper mentioned by Aaron:
suggests that 55% of the articles in the main namespace are redirects to other pages, so this is surely not a small proportion of pageviews or articles.
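To make the size of the discrepancy concrete, the sketch below folds the counts of a redirect page into its target, using the two figures quoted above. It only illustrates the effect on per-article numbers. The redirect mapping is supplied by hand, the function name `fold_redirects` is hypothetical, and single-hop redirects are assumed.

```python
def fold_redirects(counts, redirects):
    """Attribute each redirect page's views to its target page.

    counts:    {title: pageviews} as currently computed per title
    redirects: {redirect_title: target_title}, assumed single-hop
    """
    folded = {}
    for title, views in counts.items():
        target = redirects.get(title, title)
        folded[target] = folded.get(target, 0) + views
    return folded


counts = {"Barack_Obama": 26166, "Barack_obama": 30415}
redirects = {"Barack_obama": "Barack_Obama"}

print(fold_redirects(counts, redirects))
```

With these numbers, more than half of the views of the article's contents are currently attributed to the redirect title.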
Possible solutions?
Add a redirectedTo field to the x-analytics header that holds the target URL of the redirect, and let the PageviewDefinition take the page title from redirectedTo whenever the field is not empty. Note: if the request to the redirect page has ?redirect=no, the redirectedTo field should be left blank.
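A minimal sketch of how this proposal could work end to end is shown below. It rests on several assumptions. The header is treated as a list of `key=value` pairs separated by `;`. The redirectedTo value is assumed to be a URL path such as `/wiki/Barack_Obama`, and the title is assumed to be its last path segment, which would not hold for titles containing slashes. The `ns=0` field in the usage lines is illustrative only. All function names are hypothetical, and none of this is existing PageviewDefinition code.

```python
from urllib.parse import parse_qs, urlsplit


def build_redirected_to(target_url: str, request_query: str) -> str:
    """Value the server side would set; blank when the request has ?redirect=no."""
    if parse_qs(request_query).get("redirect") == ["no"]:
        return ""
    return target_url


def parse_x_analytics(header: str) -> dict:
    """Parse 'key=value;key=value' into a dict (assumed header format)."""
    fields = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def title_from_url(url: str) -> str:
    """Take the last path segment as the title (assumes no '/' in titles)."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


def pageview_title(requested_title: str, x_analytics: str) -> str:
    """Prefer the redirect target's title when redirectedTo is not empty."""
    redirected_to = parse_x_analytics(x_analytics).get("redirectedTo", "")
    return title_from_url(redirected_to) if redirected_to else requested_title


followed = "ns=0;redirectedTo=" + build_redirected_to("/wiki/Barack_Obama", "")
not_followed = "ns=0;redirectedTo=" + build_redirected_to("/wiki/Barack_Obama", "redirect=no")

print(pageview_title("Barack_obama", followed))
print(pageview_title("Barack_obama", not_followed))
```

In the first case the pageview would be attributed to the target page. In the second, the user actually saw the redirect page itself, so the requested title is kept.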