
Data Platform/AQS/Mediarequests/Limitations

From Wikitech

Data limitations

  • Splitting and filtering by referrer is only available for data from May 2019 onward. Before that, referrer is only split into internal, external, and unknown.
  • Mediarequest data begins on 1 January 2015.
  • Splitting and filtering by agent type (user, spider) is only available for data from May 2019 onward.
  • About 0.7% of mediarequests are prefetches coming from Media Viewer (more details in Analytics/AQS/Media_metrics).

Issues with file paths

Because these metrics report usage of static files, there are challenges in how to store file paths and how to query the data associated with them.

File: URLs vs upload.wikimedia.org URIs

We obtain these numbers by aggregating web requests for files hosted on upload.wikimedia.org. Each of these files has a unique URI path, like this:

https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg

The issue here is that this path is different from the one a user would see in their browser. In this case, a user will probably know this image file as:

https://commons.wikimedia.org/wiki/File:Flag_of_Argentina.svg

There are a few problems with this:

  • Every upload.wikimedia.org path contains a string like 1/1a in the example above. These are the first character, and then the first two characters, of the MD5 hash string of the file name. While the pair is easy to compute, it is far from easy for a user to work out this path from the user-friendly one.
  • The wiki project part of the upload.wikimedia.org path (wikipedia/commons in the example above) doesn't necessarily match the wiki project that the file was uploaded from. A user might think that because they uploaded a file to English Wikipedia, the path will be prefixed by /wikipedia/en, but for a long time all uploaded files, regardless of project family or language, were stored under wikipedia/commons, so this part of the path is not a reliable indicator.
  • Additionally, File: URLs are mirrored across wikis, as illustrated by these three examples:

https://es.wikipedia.org/wiki/Archivo:Flag_of_Argentina.svg

https://en.wikisource.org/wiki/File:Flag_of_Argentina.svg

https://ja.wikipedia.org/wiki/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Flag_of_Argentina.svg

All of these links point to the same upload.wikimedia.org file displayed above.
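
To illustrate why the mirrored File: URLs above all resolve to the same file, here is a minimal Python sketch that strips the wiki host and the localized namespace prefix from a page URL. The function name file_name_from_page_url is hypothetical, and the sketch assumes the page path always has the form /wiki/<namespace>:<file name>.

```python
from urllib.parse import unquote, urlsplit


def file_name_from_page_url(page_url: str) -> str:
    """Return the file name part of a File: page URL (hypothetical helper).

    Assumes the path looks like /wiki/<namespace>:<file name>.
    """
    path = urlsplit(page_url).path
    # Drop the /wiki/ prefix and decode percent-encoded characters,
    # including a percent-encoded localized namespace.
    title = unquote(path.split("/wiki/", 1)[1])
    # Everything before the first colon is the namespace (File:, Archivo:, ...).
    _namespace, _, name = title.partition(":")
    return name


urls = [
    "https://es.wikipedia.org/wiki/Archivo:Flag_of_Argentina.svg",
    "https://en.wikisource.org/wiki/File:Flag_of_Argentina.svg",
    "https://ja.wikipedia.org/wiki/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Flag_of_Argentina.svg",
]
names = {file_name_from_page_url(u) for u in urls}
print(names)
```

All three URLs reduce to a single file name, which is the only part they share with the upload.wikimedia.org path.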

These complications make it difficult for us to use the user-friendly path format as the way to query mediarequests numbers. There is a task in Phabricator discussing possible alternatives to using upload.wikimedia.org paths, as well as a code review in Gerrit to make querying individual files a bit more flexible, but as of April 2020 only upload.wikimedia.org paths are allowed in the API endpoint.
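
Since only upload.wikimedia.org paths are accepted, a client may need to derive one from a file name. The following Python sketch follows the hashing described above. It assumes the hash is taken over the UTF-8 bytes of the file name with spaces replaced by underscores, and it takes the project prefix as a parameter because, as noted, that part cannot be inferred reliably. The names hash_dirs and upload_path are hypothetical.

```python
import hashlib


def hash_dirs(file_name: str) -> str:
    """Return the two hash directories, e.g. 'x/xy', for a file name.

    Assumption: the MD5 is computed over the UTF-8 file name with
    spaces replaced by underscores.
    """
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First hex character, then the first two hex characters.
    return f"{digest[0]}/{digest[:2]}"


def upload_path(file_name: str, project_prefix: str = "wikipedia/commons") -> str:
    """Build an upload.wikimedia.org style path (unencoded).

    project_prefix must be supplied by the caller; it is not derivable
    from the wiki the file was uploaded from.
    """
    name = file_name.replace(" ", "_")
    return f"/{project_prefix}/{hash_dirs(name)}/{name}"


print(upload_path("Flag_of_Argentina.svg"))
```

If the assumptions hold, the printed path for Flag_of_Argentina.svg should contain the same hash directories as the example URL earlier in this section.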

Varying URL encodings

For reference when debugging issues, here is a list of the different ways in which URI paths are encoded, using a tricky example to illustrate them:

Browser path

https://bar.wikipedia.org/wiki/Datei:Astérix_%26_Obélix_Bruxelles_rue_de_la_Buanderie.jpg

A photo uploaded to Bavarian Wikipedia. The wiki page is prefixed with the namespace Datei:, which is the equivalent of File: in barwiki. The file name contains two diacritics (in Astérix and Obélix), which are not URL encoded, and an ampersand (& sign), which is URL encoded as %26.

Path of file as queried to upload.wikimedia.org

https://upload.wikimedia.org/wikipedia/bar/9/93/Astérix_%26_Obélix_Bruxelles_rue_de_la_Buanderie.jpg

By default, wikis point to the file using the same name format as the browser path above, but upload.wikimedia.org also accepts variations with the diacritics encoded or with everything unencoded.

Path of file stored in the mediarequest dataset

/wikipedia/bar/9/93/Ast%C3%A9rix_&_Ob%C3%A9lix_Bruxelles_rue_de_la_Buanderie.jpg

This is the path as it comes from Varnish, stored first in the Webrequest dataset and then in the mediarequest dataset. Here the diacritics are encoded, but not the ampersand. The rule here is: diacritics and non-Roman characters are encoded, but the following characters aren't: ($)=`%*!&+:?"\';@^
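
When comparing a file name against the dataset, it can help to reproduce this stored form. The Python sketch below approximates the rule above with urllib.parse.quote, passing the listed characters (plus the / separator, which is also unencoded in the example) as safe. It is an approximation for debugging, not the logic Varnish actually runs, and the name stored_form is hypothetical.

```python
from urllib.parse import quote

# Characters the rule above lists as left unencoded.
UNENCODED = "($)=`%*!&+:?\"\\';@^"


def stored_form(decoded_path: str) -> str:
    """Approximate the path as stored in the mediarequest dataset.

    Non-ASCII characters are percent-encoded as UTF-8; the listed
    characters and the "/" separator are left untouched.
    """
    return quote(decoded_path, safe="/" + UNENCODED)


print(stored_form(
    "/wikipedia/bar/9/93/Astérix_&_Obélix_Bruxelles_rue_de_la_Buanderie.jpg"
))
```

For the example file this yields the stored path shown above, with the diacritics encoded and the ampersand left as is.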

Path of file as queried for statistics

%2Fwikipedia%2Fbar%2F9%2F93%2FAst%C3%A9rix_%26_Ob%C3%A9lix_Bruxelles_rue_de_la_Buanderie.jpg

Using the media requests endpoint requires fully URL-encoding the path, including the slashes, so that it can be contained in the AQS URL. The decoded version of this path should always match the browser path described above.
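
A minimal Python sketch of this full encoding follows. It first decodes the input so that either the stored form or an unencoded path can be passed in, then encodes every reserved character, including the slashes. The name query_form is hypothetical, and the decode step assumes the file name does not contain a literal percent sign.

```python
from urllib.parse import quote, unquote


def query_form(path: str) -> str:
    """Fully URL-encode an upload path for use inside the AQS URL.

    Decoding first makes the function accept partially encoded input;
    this assumes no literal "%" occurs in the file name.
    """
    return quote(unquote(path), safe="")


stored = "/wikipedia/bar/9/93/Ast%C3%A9rix_&_Ob%C3%A9lix_Bruxelles_rue_de_la_Buanderie.jpg"
print(query_form(stored))
```

For the stored path of the example file, this produces the fully encoded string shown above.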