- Splitting and filtering by referrer is only available for data from May 2019 onward. Before that, referrer is only split into internal, external, and unknown.
- Mediarequest data begins on 1 January 2015.
- Splitting and filtering by agent type (user, spider) is only available for data from May 2019 onward.
- About 0.7% of mediarequests are prefetches coming from Media Viewer (more details in Analytics/AQS/Media_metrics).
Issues with file paths
Because these metrics describe the usage of static files, there are challenges in how to store file paths and how to query the data associated with them.
File: URLs vs upload.wikimedia.org URIs
We obtain these numbers by aggregating web requests for files hosted on upload.wikimedia.org. Each of these files has a unique URI path, like this:
The issue here is that this path is different from the one a user would see in their browser. In this case, a user will probably know this image file as:
There are a few problems with this:
- Every upload.wikimedia.org path has a string like 1/1a in the example above. This is the first character, followed by the first two characters, of the MD5 hash of the file name. While the pair is easy to compute, it is far from easy for a user to derive this path from the user-friendly one.
- The wiki project part of the upload.wikimedia.org path (wikipedia/commons in the example above) doesn't necessarily match the wiki project that the file was uploaded from. A user might expect that a file they uploaded to English Wikipedia will be prefixed by /wikipedia/en, but for a long time all uploaded files, regardless of project family or language, were stored under wikipedia/commons, so this part of the path is not a reliable indicator.
- Additionally, File: URLs are mirrored across wikis, as illustrated by these three examples:
All of these links point to the same upload.wikimedia.org file displayed above.
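The hash-derived directory pair described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the code the pipeline uses: the function names hash_dirs and upload_path are hypothetical, the default project prefix is only the one mentioned in the text, and the sketch assumes the file name is already in its stored form apart from spaces, which it replaces with underscores.

```python
import hashlib


def hash_dirs(file_name: str) -> str:
    """Return the '<x>/<xy>' directory pair derived from the MD5 of a file name."""
    # Assumption: stored file names use underscores instead of spaces.
    normalized = file_name.replace(" ", "_")
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # First hex character, then the first two hex characters.
    return f"{digest[0]}/{digest[:2]}"


def upload_path(file_name: str, project: str = "wikipedia/commons") -> str:
    """Build a candidate upload.wikimedia.org path (hypothetical helper)."""
    normalized = file_name.replace(" ", "_")
    # The project prefix is a guess; as noted above, it is not a reliable
    # indicator of where the file was uploaded from.
    return f"/{project}/{hash_dirs(normalized)}/{normalized}"


if __name__ == "__main__":
    print(upload_path("Example file.jpg"))
```

Going from a file name to the pair is cheap, but the project prefix still has to be guessed, which is one reason the user-friendly form cannot simply be translated into a query path.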
These complications make it difficult for us to use the user-friendly path format as the way to query mediarequests numbers. There is a task in Phabricator discussing possible alternatives to using upload.wikimedia.org paths, as well as a code review in Gerrit to make querying individual files a bit more flexible, but as of April 2020 only upload.wikimedia.org paths are allowed in the API endpoint.
Varying URL encodings
For reference when debugging issues, here is a list of the different ways in which URI paths are encoded, using a tricky example to illustrate them:
A photo uploaded to Bavarian Wikipedia. The wiki page is prefixed with the namespace Datei:, which is the equivalent of File: in barwiki. The file name contains two diacritics (in Astérix and Obélix), which are not URL encoded, and an ampersand (& sign), which is URL encoded as %26.
Path of file as queried to upload.wikimedia.org
By default, wikis will point to the file name in the same format as the browser path above, but will also accept variations with the diacritics encoded or with everything unencoded.
Path of file stored in the mediarequest dataset
This is the path as it comes from Varnish, stored in the Webrequest dataset, and then in the media request dataset. Here the diacritics are encoded, but not the ampersand. The rule here is: diacritics and non-roman characters are encoded, but the following characters aren't:
Path of file as queried for statistics
Using the media requests endpoint requires fully URL-encoding the path so that it can be embedded in the AQS URL. The decoded version of this path should always match the browser path described above.
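As a rough sketch of that encoding step, Python's standard urllib.parse.quote can percent-encode every reserved character, including slashes, when called with an empty safe set. The function name encode_for_aqs and the sample path are hypothetical and used only for illustration; the sketch covers the encoding of the path segment alone, not the rest of the AQS URL.

```python
from urllib.parse import quote, unquote


def encode_for_aqs(path: str) -> str:
    """Percent-encode a path so it fits in a single URL segment."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    # Any "%" already present in the input is encoded as "%25".
    return quote(path, safe="")


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    sample = "/wikipedia/commons/x/xy/Some_fïle_%26_name.jpg"
    encoded = encode_for_aqs(sample)
    print(encoded)
    # Decoding once should give back exactly the path that was encoded.
    print(unquote(encoded) == sample)
```

Because percent signs in the input are themselves encoded, a path that already contains an encoded ampersand survives a single decode unchanged, which is consistent with the requirement that the decoded path match the browser path.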