Data Platform/AQS/Wikistats 2/DataQuality/Vetting of mediarequest metrics
Definition of a mediarequest
A mediarequest is a web request (as contained in the wmf.webrequest hive table) hitting any URI in upload.wikimedia.org. This host is the Wikimedia file storage server and any hit with a 200 or 206 response counts as a mediarequest:
SELECT * FROM wmf.webrequest WHERE webrequest_source='upload' AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} AND uri_host = 'upload.wikimedia.org' AND ( http_status = 200 -- No 304 per RFC discussion OR (http_status = 206 AND SUBSTR(range, 1, 8) = 'bytes=0-' AND range != 'bytes=0-0' ) )
Differences with mediacounts
This definition should match exactly that of the old mediacounts job. There are three differences between the two datasets:
- Broader split in referrer types. Mediacounts only splits referring sites in three categories: internal, external, and unknown. Mediarequests not only includes internal, external, search-engine, unknown and null, but also, if the webrequest includes a specific wiki in its referrer string, it will set said wiki as the referer. This gives the Mediarequest dataset a sort of project dimension.
- Denormalized data. The old Mediacounts dataset is pivoted, meaning that every row compiles a series of counts related to a single file in one hour. Mediarequests features one row, per referer, per hour, per file transcoding. This allows for more complex, exploratory queries than the old form. As an example, it is not possible to know how many hits did the main Wikipedia logo get on external referers, transcoded in a thumbnail size. The Mediarequests dataset, because it is denormalized, does allow for it.
- Agent type. The Mediarequests dataset includes whether the agent requesting the resource is a user or a spider.
Mediarequest hive dataset
To vet the dataset itself, we're comparing its aggregated numbers with both the Mediacounts and Webrequest values
Query in Mediacounts:
SELECT day, SUM(total) FROM wmf.mediacounts WHERE year = 2019 AND month=9 GROUP BY day SORT BY day ASC LIMIT 50;
Query in Mediarequest:
SELECT day, SUM(request_count) FROM wmf.mediarequest WHERE year = 2019 AND month=9 GROUP BY day SORT BY day ASC LIMIT 50;
Query in Webrequest:
SELECT day, count(*) FROM wmf.webrequest WHERE webrequest_source='upload' AND uri_host = 'upload.wikimedia.org' AND ( http_status = 200 -- No 304 per RFC discussion OR (http_status = 206 AND SUBSTR(range, 1, 8) = 'bytes=0-' AND range != 'bytes=0-0' ) ) AND year = 2019 AND month=9 GROUP BY day SORT BY day ASC LIMIT 50;
Aggregated data results in Google Sheets
Conclusion
There is no statistically relevant difference between the aggregated values in mediacounts and mediarequest. Differences between mediarequests and the query in webrequest are higher but none with a difference per day higher than 0.02%
Mediarequests aggregated per referer
Values obtained from the AQS endpoint were compared to those obtained by the query used in the previous section.
AQS url:
https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/2019090100/2019100100
Mediarequests per file
10 files were chosen randomly from the top 1000 and compared their values in Mediacounts for September 1st 2019.
Janky hive query for mediacounts:
SELECT base_name, sum(total) FROM wmf.mediacounts WHERE year = 2019 AND month = 9 and day = 1 AND ( base_name = '/wikipedia/commons/0/03/Flag_of_Italy.svg' OR base_name = '/wikipedia/donate/a/ac/Close_oojs.png' OR base_name = '/wikipedia/commons/f/ff/Wikidata-logo.svg' OR base_name = '/wikipedia/commons/5/51/Spherical_Coordinates_(Colatitude%2C_Longitude).svg' OR base_name = '/favicon.ico' OR base_name = '/wikipedia/commons/8/8a/Loudspeaker.svg' OR base_name = '/wikipedia/commons/7/74/Red_Pencil_Icon.png' OR base_name = '/wikipedia/commons/f/f3/Flag_of_Switzerland.svg' OR base_name = '/wikipedia/commons/e/ea/Picto_infobox_cinema.png' OR base_name = '/wikipedia/en/0/03/Flag_of_Italy.svg' ) GROUP BY base_name;
AQS URLS:
The 10 files queried matched exactly their counterparts in mediacounts:
Top media files by request count
TBA