Analytics/AQS/Wikistats 2/DataQuality/Vetting of mediarequest metrics

From Wikitech

Definition of a mediarequest

A mediarequest is a web request (as contained in the wmf.webrequest hive table) hitting any URI in upload.wikimedia.org. This host is the Wikimedia file storage server and any hit with a 200 or 206 response counts as a mediarequest:

SELECT *
FROM wmf.webrequest
WHERE webrequest_source='upload'
AND year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour}
AND uri_host = 'upload.wikimedia.org'
AND (
  http_status = 200 -- No 304 per RFC discussion
  OR (http_status = 206
    AND SUBSTR(range, 1, 8) = 'bytes=0-'
    AND range != 'bytes=0-0'
  )
)

Differences with mediacounts

This definition should match exactly that of the old mediacounts job. There are three differences between the two datasets:

  • Broader split in referrer types. Mediacounts only splits referring sites in three categories: internal, external, and unknown. Mediarequests not only includes internal, external, search-engine, unknown and null, but also, if the webrequest includes a specific wiki in its referrer string, it will set said wiki as the referer. This gives the Mediarequest dataset a sort of project dimension.
  • Denormalized data. The old Mediacounts dataset is pivoted, meaning that every row compiles a series of counts related to a single file in one hour. Mediarequests features one row, per referer, per hour, per file transcoding. This allows for more complex, exploratory queries than the old form. As an example, it is not possible to know how many hits did the main Wikipedia logo get on external referers, transcoded in a thumbnail size. The Mediarequests dataset, because it is denormalized, does allow for it.
  • Agent type. The Mediarequests dataset includes whether the agent requesting the resource is a user or a spider.

Mediarequest hive dataset

To vet the dataset itself, we're comparing its aggregated numbers with both the Mediacounts and Webrequest values

Query in Mediacounts:

SELECT day, SUM(total) FROM wmf.mediacounts WHERE year = 2019 AND month=9 GROUP BY day SORT BY day ASC LIMIT 50;

Query in Mediarequest:

SELECT day, SUM(request_count) FROM wmf.mediarequest WHERE year = 2019 AND month=9 GROUP BY day SORT BY day ASC LIMIT 50;

Query in Webrequest:

SELECT day, count(*) FROM wmf.webrequest
WHERE webrequest_source='upload'
    AND uri_host = 'upload.wikimedia.org'
    AND (
        http_status = 200 -- No 304 per RFC discussion
        OR (http_status = 206
            AND SUBSTR(range, 1, 8) = 'bytes=0-'
            AND range != 'bytes=0-0'
        )
    )
    AND year = 2019 AND month=9
GROUP BY day SORT BY day ASC LIMIT 50;

Aggregated data results in Google Sheets

Conclusion

There is no statistically relevant difference between the aggregated values in mediacounts and mediarequest. Differences between mediarequests and the query in webrequest are higher but none with a difference per day higher than 0.02%

Mediarequests aggregated per referer

Values obtained from the AQS endpoint were compared to those obtained by the query used in the previous section.

AQS url:

https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/2019090100/2019100100

Mediarequests per file

10 files were chosen randomly from the top 1000 and compared their values in Mediacounts for September 1st 2019.

Janky hive query for mediacounts:

SELECT base_name, sum(total)
FROM wmf.mediacounts
WHERE year = 2019 AND month = 9 and day = 1 AND (
    base_name = '/wikipedia/commons/0/03/Flag_of_Italy.svg'
    OR base_name = '/wikipedia/donate/a/ac/Close_oojs.png'
    OR base_name = '/wikipedia/commons/f/ff/Wikidata-logo.svg'
    OR base_name = '/wikipedia/commons/5/51/Spherical_Coordinates_(Colatitude%2C_Longitude).svg'
    OR base_name = '/favicon.ico'
    OR base_name = '/wikipedia/commons/8/8a/Loudspeaker.svg'
    OR base_name = '/wikipedia/commons/7/74/Red_Pencil_Icon.png'
    OR base_name = '/wikipedia/commons/f/f3/Flag_of_Switzerland.svg'
    OR base_name = '/wikipedia/commons/e/ea/Picto_infobox_cinema.png'
    OR base_name = '/wikipedia/en/0/03/Flag_of_Italy.svg'
)
GROUP BY base_name;

AQS URLS:

File AQS URL
/Picto_infobox_cinema.png https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F0%2F03%2FFlag_of_Italy.svg/daily/2019090100/2019090100
/favicon.ico https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fdonate%2Fa%2Fac%2FClose_oojs.png/daily/2019090100/2019090100
/Spherical_Coordinates_(Colatitude%2C_Longitude).svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Ff%2Fff%2FWikidata-logo.svg/daily/2019090100/2019090100
/Loudspeaker.svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F5%2F51%2FSpherical_Coordinates_%28Colatitude%252C_Longitude%29.svg/daily/2019090100/2019090100
/Flag_of_Italy.svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Ffavicon.ico/daily/2019090100/2019090100
/Flag_of_Switzerland.svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F8%2F8a%2FLoudspeaker.svg/daily/2019090100/2019090100
/Flag_of_Italy.svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F7%2F74%2FRed_Pencil_Icon.png/daily/2019090100/2019090100
/Red_Pencil_Icon.png https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Ff%2Ff3%2FFlag_of_Switzerland.svg/daily/2019090100/2019090100
/Wikidata-logo.svg https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2Fe%2Fea%2FPicto_infobox_cinema.png/daily/2019090100/2019090100
/Close_oojs.png https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fen%2F0%2F03%2FFlag_of_Italy.svg/daily/2019090100/2019090100

Data results in Google Sheets

The 10 files queried matched exactly their counterparts in mediacounts:

Top media files by request count

TBA