Analytics/AQS/Media metrics


The terms media counts, media requests and media views refer to different ways of measuring the number of times that images, videos and sounds from Wikimedia Commons are viewed or played. This page describes the three approaches.

Media counts

Main article: Analytics/Data Lake/Traffic/Mediacounts

The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org has been transferred to users.

See the Phabricator ticket for the long-standing community request: https://phabricator.wikimedia.org/T210313

Media requests

Media requests is the proposed name for the set of AQS endpoints that will serve media request data. It will offer roughly the same endpoints as the current pageview API:

  • Monthly aggregates per project
  • Media requests per file
  • Top 1000 files per month/day

Media requests share the caveats of media counts: many prefetches, and many requests for media that the user never actually views, are counted as valid traffic, so the metric is noisy. It is still valuable as a historical record, because any metric that verifies that a file was actually viewed by a user (media views, described below) will be instrumented as an event and cannot be backfilled.

Project as a dimension

All of our Wikistats metrics currently have a project dimension. For media requests this is tricky, because files do not intrinsically belong to a specific wiki. However, for roughly half of the webrequests to files we can retrieve the project they were requested from by using the referer string.

This means that instead of a traditional project field, these metrics will have a referer field, which can be one of {project}, internal, external, search-engine, unknown, or none.
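The following Python sketch illustrates how a raw referer string might be mapped to one of those values. It is not the production logic: the domain lists, the `classify_referer` name, the use of the hostname as the {project} value, and the exact boundaries between internal, external, and unknown are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical lists, for illustration only; the real classification may differ.
PROJECT_SUFFIXES = (".wikipedia.org", ".wiktionary.org", ".wikibooks.org",
                    ".wikiquote.org", ".wikisource.org", ".wikivoyage.org")
INTERNAL_SUFFIXES = (".wikimedia.org",)
SEARCH_ENGINE_NAMES = ("google", "bing", "duckduckgo", "yandex", "baidu")


def classify_referer(referer):
    """Return a value for the referer dimension of a media request."""
    # Assumption: a missing, empty, or "-" referer means no referer was sent.
    if referer is None or referer.strip() in ("", "-"):
        return "none"
    try:
        host = urlsplit(referer.strip()).hostname
    except ValueError:
        host = None
    if not host:
        # A referer was sent, but no hostname could be parsed from it.
        return "unknown"
    host = host.lower()
    if host.endswith(PROJECT_SUFFIXES):
        # Assumption: the hostname stands in for the {project} value.
        return host
    if host.endswith(INTERNAL_SUFFIXES):
        return "internal"
    if any(label in SEARCH_ENGINE_NAMES for label in host.split(".")):
        return "search-engine"
    return "external"


for example in ("https://en.wikipedia.org/wiki/Cat",
                "https://www.google.com/search?q=cat",
                "-"):
    print(example, "->", classify_referer(example))
```

Project hosts are checked before the broader internal suffix so that a request from a specific wiki keeps its project instead of being folded into internal.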

Study of signal/noise

As agreed during the 2019 Analytics Team offsite, to make sure that this data is useful we need to check what proportion of the requests that come in for media files are prefetches rather than media actually viewed by users. If the team considers that the data has enough value, we'll proceed with the productionization steps described below.

We calculated the proportion of mediarequests that could be prefetches triggered by Media Viewer. Assuming that each Media Viewer request (a hit to /beacon/media) generates one preview mediarequest, a week of data shows that between about 0.67% and 0.79% of all mediarequests per day are prefetches:

Week of November 11 to 17, 2019
+---+---------------+-------------+---------------------+
|day|beacon requests|mediarequests|percentage_prefetches|
+---+---------------+-------------+---------------------+
| 11|       20457718|   2628829234|   0.7782064249518309|
| 12|       20543908|   2616581505|    0.785143056340605|
| 13|       19856770|   2578136050|   0.7701986867605377|
| 14|       19691489|   2544607143|   0.7738518322629734|
| 15|       18048320|   2387139372|   0.7560647782738678|
| 16|       14803854|   2201123452|   0.6725590055636734|
| 17|       16819570|   2482787406|   0.6774470484002447|
+---+---------------+-------------+---------------------+
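The percentage column can be reproduced from the two count columns. The Python sketch below recomputes it as beacon requests divided by mediarequests, times 100, under the same assumption that each beacon hit corresponds to one prefetched mediarequest. The names `DAILY_COUNTS` and `prefetch_percentage` are illustrative, and the whole-week figure is an extra aggregate that was not part of the original study.

```python
# Daily counts copied from the table above (November 11 to 17, 2019):
# day -> (beacon requests, mediarequests)
DAILY_COUNTS = {
    11: (20457718, 2628829234),
    12: (20543908, 2616581505),
    13: (19856770, 2578136050),
    14: (19691489, 2544607143),
    15: (18048320, 2387139372),
    16: (14803854, 2201123452),
    17: (16819570, 2482787406),
}


def prefetch_percentage(beacon_requests, mediarequests):
    """Share of mediarequests attributed to prefetches, in percent.

    Assumes one prefetched mediarequest per /beacon/media hit.
    """
    return 100.0 * beacon_requests / mediarequests


percentages = {day: prefetch_percentage(beacon, media)
               for day, (beacon, media) in DAILY_COUNTS.items()}
for day, pct in percentages.items():
    print(f"November {day}: {pct:.4f}%")

# Aggregate over the whole week (not part of the original table).
total_beacon = sum(beacon for beacon, _ in DAILY_COUNTS.values())
total_media = sum(media for _, media in DAILY_COUNTS.values())
weekly = prefetch_percentage(total_beacon, total_media)
print(f"Whole week: {weekly:.2f}%")
```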

Steps to put in production

Oozie job

The current mediacounts Oozie job loads data hourly into the Hive mediacounts table and generates a daily public dump. This job needs to be modified to generate the aggregates per project. The regular expression that classifies files into media types appears to be out of date and needs to be revised.
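As a rough illustration of what extension-based classification could look like, here is a minimal Python sketch. It is not the regular expression used by the existing job: the extension lists, the type names (image, video, audio, document, other), and the `classify_media_type` helper are all hypothetical.

```python
import re

# Hypothetical extension groups, for illustration only. Some containers
# (for example .ogg) can hold audio or video; this sketch picks one type.
MEDIA_TYPE_PATTERN = re.compile(
    r"\.(?:"
    r"(?P<image>jpe?g|png|gif|svg|tiff?|webp|xcf)"
    r"|(?P<video>webm|ogv|mpe?g|mp4)"
    r"|(?P<audio>ogg|oga|opus|flac|wav|mp3|midi?)"
    r"|(?P<document>pdf|djvu)"
    r")$",
    re.IGNORECASE,
)


def classify_media_type(path):
    """Classify a file path by its final extension; unmatched paths are 'other'."""
    match = MEDIA_TYPE_PATTERN.search(path)
    if match is None:
        return "other"
    # Exactly one named group matches, so lastgroup is the media type.
    return match.lastgroup


for path in ("/wikipedia/commons/a/ab/Example.JPG",
             "/wikipedia/commons/a/ab/Example.webm",
             "/wikipedia/commons/a/ab/Example.xyz"):
    print(path, "->", classify_media_type(path))
```

Because only the final extension is inspected, a thumbnail rendered as a .png or .jpg is classified as an image regardless of the type of the original file. Whether that is the desired behaviour is a decision for the revised job.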

Endpoints in AQS

Full article: Mediarequests API

Cassandra loading

According to SRE's calculations we should be OK regarding storage capacity for these endpoints.

Media requests need to be added to the Cassandra bundle. Depending on the decisions we make about dimensionality (see note above about access-type and access-site), loading should take around 2 weeks and needs to be monitored periodically.

Wikistats UI

Unless we decide to add a new area for media, the only action item in the Wikistats UI will be to add the configuration for the three metrics and decide on their position in the dashboard, if any.

Media views

Media views are the final stage of measuring media viewership on the wikis. They are the equivalent of pageviews in that they should be a verified measure that a user has viewed a media file, either inside an article or in the media player. The approach discussed so far is to use something like a MediaView event that is triggered when an image, video, or sound is scrolled over or played.

This can't be done with the dataset that the two approaches above use (webrequests). It would require someone on the MediaWiki side to instrument the code to send these events, which we would then aggregate to generate the metrics.

This work will probably be planned for FY2020-2021, but we should start coordinating with Audiences folks to instrument code.