Analytics/Data/Mediacounts

The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org has been transferred to users.

WMF does not currently have the infrastructure to provide perfect media counts, and the current implementation has several shortcomings. However, because the community has already waited a long time for any media counts at all, we publish this imperfect data in the meantime, until WMF has the infrastructure to produce perfect media counts.

The rationale and motivation for this stream can be found in the corresponding RfC.

This stream is owned by the Analytics Team.

Contained data

Selected requests

The stream contains all requests from the upload cache group that have

  • HTTP status code 200 (OK), or
  • HTTP status code 206 (Partial Content) and a Range header that starts with bytes=0-, but is not bytes=0-0.

The first condition matches the plain fetches of image, movie and audio files. The second condition matches beginnings of streamed media.
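The two selection conditions above can be sketched as a small predicate. This is only an illustration of the rules as stated on this page, not the code that produces the stream; the function name is_counted and its parameters are hypothetical.

```python
def is_counted(status, range_header=None):
    """Return True if a request matches the selection rules described above.

    status       -- HTTP status code of the response
    range_header -- value of the request's Range header, or None if absent
    """
    # Condition 1: plain fetches of image, movie and audio files.
    if status == 200:
        return True
    # Condition 2: beginnings of streamed media. The Range header has to
    # start with "bytes=0-" but must not be exactly "bytes=0-0".
    if status == 206 and range_header is not None:
        return range_header.startswith("bytes=0-") and range_header != "bytes=0-0"
    # Everything else (for example 304 Not Modified) is not counted.
    return False
```

For example, is_counted(206, "bytes=0-1023") is true, while is_counted(206, "bytes=0-0") and is_counted(304) are both false.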

Corner cases

  • After some discussion with stakeholders (partly on-wiki, mostly by email), requests with HTTP status code 304 (Not Modified) are not counted at this point, as there seems to be more interest in media transfers than in media requests. Ideally, the stream would count media consumption or media views, but there is currently no easy way to detect those from the logs.
  • When a user consuming streamed media jumps back to the beginning of the file after having watched part of it, this counts as a new transfer.
  • When Media viewer is used to view images, some images are prefetched for a better user experience, even though they have not yet been shown to the user. Currently, those prefetched images are counted, as there is no way yet to detect whether an image was actually shown to the user.

Fields

The stream consists of the following fields

Field # Name Description
1 base_name The name of the raw, original file without the leading https?://upload.wikimedia.org

So, for example, for the original file and for each file derived from it (such as thumbnails), the base_name is /wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg.

base_name is normalized and has special characters percent-escaped.

For images from Commons, you can get the file's page by replacing the first four path segments of the base_name with https://commons.wikimedia.org/wiki/File:. So for the above base_name, the file's page on Commons is https://commons.wikimedia.org/wiki/File:Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg.
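The replacement rule above can be sketched as follows. The helper name commons_file_page is hypothetical, and the sketch assumes the base_name has exactly the layout of the example (/wikipedia/commons/ followed by two hash directories and the file name); other layouts are rejected rather than guessed at.

```python
COMMONS_FILE_PAGE_PREFIX = "https://commons.wikimedia.org/wiki/File:"


def commons_file_page(base_name):
    """Map a Commons base_name to the URL of the file's page on Commons."""
    # "/wikipedia/commons/e/ec/Name.jpg" splits into
    # ["", "wikipedia", "commons", "e", "ec", "Name.jpg"].
    parts = base_name.split("/")
    if len(parts) != 6 or parts[1:3] != ["wikipedia", "commons"]:
        raise ValueError("not a Commons base_name in the expected layout: %r" % base_name)
    # Drop the first four path segments and keep the (already
    # percent-escaped) file name unchanged.
    return COMMONS_FILE_PAGE_PREFIX + parts[5]
```

Applied to the Mona Lisa base_name above, this returns the Commons file page URL given in the text.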

2 total_response_size Total number of response bytes sent to the users for that file (and its transcodings).
3 total Total number of transfers (a transfer of the raw, original file and a transfer of a tiny thumb each count as 1).
4 original Total number of transfers of the raw, original file (transcodings, thumbs and the like are not counted here). Note that this includes JPG images embedded in pages without the thumb parameter or equivalent, as well as "thumbnails" requested at a resolution equal to or higher than the original image's resolution: in both cases the original image is embedded directly (and downloaded upon visiting the page), rather than a derivative image being generated.
5 transcoded_audio Total number of transfers of a file that got transcoded to an audio file. So for example when a FLAC file is requested as an OGG file, the request is counted in this column. (Transfers of the raw, original FLAC file would get counted in the original column.)
6 n/a Reserved for future use.
7 n/a Reserved for future use.
8 transcoded_image Total number of transfers of a file that got transcoded to an image file. So for example when a WebM file or a GIF file is requested as a JPG file, the request is counted in this column. Note that this seems to include (all?) thumbnails as well: the value is also higher than 0 for JPG images, which are rescaled to JPG rather than converted to another format. (Transfers of the raw, original WebM file or the raw, original GIF file would get counted in the original column.)
9 transcoded_image_0_199 Total number of transfers of a file that got transcoded to an image file, where 0 <= width <= 199. (This is a drill-down of the transcoded_image column.)
10 transcoded_image_200_399 Total number of transfers of a file that got transcoded to an image file, where 200 <= width <= 399. (This is a drill-down of the transcoded_image column.)
11 transcoded_image_400_599 Total number of transfers of a file that got transcoded to an image file, where 400 <= width <= 599. (This is a drill-down of the transcoded_image column.)
12 transcoded_image_600_799 Total number of transfers of a file that got transcoded to an image file, where 600 <= width <= 799. (This is a drill-down of the transcoded_image column.)
13 transcoded_image_800_999 Total number of transfers of a file that got transcoded to an image file, where 800 <= width <= 999. (This is a drill-down of the transcoded_image column.)
14 transcoded_image_1000 Total number of transfers of a file that got transcoded to an image file, where 1000 <= width. (This is a drill-down of the transcoded_image column.)
15 n/a Reserved for future use.
16 n/a Reserved for future use.
17 transcoded_movie Total number of transfers of a file that got transcoded to a movie file. So for example when a WebM file is requested as an OGV file, the request is counted in this column. (Transfers of the raw, original WebM file would get counted in the original column.)
18 transcoded_movie_0_239 Total number of transfers of a file that got transcoded to a movie file, where 0 <= height <= 239. (This is a drill-down of the transcoded_movie column.)
19 transcoded_movie_240_479 Total number of transfers of a file that got transcoded to a movie file, where 240 <= height <= 479. (This is a drill-down of the transcoded_movie column.)
20 transcoded_movie_480 Total number of transfers of a file that got transcoded to a movie file, where 480 <= height. (This is a drill-down of the transcoded_movie column.)
21 n/a Reserved for future use.
22 n/a Reserved for future use.
23 referer_internal Total number of transfers with a Referer from a WMF domain.
24 referer_external Total number of transfers with a Referer from a non-WMF domain.
25 referer_unknown Total number of transfers with an empty or invalid Referer.
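As a minimal sketch of how a consumer might read one row with the 25 fields listed above, the following splits a tab-separated line and maps the columns to names. It assumes that every row has exactly 25 columns and that all columns other than base_name and the reserved ones hold decimal integers; neither assumption is stated on this page. The names FIELD_NAMES and parse_line are hypothetical, and referer_unknown is used here as the key for field 25 (empty or invalid Referer).

```python
# Column names by position; None marks the columns reserved for future use.
FIELD_NAMES = [
    "base_name", "total_response_size", "total", "original",
    "transcoded_audio", None, None,
    "transcoded_image",
    "transcoded_image_0_199", "transcoded_image_200_399",
    "transcoded_image_400_599", "transcoded_image_600_799",
    "transcoded_image_800_999", "transcoded_image_1000",
    None, None,
    "transcoded_movie",
    "transcoded_movie_0_239", "transcoded_movie_240_479",
    "transcoded_movie_480",
    None, None,
    "referer_internal", "referer_external", "referer_unknown",
]


def parse_line(line):
    """Parse one tab-separated row into a dict keyed by field name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELD_NAMES):
        raise ValueError("expected %d columns, got %d" % (len(FIELD_NAMES), len(columns)))
    record = {}
    for name, value in zip(FIELD_NAMES, columns):
        if name is None:
            continue  # reserved column: content is not interpreted
        # Assumption: every named column except base_name is an integer.
        record[name] = value if name == "base_name" else int(value)
    return record
```

Reserved columns are skipped rather than parsed, so the sketch does not depend on what they contain.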

Availability

dumps.wikimedia.org

The stream is available as daily TSV files at http://dumps.wikimedia.org/other/mediacounts/ and http://wikimedia.crc.nd.edu/other/mediacounts/ .

stat1002.eqiad.wmnet

The stream is available as daily TSV files at /mnt/hdfs/wmf/data/archive/mediacounts on stat1002.

Analytics cluster

The stream is available as daily TSV files at /wmf/data/archive/mediacounts in the Analytics cluster.

In addition to those files, the data is also available at hourly granularity in Parquet format at /wmf/data/wmf/mediacounts, which is accessible in Hive through the wmf.mediacounts table.

Clients

API

Events and known problems since 2015-01-01

Date from Date until Bug Details