Data Platform/Data Lake/Traffic/Mediacounts

The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org has been transferred to users.

WMF currently does not have the infrastructure to provide perfect media counts, and the current implementation has several shortcomings. Because the community had already been waiting a long time for any media counts, we published this imperfect data anyway, so that something is available until WMF has the infrastructure to produce perfect media counts.

The rationale and motivation for this dataset can be found in the corresponding RfC. See also the March 2015 announcement email for this feed.

This dataset is owned by the Analytics Team and can be found here: https://dumps.wikimedia.org/other/mediacounts/daily/

There is an API that provides programmatic access to this data. See the Wikimedia Analytics API documentation.

Contained data

Selected requests

The data contains all requests from the upload cache group that have

HTTP status code 200 (OK), or
HTTP status code 206 (Partial Content) and a Range header that starts with bytes=0-, but is not bytes=0-0.

The first condition matches the plain fetches of image, movie and audio files. The second condition matches beginnings of streamed media.
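The two conditions above can be sketched as a small predicate. This is only an illustration of the rule as described on this page, not the code used by the production pipeline; the function name `is_counted_request` and its parameters are hypothetical.

```python
def is_counted_request(status, range_header=None):
    """Return True if a request matches the selection rule described above.

    Hypothetical helper: `status` is the HTTP status code as an int and
    `range_header` is the raw value of the Range request header (or None).
    """
    # Condition 1: plain fetches answered with 200 (OK).
    if status == 200:
        return True
    # Condition 2: 206 (Partial Content) whose range starts at byte 0,
    # excluding the one-byte probe "bytes=0-0".
    if status == 206 and range_header is not None:
        value = range_header.strip()
        return value.startswith("bytes=0-") and value != "bytes=0-0"
    # Everything else (for example 304 Not Modified) is not counted.
    return False
```

For instance, `is_counted_request(206, "bytes=0-")` is accepted, while `is_counted_request(206, "bytes=0-0")` and `is_counted_request(304)` are rejected.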

Corner cases

After some discussion with stakeholders (partly on-wiki, mostly by email), requests with HTTP status code 304 (Not Modified) are not counted at this point, as there seems to be more interest in media transfers than in media requests. Ideally, the counts would reflect media consumption or media views, but there is currently no easy way to detect that from the logs.
When a user consumes streamed media and jumps back to the beginning of the file after having watched part of it, this counts as a new transfer.
When Media Viewer is used to view images, some images are prefetched for a better user experience but may not yet have been shown to the user. Currently, those prefetched images are counted, as there is as of now no way to detect whether an image was actually shown to the user. The number of preloads might be as high as 50% of total requests for the file types supported by Media Viewer.

Fields

The dataset consists of the following fields:

Field #	Name	Description
1	`base_name`	The name of the raw, original file without the leading `https?://upload.wikimedia.org`. For example, for each of the URLs https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/161px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/402px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/687px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg , the `base_name` is `/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg`. `base_name` is normalized and has special characters percent-escaped. For images from Commons, you can get the file's page by replacing the first four path segments of the `base_name` with https://commons.wikimedia.org/wiki/File: . For the `base_name` above, the file's page on Commons is https://commons.wikimedia.org/wiki/File:Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg .
2	`total_response_size`	Total number of response bytes sent to the users for that file (and its transcodings).
3	`total`	Total number of transfers (each transfer counts as 1, whether it is of the raw, original file or of a tiny thumbnail).
4	`original`	Total number of transfers of the raw, original file (transcodings, thumbs and the like are not counted here). Note that this includes JPG images embedded in pages without the thumb parameter or equivalent, as well as "thumbnails" requested at a resolution equal to or higher than the original image's resolution: in both cases the original image is embedded directly (and downloaded upon visiting the page), rather than a derivative image being generated. See example.
5	`transcoded_audio`	Total number of transfers of a file that got transcoded to an audio file. So for example when a FLAC file is requested as OGG file, the request is counted in this column. (Transfers of the raw, original FLAC file would get counted in the `original` column.)
6	n/a	Reserved for future use.
7	n/a	Reserved for future use.
8	`transcoded_image`	Total number of transfers of a file that got transcoded to an image file. So for example when a WebM file, or a GIF file is requested as JPG file, the request is counted in this column. Note, this seems to include (all?) thumbnails as well: the value is higher than 0 also for jpg images, which are rescaled to jpg rather than converted to other formats. (Transfers for the raw, original WebM, or the raw, original GIF file, would get counted in the `original` column.)
9	`transcoded_image_0_199`	Total number of transfers of a file that got transcoded to an image file, where 0 <= width <= 199. (This is a drill-down of the `transcoded_image` column.)
10	`transcoded_image_200_399`	Total number of transfers of a file that got transcoded to an image file, where 200 <= width <= 399. (This is a drill-down of the `transcoded_image` column.)
11	`transcoded_image_400_599`	Total number of transfers of a file that got transcoded to an image file, where 400 <= width <= 599. (This is a drill-down of the `transcoded_image` column.)
12	`transcoded_image_600_799`	Total number of transfers of a file that got transcoded to an image file, where 600 <= width <= 799. (This is a drill-down of the `transcoded_image` column.)
13	`transcoded_image_800_999`	Total number of transfers of a file that got transcoded to an image file, where 800 <= width <= 999. (This is a drill-down of the `transcoded_image` column.)
14	`transcoded_image_1000`	Total number of transfers of a file that got transcoded to an image file, where 1000 <= width. (This is a drill-down of the `transcoded_image` column.)
15	n/a	Reserved for future use.
16	n/a	Reserved for future use.
17	`transcoded_movie`	Total number of transfers of a file that got transcoded to a movie file. So for example when a WebM file is requested as OGV file, the request is counted in this column. (Transfers for the raw, original WebM file, would get counted in the `original` column.)
18	`transcoded_movie_0_239`	Total number of transfers of a file that got transcoded to a movie file, where 0 <= height <= 239. (This is a drill-down of the `transcoded_movie` column.)
19	`transcoded_movie_240_479`	Total number of transfers of a file that got transcoded to a movie file, where 240 <= height <= 479. (This is a drill-down of the `transcoded_movie` column.)
20	`transcoded_movie_480`	Total number of transfers of a file that got transcoded to a movie file, where 480 <= height. (This is a drill-down of the `transcoded_movie` column.)
21	n/a	Reserved for future use.
22	n/a	Reserved for future use.
23	`referer_internal`	Total number of transfers with a Referer from a WMF domain.
24	`referer_external`	Total number of transfers with a Referer from a non-WMF domain.
25	`referer_unknown`	Total number of transfers with an empty or invalid Referer.
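The `base_name` to Commons file page mapping described for field 1 can be sketched as follows. This is a minimal illustration that assumes the `base_name` has exactly the shape shown in the example (`/wikipedia/commons/<x>/<xy>/<file name>`); the function name `commons_file_page` is hypothetical, and other path layouts are rejected rather than guessed at.

```python
COMMONS_FILE_PAGE_PREFIX = "https://commons.wikimedia.org/wiki/File:"


def commons_file_page(base_name):
    """Map a Commons base_name to the URL of the file's page on Commons.

    Hypothetical helper; assumes base_name looks like
    /wikipedia/commons/<x>/<xy>/<file name>.
    """
    # Splitting "/a/b/c/d/name" on "/" yields ["", "a", "b", "c", "d", "name"].
    parts = base_name.split("/")
    if len(parts) != 6 or parts[:3] != ["", "wikipedia", "commons"]:
        raise ValueError(f"not a plain Commons base_name: {base_name!r}")
    # Drop the first four path segments and keep the (percent-escaped) file name.
    return COMMONS_FILE_PAGE_PREFIX + parts[5]
```

Because the file name is kept exactly as it appears in `base_name`, the percent-escaping is carried over into the resulting URL unchanged.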

Availability

dumps.wikimedia.org

The stream is available as daily TSV files at http://dumps.wikimedia.org/other/mediacounts/ and http://wikimedia.crc.nd.edu/other/mediacounts/ .
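As a rough sketch of how one line of such a TSV file could be read, the snippet below maps the 25 columns from the Fields table onto a dictionary and skips the reserved columns. The names `FIELDS` and `parse_row` are hypothetical, and the code assumes every non-reserved column other than `base_name` holds a decimal integer; decompressing or downloading the files is out of scope here.

```python
# Column names in file order; None marks a column reserved for future use.
FIELDS = [
    "base_name", "total_response_size", "total", "original",
    "transcoded_audio",
    None, None,
    "transcoded_image", "transcoded_image_0_199", "transcoded_image_200_399",
    "transcoded_image_400_599", "transcoded_image_600_799",
    "transcoded_image_800_999", "transcoded_image_1000",
    None, None,
    "transcoded_movie", "transcoded_movie_0_239", "transcoded_movie_240_479",
    "transcoded_movie_480",
    None, None,
    "referer_internal", "referer_external", "referer_unknown",
]


def parse_row(line):
    """Parse one tab-separated mediacounts line into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    row = {}
    for name, value in zip(FIELDS, cols):
        if name is None:
            continue  # reserved column, content ignored
        # base_name stays a string; all other fields are assumed to be integers.
        row[name] = value if name == "base_name" else int(value)
    return row
```

A caller would typically iterate over the lines of a file and call `parse_row` on each one, for example to sum `total` per `base_name`.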

stat machines

The stream is available as daily TSV files at /mnt/hdfs/wmf/data/archive/mediacounts on stat machines.

Analytics cluster

The stream is available as daily TSV files at /wmf/data/archive/mediacounts in the Analytics cluster.

In addition to those files, the data is also available at hourly granularity in Parquet format at /wmf/data/wmf/mediacounts, which is accessible in Hive through the wmf.mediacounts table.

hive (wmf)> desc mediacounts;
OK
col_name	data_type	comment
base_name           	string              	Base name of media file
total_response_size 	bigint              	Total number of bytes
total               	bigint              	Total #
original            	bigint              	Sum for the raw, original file
transcoded_audio    	bigint              	Sum for audio
transcoded_image    	bigint              	Sum for image (any width)
transcoded_image_0_199	bigint              	Sum for image (0 <= width <= 199)
transcoded_image_200_399	bigint              	Sum for image (200 <= width <= 399)
transcoded_image_400_599	bigint              	Sum for image (400 <= width <= 599)
transcoded_image_600_799	bigint              	Sum for image (600 <= width <= 799)
transcoded_image_800_999	bigint              	Sum for image (800 <= width <= 999)
transcoded_image_1000	bigint              	Sum for image (1000 <= width)
transcoded_movie    	bigint              	Sum for movie (any height)
transcoded_movie_0_239	bigint              	Sum for movie (0 <= height <= 239)
transcoded_movie_240_479	bigint              	Sum for movie (240 <= height <= 479)
transcoded_movie_480	bigint              	Sum for movie (480 <= height)
referer_internal    	bigint              	Sum for WMF referers
referer_external    	bigint              	Sum for refers from non-WMF domains
referer_unknown     	bigint              	Sum for empty/invalid referers
year                	int                 	Unpadded year
month               	int                 	Unpadded month
day                 	int                 	Unpadded day
hour                	int                 	Unpadded hour
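Given the columns listed above, a query for the most-transferred files of one day might be assembled as sketched below. The helper name `daily_totals_query` is hypothetical, and the sketch assumes that filtering on `year`, `month` and `day` is the intended way to select a day of hourly data; the resulting string would still need to be run through whatever Hive client is available.

```python
def daily_totals_query(year, month, day, limit=10):
    """Build a HiveQL query summing hourly rows of wmf.mediacounts for one day.

    Hypothetical helper; the arguments are coerced to int so that only
    numbers end up in the query text.
    """
    return (
        "SELECT base_name, SUM(total) AS total_transfers\n"
        "FROM wmf.mediacounts\n"
        f"WHERE year = {int(year)} AND month = {int(month)} AND day = {int(day)}\n"
        "GROUP BY base_name\n"
        "ORDER BY total_transfers DESC\n"
        f"LIMIT {int(limit)}"
    )
```

The date parts are passed unpadded (for example `month=3`, not `03`), matching the column comments in the table description.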

Clients

mediacounts-stats.py can filter statistics for a specific file or category of files, keeping the same CSV format (example).
commons-media-views compacts the entire dataset to have only one row per filename and outputs the table in JSON format (example).

Events and known problems since 2015-01-01

Date from	Date until	Bug	Details