Data Platform/Data Lake/Traffic/Mediacounts
The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org
has been transferred to users.
WMF does not currently have the infrastructure to produce perfect media counts, and the current implementation has several shortcomings. Because the community had already been waiting a long time for any media counts, this imperfect data is published anyway, until WMF has the infrastructure to produce perfect counts.
The rationale and motivation for this dataset can be found in the corresponding RfC. See also the March 2015 announcement email for this feed.
This dataset is owned by the Analytics Team and can be found here: https://dumps.wikimedia.org/other/mediacounts/daily/
There is an API that provides programmatic access to this data. See the Wikimedia Analytics API documentation.
Contained data
Selected requests
The data contains all requests from the upload cache group that have
- HTTP status code 200 (OK), or
- HTTP status code 206 (Partial Content) and a Range header that starts with bytes=0- but is not bytes=0-0.
The first condition matches the plain fetches of image, movie and audio files. The second condition matches beginnings of streamed media.
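The two conditions above can be sketched as a small predicate. This is a minimal illustration, not the production code: the function name `is_counted` and its parameters are hypothetical, and the check that a request belongs to the upload cache group is assumed to have happened already.

```python
from typing import Optional


def is_counted(status: int, range_header: Optional[str] = None) -> bool:
    """Return True if a request matches the selection rules described above.

    Hypothetical helper: assumes the request already comes from the
    upload cache group.
    """
    # Condition 1: plain fetch of a media file.
    if status == 200:
        return True
    # Condition 2: beginning of streamed media, i.e. a Range header that
    # starts with "bytes=0-" but is not exactly "bytes=0-0".
    if status == 206 and range_header is not None:
        value = range_header.strip()
        return value.startswith("bytes=0-") and value != "bytes=0-0"
    # Everything else (including 304 Not Modified) is not counted.
    return False
```

For example, `is_counted(206, "bytes=0-1023")` is accepted, while `is_counted(206, "bytes=0-0")` and `is_counted(304)` are rejected.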
Corner cases
- After some discussion with stakeholders (partly on-wiki, mostly by email), requests with HTTP status code 304 (Not Modified) are not counted at this point, as there seems to be more interest in media transfers than in media requests. Ideally, the counts would reflect media consumption or media views, but there is currently no easy way to detect that from the logs.
- Jumping back to the beginning of a streamed media file after having watched part of it counts as a new transfer.
- When Media Viewer is used to view images, some images are prefetched for a better user experience but may not yet have been shown to the user. Currently, those prefetched images are counted, as there is no way to detect whether an image was actually shown to the user. Preloads might account for as much as 50% of total requests for the file types supported by Media Viewer.
Fields
The dataset consists of the following fields:
Field # | Name | Description |
---|---|---|
1 | base_name | The name of the raw, original file without the leading https?://upload.wikimedia.org. So for example for each of …, the … For images from Commons, you can get the file's page by replacing the first four path segments of the … |
2 | total_response_size | Total number of response bytes sent to the users for that file (and its transcodings). |
3 | total | Total number of transfers (counting both transfers of the raw, original and tiny thumbs as 1). |
4 | original | Total number of transfers of the raw, original file (transcodings, thumbs and the like are not counted here). Note that this includes JPG images embedded in pages without the thumb parameter or equivalent, as well as "thumbnails" requested at a resolution equal to or higher than the original image's resolution: in both cases the original image is embedded directly (and downloaded upon visiting the page), rather than a derivative image being generated. See example. |
5 | transcoded_audio | Total number of transfers of a file that got transcoded to an audio file. So for example when a FLAC file is requested as an OGG file, the request is counted in this column. (Transfers of the raw, original FLAC file would get counted in the original column.) |
6 | n/a | Reserved for future use. |
7 | n/a | Reserved for future use. |
8 | transcoded_image | Total number of transfers of a file that got transcoded to an image file. So for example when a WebM file or a GIF file is requested as a JPG file, the request is counted in this column. Note that this seems to include (all?) thumbnails as well: the value is higher than 0 also for JPG images, which are rescaled to JPG rather than converted to other formats. (Transfers of the raw, original WebM or GIF file would get counted in the original column.) |
9 | transcoded_image_0_199 | Total number of transfers of a file that got transcoded to an image file, where 0 <= width <= 199. (This is a drill-down of the transcoded_image column.) |
10 | transcoded_image_200_399 | Total number of transfers of a file that got transcoded to an image file, where 200 <= width <= 399. (This is a drill-down of the transcoded_image column.) |
11 | transcoded_image_400_599 | Total number of transfers of a file that got transcoded to an image file, where 400 <= width <= 599. (This is a drill-down of the transcoded_image column.) |
12 | transcoded_image_600_799 | Total number of transfers of a file that got transcoded to an image file, where 600 <= width <= 799. (This is a drill-down of the transcoded_image column.) |
13 | transcoded_image_800_999 | Total number of transfers of a file that got transcoded to an image file, where 800 <= width <= 999. (This is a drill-down of the transcoded_image column.) |
14 | transcoded_image_1000 | Total number of transfers of a file that got transcoded to an image file, where 1000 <= width. (This is a drill-down of the transcoded_image column.) |
15 | n/a | Reserved for future use. |
16 | n/a | Reserved for future use. |
17 | transcoded_movie | Total number of transfers of a file that got transcoded to a movie file. So for example when a WebM file is requested as an OGV file, the request is counted in this column. (Transfers of the raw, original WebM file would get counted in the original column.) |
18 | transcoded_movie_0_239 | Total number of transfers of a file that got transcoded to a movie file, where 0 <= height <= 239. (This is a drill-down of the transcoded_movie column.) |
19 | transcoded_movie_240_479 | Total number of transfers of a file that got transcoded to a movie file, where 240 <= height <= 479. (This is a drill-down of the transcoded_movie column.) |
20 | transcoded_movie_480 | Total number of transfers of a file that got transcoded to a movie file, where 480 <= height. (This is a drill-down of the transcoded_movie column.) |
21 | n/a | Reserved for future use. |
22 | n/a | Reserved for future use. |
23 | referer_internal | Total number of transfers with a Referer from a WMF domain. |
24 | referer_external | Total number of transfers with a Referer from a non-WMF domain. |
25 | referer_unknown | Total number of transfers with an empty or invalid Referer. |
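As an illustration of the field layout above, the following sketch maps one tab-separated line to a dictionary keyed by field name. It assumes that every line has exactly 25 tab-separated columns and that all non-reserved numeric columns hold plain decimal integers; the names `FIELD_NAMES` and `parse_row` are hypothetical, and the sample line (including the `-` placeholders in reserved columns) is made up for the example.

```python
from typing import Dict, List, Optional, Union

# Column names in file order; None marks the "reserved for future use" columns.
FIELD_NAMES: List[Optional[str]] = [
    "base_name", "total_response_size", "total", "original",
    "transcoded_audio", None, None,
    "transcoded_image", "transcoded_image_0_199", "transcoded_image_200_399",
    "transcoded_image_400_599", "transcoded_image_600_799",
    "transcoded_image_800_999", "transcoded_image_1000", None, None,
    "transcoded_movie", "transcoded_movie_0_239", "transcoded_movie_240_479",
    "transcoded_movie_480", None, None,
    "referer_internal", "referer_external", "referer_unknown",
]


def parse_row(line: str) -> Dict[str, Union[str, int]]:
    """Parse one TSV line into a dict, skipping the reserved columns."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} columns, got {len(parts)}")
    row: Dict[str, Union[str, int]] = {}
    for name, value in zip(FIELD_NAMES, parts):
        if name is None:
            continue  # reserved column: content is ignored
        row[name] = value if name == "base_name" else int(value)
    return row


# Made-up sample line, not real data.
sample = "\t".join([
    "/example/path/File.jpg", "1000", "10", "4", "0", "-", "-",
    "6", "1", "2", "3", "0", "0", "0", "-", "-",
    "0", "0", "0", "0", "-", "-", "7", "2", "1",
])
row = parse_row(sample)
```

Because reserved columns are skipped by position rather than by content, the sketch does not depend on what those columns actually contain.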
Availability
dumps.wikimedia.org
The stream is available as daily TSV files at http://dumps.wikimedia.org/other/mediacounts/ and http://wikimedia.crc.nd.edu/other/mediacounts/.
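Once a daily file has been downloaded and decompressed if necessary, it can be processed line by line. The sketch below finds the files with the most transfers in one day. It assumes tab-separated lines with base_name in the first column and total in the third, as in the field list on this page; the function name `top_n_by_total` is hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def top_n_by_total(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n (base_name, total) pairs with the highest total."""

    def rows():
        for line in lines:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            # Column 1 is base_name, column 3 is total.
            yield parts[0], int(parts[2])

    # nlargest keeps only n items in memory, which matters for large files.
    return heapq.nlargest(n, rows(), key=lambda item: item[1])
```

Any iterable of lines works as input, for example an open text file object for a locally stored copy of one day's data.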
stat machines
The stream is available as daily TSV files at /mnt/hdfs/wmf/data/archive/mediacounts on stat machines.
Analytics cluster
The stream is available as daily TSV files at /wmf/data/archive/mediacounts in the Analytics cluster.
In addition to those files, the data is also available at hourly granularity in Parquet format at /wmf/data/wmf/mediacounts, which is accessible in Hive through the wmf.mediacounts table.
    hive (wmf)> desc mediacounts;
    OK
    col_name                  data_type  comment
    base_name                 string     Base name of media file
    total_response_size       bigint     Total number of bytes
    total                     bigint     Total #
    original                  bigint     Sum for the raw, original file
    transcoded_audio          bigint     Sum for audio
    transcoded_image          bigint     Sum for image (any width)
    transcoded_image_0_199    bigint     Sum for image (0 <= width <= 199)
    transcoded_image_200_399  bigint     Sum for image (200 <= width <= 399)
    transcoded_image_400_599  bigint     Sum for image (400 <= width <= 599)
    transcoded_image_600_799  bigint     Sum for image (600 <= width <= 799)
    transcoded_image_800_999  bigint     Sum for image (800 <= width <= 999)
    transcoded_image_1000     bigint     Sum for image (1000 <= width)
    transcoded_movie          bigint     Sum for movie (any height)
    transcoded_movie_0_239    bigint     Sum for movie (0 <= height <= 239)
    transcoded_movie_240_479  bigint     Sum for movie (240 <= height <= 479)
    transcoded_movie_480      bigint     Sum for movie (480 <= height)
    referer_internal          bigint     Sum for WMF referers
    referer_external          bigint     Sum for refers from non-WMF domains
    referer_unknown           bigint     Sum for empty/invalid referers
    year                      int        Unpadded year
    month                     int        Unpadded month
    day                       int        Unpadded day
    hour                      int        Unpadded hour
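Since the partition columns year, month, day and hour are unpadded integers, queries against wmf.mediacounts should filter on integer values rather than zero-padded strings. The following sketch builds such a filter and an example query string. The helper names `partition_predicate` and `top_files_query` are hypothetical, and the query itself is only an illustration of how the columns listed above might be used.

```python
from datetime import datetime


def partition_predicate(hour_start: datetime) -> str:
    """Build a WHERE clause fragment selecting one hourly partition.

    Integer attributes of datetime are already unpadded, matching the
    partition columns of the table.
    """
    return (
        f"year = {hour_start.year} AND month = {hour_start.month} "
        f"AND day = {hour_start.day} AND hour = {hour_start.hour}"
    )


def top_files_query(hour_start: datetime, limit: int = 10) -> str:
    """Return an illustrative query for the most transferred files in one hour."""
    return (
        "SELECT base_name, total, total_response_size\n"
        "FROM wmf.mediacounts\n"
        f"WHERE {partition_predicate(hour_start)}\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )
```

Restricting a query to specific partitions in this way avoids scanning the whole table.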
Clients
- mediacounts-stats.py can filter statistics for a specific file or category of files, keeping the same CSV format (example).
- commons-media-views compacts the entire dataset to have only one row per filename and outputs the table in JSON format (example)
Events and known problems since 2015-01-01
Date from | Date until | Bug | Details |
---|---|---|---|
See also
- The code that calculates the numbers from the webrequest table (also uses e.g. this UDF)