Commons Impact Metrics/Dumps

From Wikitech

The Commons Impact Metrics dumps consist of 5 datasets updated on a monthly schedule. They are formatted as TSV (tab-separated values) and compressed with Bzip2. Some fields contain lists of strings, in which case the strings are separated by | (pipe) symbols. You can download them from https://dumps.wikimedia.org/other/commons_impact_metrics/readme.html.
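The format described above can be read with the Python standard library alone. The following is a minimal sketch, not an official client: the function names `split_list` and `read_dump` are hypothetical, and it assumes the files are UTF-8 encoded, carry no header row, and use no quoting. If the files do start with a header row, skip the first record.

```python
import bz2
import csv


def split_list(value):
    """Split a pipe-separated list field; an empty field gives an empty list."""
    return value.split("|") if value else []


def read_dump(path, fieldnames, list_fields=()):
    """Yield one dict per row of a Bzip2-compressed TSV file.

    fieldnames: column names in file order (assumption: the file has no header).
    list_fields: names of the columns that hold pipe-separated lists.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in titles as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for values in reader:
            row = dict(zip(fieldnames, values))
            for name in list_fields:
                row[name] = split_list(row.get(name, ""))
            yield row
```

All values are returned as strings, so numeric fields such as media_file_count still need an explicit `int(...)` conversion by the caller.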

Category metrics snapshot

Field Type Description
category string The name of the category this row refers to. Coincides with the page title of the category page in Commons. URL version (with underscores).
parent_categories list<string> The immediate ancestor (parent) category names of this row's category.
primary_categories list<string> The top ancestor category names of this row’s category. They should be in the Commons institution category allow-list. Ideally, there should be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories.
media_file_count int The number of media files contained in this (shallow) category.
media_file_count_deep int The number of media files contained in this (deep) category tree. Only available for primary allow-listed categories.
used_media_file_count int The number of media files from this (shallow) category featured in at least one wiki page.
used_media_file_count_deep int The number of media files from this (deep) category tree featured in at least one wiki page. Only available for primary allow-listed categories.
leveraging_wiki_count int The number of wikis featuring at least one of this (shallow) category’s media files.
leveraging_wiki_count_deep int The number of wikis featuring at least one of this (deep) category tree’s media files. Only available for primary allow-listed categories.
leveraging_page_count int The number of (namespace=0) pages featuring at least one of this (shallow) category’s media files.
leveraging_page_count_deep int The number of (namespace=0) pages featuring at least one of this (deep) category tree’s media files. Only available for primary allow-listed categories.
month string The month after the end of which the data is calculated (YYYY-MM). For example, if the data is calculated after March 2024 (even if the calculation runs on, e.g., April 4th), the value is “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits).

Notes:

  • Each row corresponds to a category or sub-category.
  • The metric values (int) are not aggregatable. All queries to this table should always filter or break down by category and month.
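The month label and the filtering rule above can be sketched in a few lines. This is an illustration only: `snapshot_month` and `select_category` are hypothetical helper names, and the rows are assumed to be dicts keyed by the field names of this table.

```python
from datetime import date


def snapshot_month(run_date):
    """Return the YYYY-MM label of the last month that ended before run_date."""
    if run_date.month == 1:
        return f"{run_date.year - 1:04d}-12"
    return f"{run_date.year:04d}-{run_date.month - 1:02d}"


def select_category(rows, category, month):
    """Return the rows for one category in one monthly snapshot.

    The metrics are read as-is, never summed across categories or months.
    """
    return [r for r in rows if r["category"] == category and r["month"] == month]


print(snapshot_month(date(2024, 4, 4)))  # 2024-03
```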

Media file metrics snapshot

Field Type Description
media_file string The name of the media file this row refers to. Coincides with the page title of the media file page in Commons. URL version (with underscores).
media_type string The media type of the media file, coming from the Image table (img_media_type): BITMAP, VIDEO, etc.
categories list<string> The category names that the media file is directly associated with.
primary_categories list<string> The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there should be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories.
leveraging_wiki_count long The number of wikis featuring this media file in at least one (namespace=0) page.
leveraging_page_count long The number of (namespace=0) pages featuring this media file across all wikis.
month string The month after the end of which the data is calculated (YYYY-MM). For example, if the data is calculated after March 2024 (even if the calculation runs on, e.g., April 4th), the value is “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits).

Notes:

  • Each row corresponds to a media file. Media files that are not used in any wiki (leveraging_wiki_count=0) do not appear in this dataset.
  • The metric values are not aggregatable. Queries to this dataset should always filter or break down by media_file and month.
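Because unused media files are left out of this dataset, a lookup has to handle missing keys. The sketch below indexes rows by (media_file, month) and treats a missing entry as zero usage. The names `index_usage` and `usage` are hypothetical, and the zero default assumes the file is otherwise in scope for the dataset.

```python
def index_usage(rows):
    """Map (media_file, month) to its (wiki count, page count) pair."""
    return {
        (r["media_file"], r["month"]): (
            int(r["leveraging_wiki_count"]),
            int(r["leveraging_page_count"]),
        )
        for r in rows
    }


def usage(index, media_file, month):
    """Look up one file in one month; absent files are treated as unused."""
    return index.get((media_file, month), (0, 0))
```

Each lookup returns the values of a single row, which keeps the query within the rule of filtering by media_file and month.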

Pageviews by category

Field Type Description
category string The name of the category this row refers to. Coincides with the page title of the category page in Commons. URL version (with underscores).
category_scope string Either “shallow” (meaning only media files directly associated with the category were used to aggregate pageviews) or “deep” (meaning all media files within the category and all its recursive subcategories were used to aggregate pageviews).
primary_categories list<string> The top ancestor category names of this row’s category. They should be in the Commons institution category allow-list. Ideally, there should be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories.
wiki string The canonical name of the viewed wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title string The title of the viewed (namespace=0) page. URL version (with underscores). Only (namespace=0) pages featuring at least one media file of the corresponding category will appear here.
pageview_count long Aggregated pageview count for (namespace=0) pages featuring at least one media file from the category/scope. Rows with pageview_count=0 are omitted.
month string The month for which we aggregate the data (YYYY-MM).

Notes:

  • This dataset aggregates pageview counts for (namespace=0) wiki pages that include media files belonging to the specified category.
  • Each category (or sub-category) has one row for each page that includes its media files. Each page has the corresponding pageview count.
  • Primary categories have data for category_scope=shallow (media files associated directly with them) and for category_scope=deep (media files belonging to their whole category tree). Sub-categories only have shallow data.
  • You can aggregate the pageview_count value only across the wiki, page_title, and month dimensions. All queries to this table should always filter or break down by category and category_scope.
  • Pageviews to the Main page are not counted.
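The aggregation rule above can be illustrated as follows: fix category and category_scope, then sum pageview_count over the remaining dimensions. This sketch groups by wiki and sums across page_title and month. The function name `pageviews_by_wiki` is hypothetical, and rows are assumed to be dicts keyed by the field names of this table.

```python
from collections import Counter


def pageviews_by_wiki(rows, category, category_scope):
    """Sum pageviews per wiki for one category and one scope."""
    totals = Counter()
    for r in rows:
        # Filter first: mixing categories or scopes would double count pages.
        if r["category"] == category and r["category_scope"] == category_scope:
            totals[r["wiki"]] += int(r["pageview_count"])
    return totals
```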

Pageviews by media file

Field Type Description
media_file string The name of the media file this row refers to. Coincides with the page title of the media file page in Commons. URL version (with underscores).
categories list<string> The category names that the media file is directly associated with.
primary_categories list<string> The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there should be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories.
wiki string The canonical name of the viewed wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature the media file at least once will appear here.
page_title string The title of the viewed (namespace=0) page. URL version (with underscores). Only (namespace=0) pages featuring the media file will appear here.
pageview_count long Aggregated pageview count for (namespace=0) pages featuring the media file. Rows with pageview_count=0 are omitted.
month string The month for which we aggregate the data (YYYY-MM).

Notes:

  • This dataset aggregates pageview counts for (namespace=0) wiki pages that include the specified media files.
  • Each media file has one row for each page that includes it. Each page has the corresponding pageview count.
  • You can aggregate the pageview_count value only across the wiki, page_title, and month dimensions. All queries to this table should always filter or break down by media_file.
  • Pageviews to the Main page are not counted.
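As an example of a query that respects the notes above, the sketch below fixes one media file and ranks the pages that feature it by total pageviews, summing across months. The function name `top_pages` and its `limit` parameter are hypothetical, and rows are assumed to be dicts keyed by the field names of this table.

```python
from collections import Counter


def top_pages(rows, media_file, limit=10):
    """Return the most viewed (wiki, page_title) pairs for one media file."""
    totals = Counter()
    for r in rows:
        if r["media_file"] == media_file:
            # Summing across months is allowed; summing across files is not.
            totals[(r["wiki"], r["page_title"])] += int(r["pageview_count"])
    return totals.most_common(limit)
```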

Edits

Field Type Description
user_name string The user name of the user who performed the edit. This is resolved from the actor table’s actor_name. If no actor is found, it is set to ‘anonymous’. If it has been suppressed, it is set to ‘redacted’.
edit_type string Either “create” (for the first revision of a media file page), or “update” (for all other revisions of the media file page).
media_file string The name of the edited media file. Coincides with the page title of the media file page in Commons. URL version (with underscores).
categories list<string> The category names that the media file is directly associated with.
primary_categories list<string> The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there should be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories.
dt timestamp The timestamp of the edit.

Notes:

  • This is an event-based dataset: each row corresponds to an edit event.
  • You can aggregate across any set of dimensions.
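Since each row is one edit event, counting rows grouped by any set of fields gives a valid aggregate. A minimal sketch, assuming rows are dicts keyed by the field names of this table; `count_edits` is a hypothetical helper name.

```python
from collections import Counter


def count_edits(rows, *dimensions):
    """Count edit events grouped by the given scalar field names.

    List fields (categories, primary_categories) are not hashable once
    split into lists; convert them to tuples before grouping by them.
    """
    return Counter(tuple(r[d] for d in dimensions) for r in rows)
```

For instance, `count_edits(rows, "user_name", "edit_type")` gives the number of creations and updates per user, and `count_edits(rows, "edit_type")` gives the overall totals per edit type.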