Commons Impact Metrics/Data Model
This page describes the data model of the Commons Impact Metrics data product. It includes detailed descriptions of all dimensions and metrics as well as the detailed schemas of the 5 base datasets stored in Hive (Iceberg) and exported as dumps. It does not include information about the Commons Impact Metrics API (AQS service) or the corresponding Cassandra tables, although some of the definitions are shared. It also does not include contextual information on the project, caveats, nor how to access the data. For find more general information about the project, see Commons Impact Metrics.
Field glossary
The following table contains detailed definitions for all fields of the Commons Impact Metrics data model. They are valid for all datasets and datastores down the data pipeline: Hive (Iceberg), Cassandra, AQS and dumps.
Field name | Description | ||||
---|---|---|---|---|---|
category | The name (or title) of a category, as it appears in the URL of the corresponding category page in Commons. For example: Images_from_Memòria_Digital_de_Catalunya is the category name of https://commons.wikimedia.org/wiki/Category:Images_from_Memòria_Digital_de_Catalunya.
This normalized version of the category name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores. | ||||
category_scope | It is either "shallow" or "deep". It always goes together with a category field and a metric count field.
| ||||
parent_categories
or categories |
It goes together with a category or a media file field. It is a list of the categories that are a direct parent to the associated field. They use the same format as the category field described above. Note: When the associated field is a category, this field is called parent_categories. When the associated field is a media file, this field is called categories. | ||||
primary_categories | It goes together with a category or a media_file field. It is a list of the categories that are a top ancestor to the associated field. They use the same format as the category field described above. Note: Top ancestors are the ones listed in the Commons Impact Metrics allow-list by definition. | ||||
media_file | The name (or title) of a media file, as it appears in the URL of the corresponding media file page in Commons. For example: A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg is the media file name of https://commons.wikimedia.org/wiki/File:A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg.
This normalized version of the media file name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores. | ||||
media_type | Indicates the type of media file. The values are directly imported from the img_media_type field in MediaWiki's image table, i.e. "BITMAP", "VIDEO", etc. | ||||
user_name | A user name, as it appears in the actor_name field of MediaWiki's actor table. It goes together with an edit event. If no actor is found for the edit, user_name is ":anonymous:". If the edit's actor has been suppressed, user_name is ":redacted:". The colon symbols are added to those special values to prevent collisions with existing user names. | ||||
edit_type | Indicates the type of edit to a media file. Either “create”, for the first revision of a media file page; or “update”, for all subsequent revisions of the media file page. | ||||
wiki | The canonical name of a wiki. It indicates the language and the family of the wiki in the format "<language>.<family>". I.e. “en.wikipedia” or “ja.wiktionary”. | ||||
page_title | The title of a wiki page, as it appears in the URL of the corresponding page in some wiki. For example: दिल्ली is the page title of https://hi.wikipedia.org/wiki/दिल्ली. This normalized version of the page title (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores. Note: This data product only reports on namespace=0 wiki pages. | ||||
year_month | For the aggregated datasets, Pageviews per Category and Pageviews per Media File, year_month indicates the month for which the data is aggregated. For the snapshot based datasets, Category Metrics Snapshot and Media File Metrics Snapshot, year_month indicates the month at the end of which the snapshot was taken. In all cases the format is "yyyy-MM", i.e. "2024-01". | ||||
dt | The timestamp of an edit. It has different formats depending on the datastore. In Hive (Iceberg) it is of type TIMESTAMP (granularity in seconds). In the dumps is has the format "yyyy-MM-dd'T'HH:mm:ss". It is not reported via AQS or dumps. | ||||
pageview_count | This metric is always associated with a given Commons media file, or a given Commons category. When it's associated with a media file, it counts the number of pageviews to wiki pages containing that media file. When it's associated with a category, it counts the number of pageviews to wiki pages containing any media file belonging to that category. More specifically, any media file directly associated with the category, if category_scope="shallow"; and any media file within the whole associated category tree, if category_scope="deep".
The pageview_count metric can be broken down by wiki, by page_title and by year_month. It is also an additive metric, so it can be aggregated across those same 3 dimensions. For instance, if the pageview_count metric value for January is P1 and for February is P2, then the combined value since start of January until end of February is P1+P2. The pageview_count metric has a known caveat in the context of this project: monthly drift. Read more about monthly drift. | ||||
media_file_count | This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category. | ||||
media_file_count_deep | This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree. Meaning the media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.). | ||||
used_media_file_count | This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category, which appear in at least 1 wiki page. | ||||
used_media_file_count_deep | This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree, which appear in at least 1 wiki page. Meaning the used media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.). | ||||
leveraging_wiki_count | This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wikis having at least 1 page that features that media file. When associated with a category, it counts the number of wikis having at least 1 page that features any media file directly belonging to the associated category. | ||||
leveraging_wiki_count_deep | This metric is always associated with a given Commons category. It counts the number of wikis having at least 1 page that features any media file belonging to the associated category tree (the category itself, its children, grandchildren, etc.). | ||||
leveraging_page_count | This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wiki pages (namespace=0) featuring that media file. When associated with a category, it counts the number of wiki pages (namespace=0) featuring any media file directly belonging to the associated category. | ||||
leveraging_page_count_deep | This metric is always associated with a given Commons category. It counts the number of wiki pages (namespace=0) featuring any media file belonging to the associated category tree (the category itself, its children, its grandchildren, etc.). | ||||
edit_count | This metric is always associated with a given wiki user name, or a given Commons category. When associated with a user name, it counts the number of edits performed by the associated user, to media files belonging to the allow-listed Commons categories. When associated with a category, it counts the number of edits to media files belonging to the associated category. More specifically, edits to media files directly associated with the category, if category_scope="shallow"; and edits to media files within the whole associated category tree, if category_scope="deep". |
Base datasets
The Commons Impact Metrics data product consists of 5 base datasets stored in Hive (Iceberg), and also exported in the form of dumps. Two of them (Category metrics snapshot and Media file metrics snapshot) are snapshot-based. Two of them (Pageviews per category monthly and Pageviews per media file monthly) are aggregated cubes. The last one (Edits) is event-based.
Snapshot-based datasets
Snapshot-based datasets capture the state of things at a given point in time. In Commons Impact Metrics, the snapshot is taken at the end of the month in question. The metrics included in snapshot-based datasets are absolute counts since the beginning of time, up to the instant of the snapshot. For instance, the number of media files contained inside a Commons category. Note that such metrics can not be aggregated over time.
Category metrics snapshot
This dataset stores metrics about Commons categories. Each row corresponds to a category (for a given snapshot). It can be an allow-listed (primary) category or one of its subcategories. Only allow-listed (primary) categories report on "deep" metrics, all categories report on "shallow" metrics. The metric values (int) are not aggregatable. All queries to this table should always filter or breakdown by category and year_month.
Schema
Field | Type | Description |
category | string | The name of the category this row refers to. |
parent_categories | list<string> | The immediate ancestor category names of this row's category. |
primary_categories | list<string> | The top ancestor category names of this row’s category. |
media_file_count | int | The number of media files contained in this category. |
media_file_count_deep | int | The number of media files contained in this category tree. Only available for primary allow-listed categories. |
used_media_file_count | int | The number of media files from this category featured in at least one wiki page. |
used_media_file_count_deep | int | The number of media files from this category tree featured in at least one wiki page. Only available for primary allow-listed categories. |
leveraging_wiki_count | int | The number of wikis featuring at least one of this category’s media files. |
leveraging_wiki_count_deep | int | The number of wikis featuring at least one of this category tree’s media files. Only available for primary allow-listed categories. |
leveraging_page_count | int | The number of pages featuring at least one of this category’s media files. |
leveraging_page_count_deep | int | The number of pages featuring at least one of this category tree’s media files. Only available for primary allow-listed categories. |
year_month | string | The month after which the snapshot was taken (YYYY-MM). |
Sample queries
-- Get shallow metrics about a given category.
SELECT
media_file_count,
used_media_file_count,
leveraging_wiki_count,
leveraging_page_count
FROM category_metrics_snapshot
WHERE
year_month = "2024-01" AND
category = "My_category"
;
-- Get a time series that shows the usage of a category tree over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
year_month,
used_media_file_count_deep
FROM category_metrics_snapshot
WHERE catgegory = "My_allow_listed_category"
GROUP BY year_month
ORDER BY year_month ASC
;
-- Get the category graph for an allow-listed category tree,
-- represented as edges of the form (parent_category, child_category).
SELECT
EXPLODE(parent_categories) AS parent_category,
category AS child_category
FROM category_metrics_snapshot
WHERE
year_month = "2024-01" AND
ARRAY_CONTAINS(primary_categories, "My_allow_listed_category")
;
Media file metrics snapshot
This dataset stores metrics about Commons media files. Each row corresponds to a media file (for a given snapshot). To reduce the size of the data, media files that are not featured in any wiki page do not appear in this dataset. The metric values (int) are not aggregatable. All queries to this table should always filter or breakdown by media_file and year_month.
Schema
Field | Type | Description |
media_file | string | The name of the media file this row refers to. |
media_type | string | The media type of the media file. |
categories | list<string> | The immediate ancestor category names of this row's media file. |
primary_categories | list<string> | The top ancestor category names of this row’s media file. |
leveraging_wiki_count | long | The number of wikis featuring this row's media file. |
leveraging_page_count | long | The number of wiki pages (across all wikis) featuring this row's media file. |
year_month | string | The month after which the snapshot was taken (YYYY-MM). |
Sample queries
-- Get metrics about a given media file.
SELECT
leveraging_wiki_count,
leveraging_page_count
FROM media_file_metrics_snapshot
WHERE
year_month = "2024-01" AND
media_file = "My_media_file.jpg"
;
-- Get a time series that shows the leverage of a media file over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
year_month,
leveraging_page_count
FROM media_file_metrics_snapshot
WHERE media_file = "My_media_file.jpg"
GROUP BY year_month
ORDER BY year_month ASC
;
Aggregated cubes
Aggregated cube datasets store metric aggregations broken down by a set of dimensions. Ideally, the metrics are additive, and you can slice and dice the dataset aggregating across all dimensions including time. In Commons Impact Metrics, the aggregation is done at a monthly granularity, for data size reasons. And the only metric is pageview_count (see its definition in the metrics table above).
Pageviews per category monthly
This dataset stores pageview counts for Commons categories, as explained in the metric definitions section above. The pageview_counts are dimensioned by wiki, page_title and year_month. You can aggregate pageview_counts only across those 3 dimensions. You can not aggregate across the category dimension. Pageviews to a wiki page can be attributed to more than one category, when that page contains media files from several categories. So, aggregating across the category dimension will produce duplicate counts. You can't either aggregate across the category_scope dimension, since the pageview_count metric has different meanings for category_scope="shallow" vs. category_scope="deep" (read more in the metric definitions section above). So, you should always filter or breakdown by both category and category_scope. Only allow-listed (primary) categories report pageviews for category_scope="deep". All categories report pageviews for category_scope="shallow". Pageviews to a wiki's Main page are not counted. Rows with pageview_count=0 are omitted.
Schema
Field | Type | Description |
category | string | The name of the category this row refers to. |
category_scope | string | The scope of the category; either "shallow" or "deep". |
primary_categories | list<string> | The top ancestor category names of this row’s media file. |
wiki | string | The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here. |
page_title | string | The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here. |
pageview_count | long | Aggregated pageview count. |
year_month | string | The month for which we aggregate the data (YYYY-MM). |
Sample queries
-- Get all-time aggregated pageview counts for a given category tree.
SELECT SUM(pageview_count) AS pageview_count
FROM pageviews_per_category_monthly
WHERE
category = "My_primary_category" AND
category_scope = "deep"
;
-- Get a time series that shows the pageviews of a category over time.
SELECT
year_month,
pageview_count
FROM pageviews_per_category_monthly
WHERE
category = "My_category" AND
category_scope = "shallow"
GROUP BY year_month
ORDER BY year_month ASC
;
-- Get a 2023 rank of the wiki pages with most pageviews
-- containing media files from a given category tree.
SELECT
wiki,
page_title,
SUM(pageview_count) AS pageview_count
FROM pageviews_per_category_monthly
WHERE
year_month BETWEEN "2023-01" AND "2023-12" AND
category = "My_primary_category" AND
category_scope = "deep"
GROUP BY
wiki,
page_title
ORDER BY pageview_count DESC
LIMIT 100
;
Pageviews per media file monthly
This dataset stores pageview counts for Commons media files, as explained in the metric definitions section above. The pageview_counts are dimensioned by wiki, page_title and year_month. You can aggregate pageview_counts only across those 3 dimensions. You can not aggregate across the media_file dimension. Pageviews to a wiki page can be attributed to more than one media file, when that page contains several media files. So, aggregating across the media_file dimension will produce duplicate counts. So, you should always filter or breakdown by media_file. Pageviews to a wiki's Main page are not counted. Rows with pageview_count=0 are omitted.
Schema
Field | Type | Description |
media_file | string | The name of the media file this row refers to. |
categories | list<string> | The parent category names of this row's media file. |
primary_categories | list<string> | The top ancestor category names of this row’s media file. |
wiki | string | The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here. |
page_title | string | The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here. |
pageview_count | long | Aggregated pageview count. |
year_month | string | The month for which we aggregate the data (YYYY-MM). |
Sample queries
-- Get all-time aggregated pageview counts for a given media file.
SELECT SUM(pageview_count) AS pageview_count
FROM pageviews_per_media_file_monthly
WHERE media_file = "My_media_file.jpg"
;
-- Get a time series that shows the pageviews of a media file over time.
SELECT
year_month,
pageview_count
FROM pageviews_per_media_file_monthly
WHERE media_file = "My_media_file.jpg"
GROUP BY year_month
ORDER BY year_month ASC
;
-- Get a 2023 rank of the wiki pages with most pageviews
-- containing a given media file.
SELECT
wiki,
page_title,
SUM(pageview_count) AS pageview_count
FROM pageviews_per_media_file_monthly
WHERE
year_month BETWEEN "2023-01" AND "2023-12" AND
media_file = "My_media_file"
GROUP BY
wiki,
page_title
ORDER BY pageview_count DESC
LIMIT 100
;
Event-based datasets
In event-based datasets each row represents an event, something that happened in the context of the dataset. Event-based datasets always have a field containing the timestamp of the event, with very fine granularity, i.e. milliseconds. Usually, they don't have any metric value.
Edits
This is an event-based dataset. Each row corresponds to an edit event performed on a Commons media file belonging to an allow-listed category tree. The dataset does not contain any metric per se, but you can aggregate row counts across any set of dimensions.
Schema
Field | Type | Description |
user_name | string | The user name of the user who performed the edit. |
edit_type | string | Either “create” or “update”. |
media_file | string | The name of the edited media file. |
categories | list<string> | The (parent) category names that the media file is directly associated with. |
primary_categories | list<string> | The top ancestor category names of this row’s media file. |
dt | timestamp | The timestamp of the edit. |
Sample queries
-- Get all-time aggregated edit counts for a given allow-listed category tree.
SELECT SUM(edit_count) AS edit_count
FROM edits
WHERE ARRAY_CONTAINS(primary_categories, "My_primary_category")
;
-- Get a time series that shows the edits to a media file over time.
SELECT
month(dt) AS year_month,
edit_count
FROM edits
WHERE media_file = "My_media_file.jpg"
GROUP BY month(dt)
ORDER BY year_month ASC
;
-- Get a 2023 rank of the user names with most edits
-- to media files belonging to a given category.
SELECT
user_name,
SUM(edit_count) AS edit_count
FROM edits
WHERE
dt BETWEEN TO_TIMESTAMP("2023-01") AND TO_TIMESTAMP("2023-12-31 23:59:59.999") AND
ARRAY_CONTAINS(categories, "My_category")
GROUP BY user_name
ORDER BY edit_count DESC
LIMIT 100
;