Commons Impact Metrics

From Wikitech

The Commons Impact Metrics data product is a collection of datasets designed to provide insight into the impact of Community contributions to Commons. So far, the data is focused on media files uploaded by, and categories belonging to, GLAM actors (affiliates, projects, individual contributors, etc.).

Project rationale

There has been a long-standing Community request for a data product that would give insight into the impact of Commons contributions. While the WMF has not been able to address the request, the Community has created a number of tools that compute such data and serve it via visual web applications, such as GLAM Wiki Dashboard, BaGLAMa2 and GLAMorgan. In the couple of years before this project, the Community reported difficulties maintaining these tools for several reasons, and the tools have become less useful to the Community due to data outages, inconsistency between tools and the complexity of the calculations. This project aims to address those issues by delivering a data product that:

  • Covers most of the use cases addressed by the tools mentioned above.
  • Is robust, not subject to data outages.
  • Is standardized and can be used consistently across a range of tools.
  • Provides pre-calculated data, easy to query and manage.

Category allow-list

For computational and data-size reasons, we have scoped this data product to report only on a curated list of GLAM primary categories. Each of those categories belongs to a GLAM institution, event, contributor, project, etc. The data product also reports on all sub-categories under the listed primary categories. The initial allow-list was put together from the existing tools mentioned above, but it is open to additions. The current allow-list lives here. Note that this data product is still in the beta stage, and the allow-list can change.

Max depth

In the Commons category graph, most sub-graphs are interconnected. You can start in a sub-graph about a given museum in a given country and end up in a sub-graph about a project on the other side of the world. In practice, if the allow-list mentioned above is big enough, navigating through the listed sub-graphs without limits might lead to traversing the whole Commons category graph. Because we want to report on GLAM-actor-specific sub-graphs, we impose a limit on how deep an allow-listed category tree will be considered. Currently the max depth is 7. This means that this data product will only report on sub-categories that are at most 7 steps away from the allow-listed primary category. The data will also report on all media files directly associated with any of those categories and sub-categories.
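The depth limit described above can be illustrated with a breadth-first traversal that stops descending once it reaches the maximum distance from the primary category. This is only a minimal sketch, not the actual pipeline code: the function name categories_within_depth and the dictionary-based graph representation are assumptions made for illustration, and the toy category names are hypothetical.

```python
from collections import deque

MAX_DEPTH = 7  # the current limit stated in this section


def categories_within_depth(subcategories, primary, max_depth=MAX_DEPTH):
    """Return {category: depth} for categories at most max_depth steps below primary.

    subcategories is a hypothetical mapping: category name -> list of direct
    sub-category names. It is an illustrative stand-in for the category graph.
    """
    depths = {primary: 0}
    queue = deque([primary])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not descend past the depth limit
        for child in subcategories.get(current, ()):
            if child not in depths:  # already-seen categories are skipped (handles cycles)
                depths[child] = depths[current] + 1
                queue.append(child)
    return depths


# Toy graph: a chain C0 -> C1 -> ... -> C9, plus a cycle from C2 back to C0.
toy_graph = {f"C{i}": [f"C{i + 1}"] for i in range(9)}
toy_graph["C2"].append("C0")

reachable = categories_within_depth(toy_graph, "C0")
print(sorted(reachable))  # C0 through C7; C8 and C9 are beyond the limit
```

Because breadth-first search visits categories in order of distance, the first depth recorded for a category is its shortest distance from the primary category, so a sub-category reachable through both a long and a short path is kept if the short path is within the limit.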

Aggregated and released monthly

For data-size reasons, we currently aggregate the data at a monthly granularity. One of the design criteria of this product is that it should be manageable for Community members. Usually, Community members do not have access to a cluster to run queries on top of hundreds of gigabytes of data. Thus we reduced the granularity to monthly to make the data lighter and more manageable. In addition, because this dataset depends on data that we currently only ingest at a monthly pace, we can only offer a monthly release schedule.
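To show what reducing the granularity to monthly means in practice, here is a small sketch that rolls finer-grained rows up into one total per category and month. The row layout (category, day, count) and the function name aggregate_monthly are assumptions for illustration only; they do not describe the actual schema or job used to build the datasets.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(daily_rows):
    """Sum (category, day, count) rows into {(category, 'YYYY-MM'): total}."""
    totals = defaultdict(int)
    for category, day, count in daily_rows:
        month = day.strftime("%Y-%m")  # drop the day, keep year and month
        totals[(category, month)] += count
    return dict(totals)


# Hypothetical input rows for a single made-up category.
rows = [
    ("Example Museum", date(2024, 1, 3), 10),
    ("Example Museum", date(2024, 1, 20), 5),
    ("Example Museum", date(2024, 2, 1), 7),
]

print(aggregate_monthly(rows))
# {('Example Museum', '2024-01'): 15, ('Example Museum', '2024-02'): 7}
```

Three daily rows collapse into two monthly rows here; over a full dataset the same reduction is what keeps the published files small enough to handle without a cluster.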

How to access the data

The data is available in the form of dumps. Read more about them here. We also plan to publish the data via the Analytics Query Service (AQS) API by the end of Q4 FY2023-2024.