Data releases
This page describes the process involved in a formal open data release by the Wikimedia Research team. While the process and guidelines described on this page are not prescriptive for other teams at the Wikimedia Foundation, we encourage anyone involved in publishing static datasets for research purposes to follow these guidelines (and help us improve them).
Definition
- We define a formal data release (or data publication) as the process of publishing a static dataset, along with metadata and a persistent identifier, through an open data repository.
- Optional steps in a formal data release may include: on-wiki documentation; a companion "dataset paper"; a notebook exploring the dataset; a blog post. See below for more examples.
- API releases and datasets not meant for research (and released primarily for operational purposes) typically do not fall within the scope of a formal data release.
- These guidelines also apply to researchers entering a formal collaboration with Wikimedia Foundation staff, and publishing datasets as part of the open data requirements of our Open Access policy.
Process
Conduct a privacy review
Prior to releasing any data other than already-public datasets, you must conduct a thorough privacy review: ask the appropriate teams (Legal and Security) to review the proposed dataset, as well as the aggregation and anonymization strategy (if applicable). All datasets published by the Wikimedia Foundation are subject to our privacy policy and data retention guidelines.
Determine the appropriate license
Open datasets published by the Wikimedia Foundation will typically use CC0 as a default license/dedication. Exceptions include cases where contributions from Wikimedia editors that require attribution are included. Please consult with the Foundation's Legal team to determine the appropriate licensing scheme.
Prepare the data for publication
Prepare the dataset for publication in a suitable open format. Typical formats for open datasets include tab-separated values (TSV), comma-separated values (CSV), JSON, newline-delimited JSON, and RDF.
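As a minimal sketch of what "a suitable open format" looks like in practice, the snippet below serializes a few hypothetical records (the field names and values are purely illustrative) as both TSV and newline-delimited JSON using only the Python standard library:

```python
import csv
import io
import json

# Hypothetical sample records; a real release would export these
# from the underlying data source instead.
records = [
    {"page_title": "Earth", "views": 12345},
    {"page_title": "Moon", "views": 6789},
]

# Tab-separated values: one header row, then one record per line.
tsv_buf = io.StringIO()
writer = csv.DictWriter(tsv_buf, fieldnames=["page_title", "views"], delimiter="\t")
writer.writeheader()
writer.writerows(records)
tsv_text = tsv_buf.getvalue()

# Newline-delimited JSON: one self-contained JSON object per line,
# which makes the file easy to stream and process in parallel.
ndjson_text = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

print(tsv_text)
print(ndjson_text)
```

Both formats are plain text, so consumers can inspect and parse them without specialized tooling.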
Upload the dataset to a server
For large datasets (and for redundancy), it is advisable to store a copy of the dataset on a Wikimedia-maintained public server, for example: https://analytics.wikimedia.org/published/datasets/archive/public-datasets/ Compress the data as appropriate before uploading it.
See Ad-Hoc Datasets documentation for how to publish data in this space.
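The compression step can be as simple as a gzip round-trip; the sketch below uses an illustrative in-memory payload (a real release would read the exported dataset file) and checks that decompression recovers the original bytes:

```python
import gzip

# Hypothetical payload; in practice this would be the bytes of
# your exported dataset file.
payload = b"page_title\tviews\nEarth\t12345\nMoon\t6789\n"

# Compress before uploading; .gz is widely supported by download tools.
compressed = gzip.compress(payload)

# Sanity check: decompression recovers the original bytes exactly.
assert gzip.decompress(compressed) == payload

print(len(payload), "->", len(compressed), "bytes")
```

For very large files, a streaming tool such as the `gzip` command line utility avoids holding the whole dataset in memory.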
Create a metadata entry for the dataset
To make the dataset persistently discoverable and citable, you should create a metadata entry in an open data repository. A metadata entry also preserves the provenance of the data and identifies the organization or individual responsible for the creation and maintenance of the dataset. Popular open data repositories include Zenodo, Figshare, Dryad, Mendeley Data, and Dataverse. The Wikimedia Research team has used Figshare for its own data releases over the years. For an example of a well-documented dataset hosted on Figshare, check out:
- Halfaker, Aaron; Mansurov, Bahodir; Redi, Miriam; Taraborelli, Dario (2018): Citations with identifiers in Wikipedia. figshare. https://doi.org/10.6084/m9.figshare.1299540
A well-formed metadata entry for a dataset typically includes:
- The names of the authors (if applicable). Make sure to register an ORCID if you don't have one, so that authorship and provenance are persistent and unambiguous
- A descriptive title
- A documentation of the format and schema of the dataset
- A persistent Digital Object Identifier (assigned by the repository upon publication)
- A license statement
- Additional references about the dataset
- Keywords and categories describing the dataset (for discoverability)
- A link to the server where the resources included in the dataset are hosted (if applicable)
Open data repositories allow creating "metadata only" entries (where the data is fully hosted on a different server) or "regular entries" (where the entry includes a copy of the data). Metadata entries on open repositories are version controlled, meaning that different versions of a dataset can be uniquely referenced and have their own documentation and resources.
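A well-formed metadata entry covering the fields above might look like the following sketch. The field names here are hypothetical and follow no particular repository's schema; each repository (Figshare, Zenodo, etc.) has its own submission form or API:

```python
import json

# Illustrative metadata record mirroring the fields listed above.
# All values are made up for the example; field names are hypothetical.
metadata = {
    "title": "Example Wikipedia pageview dataset",
    "authors": [{"name": "Doe, Jane", "orcid": "0000-0000-0000-0000"}],
    "description": "Monthly aggregated pageviews; TSV, one row per page.",
    "doi": None,  # assigned by the repository upon publication
    "license": "CC0-1.0",
    "keywords": ["Wikipedia", "pageviews", "open data"],
    "references": ["https://meta.wikimedia.org/wiki/Research:Index"],
    # Link to the hosting server, for a metadata-only entry.
    "data_url": "https://analytics.wikimedia.org/published/datasets/",
}

print(json.dumps(metadata, indent=2))
```

Keeping a machine-readable copy of the metadata alongside the dataset itself makes it easier to re-submit or update the entry when a new version is released.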
Additional documentation
Further documentation of a dataset may include any of the following:
- on-wiki documentation: see for example Research:Wikipedia clickstream (as a companion to https://doi.org/10.6084/m9.figshare.1305770)
- a dataset paper: see for example TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia (as a companion to https://doi.org/10.5281/zenodo.789289)
- a notebook exploring the dataset: see for example ClickStream - Getting Started - Explorations (as a companion to https://doi.org/10.6084/m9.figshare.1305770)
- a blog post: see for example What are the ten most cited sources on Wikipedia? Let’s ask the data (as a companion to https://doi.org/10.6084/m9.figshare.1299540).