User:Triciaburmeister/Sandbox/Data platform/Discover data

This page is currently a draft.
More information and discussion about changes to this draft on the talk page.

This page provides links to data documentation for private and public Wikimedia data sources. Its primary audience is WMF data analysts, product teams, and researchers who have an official non-disclosure agreement with the Wikimedia Foundation.

Private data requires production data access. It includes datasets in WMF's Data Lake: a large, analytics-oriented repository of data about Wikimedia projects.
A selection of public data sources are linked here, but public Wikimedia data is described more fully at meta:Research:Data.

Traffic data

Analytics data about wiki pageviews and site usage.

Private traffic data

Most Data Lake traffic datasets are updated at hourly granularity, with 2-3 hours lag behind real-time. This data includes:

Full dataset list at Data Lake/Traffic.

View datasets tagged with "traffic" in DataHub (requires a developer account)

Public traffic data

APIs:

Specialized datasets:

Differentially private pageviews

Dumps:

Dashboards:

Content data

Datasets that contain full content of revisions for Wikimedia wikis.

Private content data

mediawiki_wikitext_current: wikitext last-revision per-page of Wikimedia wikis.
mediawiki_wikitext_history: full content of all revisions, past and present, from Wikimedia wikis.
wikidata_entity:conversion of the Wikidata entities JSON dumps in parquet.
wikidata_item_page_link: links between a Wikidata item and its related Wikipedia pages in various languages.

You can access MediaWiki replica databases through Wikimedia Cloud Services.

Public content data

APIs:

Internal WMF users can query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.

Dumps:

Specialized datasets:

Content Gaps

MediaWiki database tables:

Text

Contributing and edits data

Data about wiki revisions, pages, and users. Includes data about editors and their characteristics.

Private edits data

Edits datasets are generated as monthly snapshots, not continuously updated. This data includes:

MediaWiki_history: Fully denormalized dataset with user, page and revision data
Raw, unprocessed copies of MediaWiki database tables, bundled to facilitate cross-wiki queries.

Full dataset list at Data Lake/Edits.

Private contributors data

Private datasets about contributors or editors includes:

Geoeditors: Counts of editors by project by country

View datasets tagged with "editors" in DataHub (requires a developer account))

Public edits data

APIs:

Internal WMF users can query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.

Dumps:

Mediawiki_history

MediaWiki database tables:

Revision table

Dashboards:

Public contributors data

APIs:

Dumps:

Specialized datasets:

Differentially private geoeditors (hourly/monthly)

MediaWiki database tables:

Dashboards:

Instrumentation and events data

View and query events data

Through the Event Platform and Metrics Platform, you can create and deploy your own instruments to collect event data.

Events are ingested into event and event_sanitized databases in the Data Lake.

The Hive table name is a normalized version of the stream name.
The eventdatabase stores original (unsanitized) events within a 90 day retention period.
The event_sanitized database is an archive of sanitized events, beyond the 90 day retention period.
- Sanitized event data is processed per WMF’s Privacy Policy and Data Retention Guidelines.

After the data becomes available, you can access it with standard query tools and create dashboards based on the data. See the Instrumentation tutorial for how to consume events directly from Kafka or through the internal EventStreams instance.

How to query private data

Visit Analyze data to learn how to run queries and generate visualizations using WMF private datasets and analysis tools.

Report data issues

Data Issue reports