User:Triciaburmeister/Sandbox/Data platform/Discover data reformatted
More information and discussion about changes to this draft on the talk page.
This page provides links to data documentation for private and public Wikimedia data sources. Its primary audience is WMF data analysts, product teams, and researchers who have an official non-disclosure agreement with the Wikimedia Foundation.
- Private data requires production data access. It includes datasets in WMF's Data Lake: a large, analytics-oriented repository of data about Wikimedia projects.
- A selection of public data sources is linked here; public Wikimedia data is described more fully at meta:Research:Data.
Traffic data
Analytics data about wiki pageviews and site usage.
Private traffic data
Most Data Lake traffic datasets are updated hourly, with a lag of 2-3 hours behind real time. This data includes:
Full dataset list at Data Lake/Traffic.
View datasets tagged with "traffic" in DataHub (requires a developer account)
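Because these datasets trail real time by a few hours, a query for "the latest hour" should target an hour that has probably finished loading. The following is a minimal Python sketch of that calculation, not an official tool. It assumes a conservative 3-hour lag, assumes the table is partitioned by year, month, day, and hour columns, and uses a hypothetical table name; check the dataset's own documentation for its actual partition layout.

```python
from datetime import datetime, timedelta, timezone


def latest_complete_hour(now: datetime, lag_hours: int = 3) -> datetime:
    """Return the start of the most recent hour assumed to be fully loaded.

    `now` must be timezone-aware. The default lag of 3 hours is an assumption
    based on the stated 2-3 hour delay.
    """
    floored = now.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return floored - timedelta(hours=lag_hours)


def hourly_partition_filter(hour: datetime) -> str:
    """Build a WHERE-clause fragment for year/month/day/hour partitions (assumed layout)."""
    return (
        f"year = {hour.year} AND month = {hour.month} "
        f"AND day = {hour.day} AND hour = {hour.hour}"
    )


def build_row_count_query(table: str, hour: datetime) -> str:
    """Compose a simple row-count query restricted to one hourly partition."""
    return f"SELECT COUNT(*) AS row_count FROM {table} WHERE {hourly_partition_filter(hour)}"


if __name__ == "__main__":
    # "some_db.some_hourly_table" is a placeholder, not a real dataset name.
    target = latest_complete_hour(datetime.now(timezone.utc))
    print(build_row_count_query("some_db.some_hourly_table", target))
```

Filtering on partition columns keeps the query from scanning the whole table, which matters for large hourly datasets.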
Public traffic data
APIs:
Specialized datasets:
Dashboards:
Content data
Datasets that contain full content of revisions for Wikimedia wikis.
Private content data
- mediawiki_wikitext_current: wikitext of the latest revision of each page on Wikimedia wikis.
- mediawiki_wikitext_history: full content of all revisions, past and present, from Wikimedia wikis.
- wikidata_entity: conversion of the Wikidata entities JSON dumps to Parquet.
- wikidata_item_page_link: links between a Wikidata item and its related Wikipedia pages in various languages.
Public content data
APIs:
Specialized datasets:
MediaWiki database tables:
Contributing and edits data
Data about wiki revisions, pages, and users. Includes data about editors and their characteristics.
Private edits data
Edits datasets are generated as monthly snapshots, not continuously updated. This data includes:
- MediaWiki_history: Fully denormalized dataset with user, page and revision data
- Raw, unprocessed copies of MediaWiki database tables, bundled to facilitate cross-wiki queries.
Full dataset list at Data Lake/Edits.
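Since edits datasets arrive as monthly snapshots, queries typically pin one snapshot rather than reading "current" data. The sketch below shows one way to derive a snapshot label in Python. It assumes snapshots are labeled 'YYYY-MM' for the last full month they cover and are exposed through a partition column named snapshot; both are assumptions to verify against the dataset's documentation, and the table name is a placeholder.

```python
from datetime import date


def latest_snapshot_label(today: date) -> str:
    """Return the label of the last full calendar month before `today`, as 'YYYY-MM'.

    The 'YYYY-MM' label format is an assumption, not confirmed by this page.
    """
    year, month = today.year, today.month - 1
    if month == 0:  # January rolls back to December of the previous year
        year, month = year - 1, 12
    return f"{year:04d}-{month:02d}"


def snapshot_row_count_query(table: str, snapshot: str) -> str:
    """Compose a row-count query restricted to one snapshot partition (assumed column name)."""
    return f"SELECT COUNT(*) AS row_count FROM {table} WHERE snapshot = '{snapshot}'"


if __name__ == "__main__":
    # "some_db.some_edits_table" is a placeholder, not a real dataset name.
    label = latest_snapshot_label(date.today())
    print(snapshot_row_count_query("some_db.some_edits_table", label))
```

A snapshot for the previous month may not be published in the first days of a new month, so it is worth checking which snapshots actually exist before relying on the computed label.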
Public edits data
APIs:
MediaWiki database tables:
Dashboards:
Private contributors data
Private datasets about contributors or editors include:
- Geoeditors: Counts of editors by project by country
View datasets tagged with "editors" in DataHub (requires a developer account)
Public contributors data
Instrumentation data
EventLogging enables Wikimedia data analysts and researchers to track data that MediaWiki doesn't normally log. Through the Event Platform and Metrics Platform you can create your own instruments using event data like:
- Logs of changes to user preferences
- A/B testing data
- Clicktracking data
These datasets are stored in the event and event_sanitized Hive databases, subject to HDFS access control. To learn how to access these data sources to create your own instruments, visit Collect data.
How to query private data
Visit Analyze data to learn how to run queries and generate visualizations using WMF private datasets and analysis tools.