Data Engineering/Ownership

What We Own
We are responsible for a number of systems, datasets, and pipelines.
Systems
We maintain the big data platform, including the data lake, the ingestion and processing pipelines, and a number of systems for exploring and visualizing the data.
Please see Data Engineering/Systems for a more comprehensive list of the systems we maintain.

System name and link | Type | Accessibility
---|---|---
Airflow | Workflow Job Scheduler | Private
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data (see the example request below) | Public
Ceph | Software-defined storage, serving block and object storage | Private
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop cluster | Private
Datahub | Data Catalog | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | Public MediaWiki event streams | Public
Hue | Web interface for Hive, Oozie, and other Cluster services | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job Scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community dashboard with high-level metrics | Public
Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private
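AQS is the one fully public API in the table above. Below is a minimal sketch of requesting per-article pageview counts through the public Wikimedia REST API; the endpoint path, the example article, and the date range are illustrative assumptions rather than an authoritative reference, so check the AQS documentation for the exact contract.

```python
import requests

# Hypothetical example: per-article pageviews from AQS via the public
# Wikimedia REST API. The path segments and values below are assumptions
# for illustration; consult the AQS documentation for the real reference.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
url = "/".join([
    BASE,
    "en.wikipedia",       # project
    "all-access",         # access method (desktop, mobile-web, ...)
    "user",               # agent type (user, spider, ...)
    "Data_engineering",   # article title (URL-encoded if needed)
    "daily",              # granularity
    "20240101",           # start date (YYYYMMDD)
    "20240107",           # end date (YYYYMMDD)
])

# Wikimedia APIs ask clients to send a descriptive User-Agent.
resp = requests.get(url, headers={"User-Agent": "ownership-docs-example/0.1"})
resp.raise_for_status()

for item in resp.json().get("items", []):
    print(item["timestamp"], item["views"])
```
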
The list of scheduled manual maintenance tasks is documented at Analytics/Systems/Manual maintenance.
Datasets
Please also refer to Analytics/Data Lake for more links to reference material.
- Webrequests [Traffic logs] and derived tables, including:
  - Pageviews [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
  - Inter-language [Traffic between different languages of the same project family]
  - Unique Devices [Estimates of unique devices at the project or project-family level]
- MediaWiki raw databases
- EventLogging (in the event database in Hive; see the query sketch after this list)
- Edits history, Page history, User history
- Other reports
- Clickstream
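Most of these datasets are queried from the analytics clients (the stat100X hosts in the systems table) rather than directly. Below is a minimal sketch of reading event data from the Hive event database with Wmfdata-Python; the table name, field names, and partition columns are assumptions for illustration, and the `wmf.spark.run` call reflects my understanding of the package's API rather than a verbatim reference.

```python
import wmfdata as wmf

# Hypothetical query against a table in the Hive `event` database.
# `navigationtiming`, `meta.dt`, and the year/month partitions are
# illustrative assumptions; substitute a real event table and schema.
query = """
SELECT
    substr(meta.dt, 1, 10) AS day,
    COUNT(*)               AS events
FROM event.navigationtiming
WHERE year = 2024 AND month = 1
GROUP BY substr(meta.dt, 1, 10)
ORDER BY day
"""

# wmfdata is assumed to return a pandas DataFrame from its query
# backends (spark, hive, presto); here the Spark backend is used.
df = wmf.spark.run(query)
print(df.head())
```
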
Pipelines
Please also refer to Analytics/Systems/Cluster for more reference information about the pipelines we manage.
- Traffic data
  - Webrequest, pageviews, and unique devices
- Edits data
  - Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
  - Wikitext (latest & historical) and wikidata-entities
- Events data
  - EventLogging, EventBus and event streams data (raw, refined, sanitized); see the consumption sketch after this list
- ORES scores
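The events pipeline above also feeds the public EventStreams service from the systems table. Below is a minimal sketch of tailing one stream over server-sent events; the stream name and URL follow the public endpoint as I understand it, and the hand-rolled SSE parsing is a simplification (a dedicated SSE client library would be more robust for multi-line payloads and reconnection).

```python
import json
import requests

# Hypothetical example: tail the public `recentchange` stream from
# EventStreams. Adjust the stream name to follow a different topic.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(URL, stream=True,
                  headers={"User-Agent": "ownership-docs-example/0.1"}) as resp:
    resp.raise_for_status()
    count = 0
    for line in resp.iter_lines(decode_unicode=True):
        # Server-sent events: payload lines are prefixed with "data: ".
        if not line or not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        print(event.get("wiki"), event.get("type"), event.get("title"))
        count += 1
        if count >= 10:   # stop after a handful of events for the demo
            break
```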