Data Platform
Wikimedia's Data Platform is a collection of systems and services that enable data producers and consumers to discover, use, and collect data to derive insights, conduct research, and build new data products. The Data Platform is primarily maintained by the Data Platform Engineering team. To contact the team, use the intake process.
Get started
The Data Platform provides access to private data and internal WMF resources, so you must have specialized data access to use it. For public, open access Wikimedia data and tools, see meta:Research:Data.
- Find datasets and documentation for WMF private data sources.
- Use SQL query engines, Jupyter notebooks, libraries, and compute resources to explore and analyze data (see the example sketch after this list).
- Create and share derivative datasets, reports, and dashboards based on existing Wikimedia data sources.
- Use the Metrics Platform to configure instruments and collect analytics data.
- Advanced users: use the Event Platform to configure and deploy event streams.
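As a rough illustration of the "explore and analyze data" item above, the sketch below shows what a Data Lake query might look like from a Jupyter notebook on an analytics client, using the Wmfdata-Python package listed in the systems table further down. The table name (wmf.pageview_hourly), its columns, and the partition values are illustrative assumptions; check DataHub or the Data Lake documentation for current schemas before running anything like this.

```python
# Minimal sketch: querying the Data Lake from a Jupyter notebook on an
# analytics client (stat100X) using the wmfdata-python package.
# The table and column names below are illustrative assumptions; verify
# them against DataHub / the Data Lake schema documentation.
import wmfdata as wmf

query = """
SELECT project, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2024 AND month = 1 AND day = 1
GROUP BY project
ORDER BY views DESC
LIMIT 10
"""

# Run the query via Spark; the result comes back as a pandas DataFrame.
top_projects = wmf.spark.run(query)
print(top_projects)
```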
Data platform infrastructure
Data platform systems and infrastructure include the data lake, ingestion and processing pipelines, and production search and query services.
Data pipelines
Information about data pipelines is currently at:
Search data and services
Overview of data platform systems
- Data Platform Technical Overview 2023
- Analytics Data Platform 2021
The following list highlights some major Data Platform systems. For more details and a full list of Data Platform system documentation pages on this wiki, see Data_Platform/Systems.
| System name and link | Type | Accessibility |
|---|---|---|
| Airflow | Workflow job scheduler | Private |
| Archiva | Repository for Java archives | Private |
| AQS - Analytics Query Service | REST API for analytics data | Public |
| Ceph | Software-defined storage serving block and object storage | Private |
| Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private |
| Cluster (Hadoop, Gobblin, Hive, Spark...) | Hadoop | Private |
| DataHub | Data catalog | Private |
| Dashiki | Framework for building dashboards | Public |
| Druid | Data storage engine optimized for exploratory analytics | Private |
| EventLogging | Ad-hoc streaming pipeline | Private |
| EventStreams | MediaWiki event streams (example sketch below) | Public |
| Kafka | Data transport and streaming system | Private |
| MariaDB | Data storage for MediaWiki replicas and EventLogging | Private |
| Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private |
| Presto | High-performance SQL query engine for big data | Private |
| ReportUpdater | Job scheduler | Private |
| Superset | Web interface for data visualization and exploration | Private |
| Jupyter | Hosted notebooks for data analysis | Private |
| Turnilo | Web interface for exploring data stored in Druid | Private |
| Wikistats (1 and 2) | Community dashboard with high-level metrics | Public |
| Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private |
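EventStreams is the fully public streaming interface in the table above, exposing MediaWiki events over HTTP Server-Sent Events. The sketch below reads a few events from the public recentchange stream using only the requests library; the stream URL is the documented public endpoint, while the event fields printed (wiki, user, title) follow the mediawiki.recentchange schema and should be verified against the EventStreams documentation.

```python
# Minimal sketch: tailing the public EventStreams recentchange feed.
# EventStreams delivers Server-Sent Events (SSE) over HTTPS; each event's
# JSON payload arrives on a "data:" line. Field names used below are
# assumptions based on the mediawiki.recentchange schema.
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def tail_recent_changes(limit=5):
    """Print a handful of events from the public recentchange stream."""
    seen = 0
    with requests.get(STREAM_URL, stream=True, timeout=60) as response:
        for line in response.iter_lines():
            # SSE frames carry the JSON payload on "data:" lines;
            # comments, event names, and keep-alives can be skipped.
            if not line.startswith(b"data: "):
                continue
            event = json.loads(line[len(b"data: "):])
            print(event.get("wiki"), event.get("user"), event.get("title"))
            seen += 1
            if seen >= limit:
                break

if __name__ == "__main__":
    tail_recent_changes()
```

Because the stream is public, this kind of consumer works from anywhere and needs no WMF credentials, unlike the private systems in the rest of the table.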
Data platform operations
Find ops week and other process documentation at Data Platform Engineering on Wikitech and the project pages on MediaWiki.org.
The list of scheduled manual maintenance tasks is documented at Data Platform/Systems/Manual maintenance.