Data Platform

From Wikitech

Wikimedia's Data Platform is a collection of systems and services that enable data producers and consumers to discover, use, and collect data to derive insights, conduct research, and build new data products. The Data Platform is primarily maintained by the Data Platform Engineering team. To contact the team, use the intake process.

Get started

The Data Platform provides access to private data and internal WMF resources, so you must have specialized data access to use it. For public, open-access Wikimedia data and tools, see meta:Research:Data.

Find datasets and documentation for WMF private data sources.

Use SQL query engines, Jupyter notebooks, libraries, and compute resources to explore and analyze data.

Create and share derivative datasets, reports, and dashboards based on existing Wikimedia data sources.

Use the Metrics Platform to configure instruments and collect analytics data.

  • Advanced users: use the Event Platform to configure and deploy event streams.

Data platform infrastructure

Data platform systems and infrastructure include the data lake, ingestion and processing pipelines, and production search and query services.

Data pipelines

Information about data pipelines is currently at:

Search data and services

Overview of data platform systems

The following list highlights some major Data Platform systems. For more details and a full list of Data Platform system documentation pages on this wiki, see Data_Platform/Systems.

System name and link | Type | Accessibility
Airflow | Workflow job scheduler | Private
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data | Public
Ceph | Software-defined storage, serving block and object storage | Private
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop | Private
Datahub | Data catalog | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | MediaWiki event streams | Public
Hue | Web interface for Hive, Oozie, and other Cluster services | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community dashboard with high-level metrics | Public
Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private
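
As an illustration of one of the public systems listed above, the following is a minimal sketch of building a request to the AQS pageviews per-article endpoint. The helper names (per_article_url, fetch_pageviews), the default parameter values, and the User-Agent string are assumptions made for this example and are not part of any Data Platform library; check the AQS documentation for the authoritative parameters and usage policy.

```python
import json
import urllib.request
from urllib.parse import quote

# Public AQS pageviews endpoint root (served through the Wikimedia REST API).
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    start and end are date strings in YYYYMMDD form.
    """
    # Page titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{AQS_BASE}/per-article/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")


def fetch_pageviews(url):
    """Fetch the URL and return the list under the "items" key (hypothetical helper)."""
    # The User-Agent value is a placeholder; replace it with your own contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-script/0.1 (you@example.org)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["items"]


if __name__ == "__main__":
    url = per_article_url("en.wikipedia.org", "Albert Einstein",
                          "20240101", "20240107")
    for item in fetch_pageviews(url):
        print(item["timestamp"], item["views"])
```

Because AQS is public, this sketch does not need the specialized data access that the private systems in the list require.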

Data platform operations

Find ops week and other process documentation at Data Platform Engineering on Wikitech and the project pages on MediaWiki.org.

The list of scheduled manual maintenance tasks is documented at Data Platform/Systems/Manual maintenance.