Data Engineering/Ownership

From Wikitech

What We Own

We are responsible for a number of systems, datasets, and pipelines.

Systems

We own the data platform, including the data lake and the ingestion and processing pipelines.

Please see Analytics/Systems for more details about the Analytics/Cluster and its ecosystem of analytics-related tooling.

Please see Data Engineering/Systems for a list of the additional systems we maintain. These support the Data Platform but are not strictly related to analytics.

System name | Type | Accessibility
Airflow | Workflow job scheduler | Private
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data | Public
Ceph | Software-defined storage, serving block and object storage | Private
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop | Private
Datahub | Data catalog | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | MediaWiki event streams | Public
Hue | Web interface for Hive, Oozie, and other Cluster services | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community dashboard with high-level metrics | Public
Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private
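
AQS is one of the few systems listed as Public, so it can be queried without cluster access. The following is a minimal sketch of how a per-article pageviews request URL might be assembled. The base URL and the path layout (project, access, agent, article, granularity, start, end) are assumptions drawn from the public pageviews API and should be checked against the current AQS documentation. The helper name build_pageviews_url is hypothetical and is not part of any library.

```python
from urllib.parse import quote

# Assumed base URL of the public per-article pageviews endpoint served by AQS.
# Verify against the current AQS documentation before relying on it.
AQS_PAGEVIEWS_BASE = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
)


def build_pageviews_url(project, article, start, end,
                        access="all-access", agent="all-agents",
                        granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    start and end are dates in YYYYMMDD form. Spaces in the article
    title become underscores, and the title is percent-encoded so that
    characters such as "/" do not break the path.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([
        AQS_PAGEVIEWS_BASE, project, access, agent,
        title, granularity, start, end,
    ])


if __name__ == "__main__":
    # Only builds and prints the URL; no network request is made here.
    print(build_pageviews_url("en.wikipedia", "Data lake",
                              "20240101", "20240107"))
```

The sketch only constructs the URL. Fetching it, setting a descriptive User-Agent, and handling rate limits are left to the caller.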

The list of scheduled manual maintenance tasks is documented at Analytics/Systems/Manual maintenance.

Datasets

Please also refer to Analytics/Data Lake for more links to reference material.

Pipelines

Please also refer to Analytics/Systems/Cluster for more reference information about the pipelines we manage.

Traffic data: Webrequest, pageviews, and unique devices
Edits data: Historical data about revisions, pages, and users (e.g. MediaWiki History)
Content data: Wikitext (latest & historical) and wikidata-entities
Events data: EventLogging, EventBus and event streams data (raw, refined, sanitized)
ORES scores