Data Engineering/Ownership

What We Own
We are responsible for a number of systems, datasets, and pipelines.
Systems
We maintain the big data platform, including the data lake, the ingestion and processing pipelines, and a number of systems for exploring and visualizing the data.
Please see Data Engineering/Systems for a more comprehensive list of the systems we maintain.

System name and link | Type | Accessibility
---|---|---
Airflow | Workflow Job Scheduler | Private
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data (see the example request below) | Public
Ceph | Software-defined storage, serving block and object storage | Private
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop cluster | Private
Datahub | Data Catalog | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | Public MediaWiki event streams | Public
Hue | Web interface for Hive, Oozie, and other Cluster services | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job Scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community dashboard with high-level metrics | Public
Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private
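AQS is the one fully public API in the table above. Below is a minimal sketch of requesting per-article pageview counts through the public Wikimedia REST API; the endpoint path, the example article, and the date range are illustrative assumptions rather than an authoritative reference, so check the AQS documentation for the exact contract.

```python
import requests

# Hypothetical example: per-article pageviews from AQS via the public
# Wikimedia REST API. The path segments and values below are assumptions
# for illustration; consult the AQS documentation for the real reference.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
url = "/".join([
    BASE,
    "en.wikipedia",       # project
    "all-access",         # access method (desktop, mobile-web, ...)
    "user",               # agent type (user, spider, ...)
    "Data_engineering",   # article title (URL-encoded if needed)
    "daily",              # granularity
    "20240101",           # start date (YYYYMMDD)
    "20240107",           # end date (YYYYMMDD)
])

# Wikimedia APIs ask clients to send a descriptive User-Agent.
resp = requests.get(url, headers={"User-Agent": "ownership-docs-example/0.1"})
resp.raise_for_status()

for item in resp.json().get("items", []):
    print(item["timestamp"], item["views"])
```
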
The list of scheduled manual maintenance tasks is documented at Analytics/Systems/Manual maintenance.
Datasets
Please also refer to Analytics/Data Lake for more links to reference material.
- Webrequests [Traffic logs] and derived tables, including:
  - Pageviews [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
  - Inter-language [Traffic between different languages of the same project family]
  - Unique Devices [Estimates of unique devices at the project or project-family level]
- MediaWiki raw databases
- EventLogging (in the event database in Hive; see the query sketch after this list)
- Edits history, Page history, User history
- Other reports
- Clickstream
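Most of these datasets are queried from the analytics clients (the stat100X hosts in the systems table) rather than directly. Below is a minimal sketch of reading event data from the Hive event database with Wmfdata-Python; the table name, field names, and partition columns are assumptions for illustration, and the `wmf.spark.run` call reflects my understanding of the package's API rather than a verbatim reference.

```python
import wmfdata as wmf

# Hypothetical query against a table in the Hive `event` database.
# `navigationtiming`, `meta.dt`, and the year/month partitions are
# illustrative assumptions; substitute a real event table and schema.
query = """
SELECT
    substr(meta.dt, 1, 10) AS day,
    COUNT(*)               AS events
FROM event.navigationtiming
WHERE year = 2024 AND month = 1
GROUP BY substr(meta.dt, 1, 10)
ORDER BY day
"""

# wmfdata is assumed to return a pandas DataFrame from its query
# backends (spark, hive, presto); here the Spark backend is used.
df = wmf.spark.run(query)
print(df.head())
```
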
Pipelines
Please also refer to Analytics/Systems/Cluster for more reference information about the pipelines we manage.
- Traffic data
  - Webrequest, pageviews, and unique devices
- Edits data
  - Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
  - Wikitext (latest & historical) and wikidata-entities
- Events data
  - EventLogging, EventBus and event streams data (raw, refined, sanitized); see the consumption sketch after this list
- ORES scores
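The events pipeline above also feeds the public EventStreams service from the systems table. Below is a minimal sketch of tailing one stream over server-sent events; the stream name and URL follow the public endpoint as I understand it, and the hand-rolled SSE parsing is a simplification (a dedicated SSE client library would be more robust for multi-line payloads and reconnection).

```python
import json
import requests

# Hypothetical example: tail the public `recentchange` stream from
# EventStreams. Adjust the stream name to follow a different topic.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(URL, stream=True,
                  headers={"User-Agent": "ownership-docs-example/0.1"}) as resp:
    resp.raise_for_status()
    count = 0
    for line in resp.iter_lines(decode_unicode=True):
        # Server-sent events: payload lines are prefixed with "data: ".
        if not line or not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        print(event.get("wiki"), event.get("type"), event.get("title"))
        count += 1
        if count >= 10:   # stop after a handful of events for the demo
            break
```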