What We Own
We maintain the big data platform, including the data lake, the ingestion and processing pipelines, and a number of systems for exploring and visualizing the data.
Please see Analytics/Systems for a more comprehensive list of the systems we maintain.
|System name and link||Type||Accessibility|
|Archiva||Repository for Java archives||Private|
|AQS - Analytics Query Service||REST API for analytics data||Public|
|Clients (stat100X)||Analytics client nodes to access Hadoop and various services||Private|
|Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...)||Hadoop||Private|
|Dashiki||Framework for building dashboards||Public|
|Druid||Data storage engine optimized for exploratory analytics||Private|
|EventLogging||Ad-hoc streaming pipeline||Private|
|EventStreams||MediaWiki event streams||Public|
|Hue||Web interface for Hive, Oozie, and other Cluster services||Private|
|Kafka||Data transport and streaming system||Private|
|MariaDB||Data storage for MediaWiki replicas and EventLogging||Private|
|Matomo (formerly known as Piwik)||Small-scale web analytics platform||Private|
|Presto||High-performance SQL query engine for big data||Private|
|Superset||Web interface for data visualization and exploration||Private|
|Jupyter||Hosted notebooks for data analysis||Private|
|Turnilo||Web interface for exploring data stored in Druid||Private|
|Wikistats (1 and 2)||Community Dashboard with high-level metrics||Public|
The list of scheduled manual maintenance tasks is documented here.
Please also refer to Analytics/Data Lake for more links to reference material.
- Webrequests [Traffic logs] and derived tables, including:
- MediaWiki raw databases
- EventLogging (in the event database in Hive)
- Edits history, Page history, User history
- Other reports
Please also refer to Analytics/Systems/Cluster for more reference information about the pipelines we manage.
- Traffic data
  - Webrequest, pageviews, and unique devices
- Edits data
  - Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
  - Wikitext (latest & historical) and wikidata-entities
- Events data
  - EventLogging, EventBus and event streams data (raw, refined, sanitized)
- ORES scores