The Wikimedia Foundation's Data Engineering team is part of the Technology department.
We provide the Wikimedia analytics data platform, making Wikimedia-related data available for querying and analysis to WMF, the wiki communities, and other stakeholders. We develop infrastructure so that all our users can access data in a self-service fashion that is consistent with the values of the movement.
We keep all our documentation here on Wikitech. See also this FAQ.
About us - Data Engineering/Team
Our team provides a self-service, privacy-aware data platform that empowers people to gain data-driven insights and build better product experiences for Wikimedia communities.
If you have questions about our work or the infrastructure we provide, you can contact us in two ways:
- on our public mailing list, email@example.com (subscribe, archives)
- in our public IRC channel, #wikimedia-analytics (connect). You can use the keyword a-team to ping us so that we notice your question.
The analytics team uses Phabricator to track its projects.
- https://phabricator.wikimedia.org/tag/data-engineering/ for backlog triage
- https://phabricator.wikimedia.org/tag/data-engineering-kanban/ for in progress tasks
- Webrequests [Traffic logs] and derived tables, including:
- MediaWiki raw databases
- EventLogging (in the event database in Hive)
- Edits history, Page history, User history
- Other reports
We maintain the big data platform including the data lake, ingestion and processing pipelines, as well as a number of systems to explore and visualize the data.
|System name and link||Type||Accessibility|
|Archiva||Repository for Java archives||Private|
|AQS - Analytics Query Service||REST API for analytics data||Public|
|Clients (stat100X)||Analytics client nodes to access Hadoop and various services||Private|
|Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...)||Hadoop||Private|
|Dashiki||Framework for building dashboards||Public|
|Druid||Data storage engine optimized for exploratory analytics||Private|
|EventLogging||Ad-hoc streaming pipeline||Private|
|EventStreams||MediaWiki event streams||Public|
|Hue||Web interface for Hive, Oozie, and other Cluster services||Private|
|Kafka||Data transport and streaming system||Private|
|MariaDB||Data storage for MediaWiki replicas and EventLogging||Private|
|Matomo (formerly known as Piwik)||Small-scale web analytics platform||Private|
|Presto||High-performance SQL query engine for big data||Private|
|Superset||Web interface for data visualization and exploration||Private|
|Jupyter||Hosted notebooks for data analysis||Private|
|Turnilo||Web interface for exploring data stored in Druid||Private|
|Wikistats (1 and 2)||Community Dashboard with high-level metrics||Public|
The list of scheduled manual maintenance tasks is documented here.
Try it out! Analytics/Tutorials
We'd rather you have fun with our data :)
Please check the link above for something that might help you, and let us know if you don't find what you're after.
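As one possible starting point, the public AQS REST API listed in the systems table can be queried over plain HTTP. The sketch below only builds a request URL for per-article pageviews. The endpoint path, the `all-access` and `all-agents` defaults, and the helper name `pageviews_url` are assumptions drawn from the public AQS documentation rather than from this page, so check them against the current API reference before relying on them.

```python
from urllib.parse import quote

# Assumed base path of the AQS per-article pageviews endpoint (not stated on this page).
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="all-agents", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    project: wiki domain without the TLD, e.g. "en.wikipedia"
    article: page title; spaces are turned into underscores
    start, end: dates formatted as YYYYMMDD
    """
    # Percent-encode the title fully so characters such as "/" stay inside one path segment.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([AQS_BASE, project, access, agent, title, granularity, start, end])


if __name__ == "__main__":
    print(pageviews_url("en.wikipedia", "Albert Einstein", "20240101", "20240107"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. The helper deliberately does no network access, so it can be reused in notebooks or scripts that handle the request themselves.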
Table of Contents
Go to the Analytics/TOC page for a list of all pages we have under Analytics.