The Wikimedia Foundation's Analytics Engineering team is part of the Technology department.
The Analytics Engineering Team's primary responsibility is to "empower and support data informed decision making across the Foundation and the Community".
We make Wikimedia-related data available for querying and analysis to both the WMF and the various wiki communities and stakeholders.
We develop infrastructure so all our users, both within the Foundation and within the different communities, can access data in a self-service fashion that is consistent with the values of the movement.
We keep all our documentation here on Wikitech.
About us - Analytics/Team
If you have questions about our work or the infrastructure we provide, you can contact us in two ways:
- on our public mailing list, email@example.com (subscribe, archives)
- in our public IRC channel, #wikimedia-analytics. You can use the keyword a-team to ping us, so we notice your question.
The analytics team uses Phabricator to track its projects.
- https://phabricator.wikimedia.org/tag/analytics/ for backlog triage
- https://phabricator.wikimedia.org/tag/analytics-kanban/ for in progress tasks
July 2017 to July 2018
We will continue our work on the Wikistats 2.0 UI and backend. While we believe the community will benefit from this work immediately, we also want to make a rawer version of the new edit data lake available on Labs, so it can be used as a data source for edit data and as a backend for tools such as Quarry.
We will also work on the privacy of our data and resume our efforts to anonymize pageview datasets. Other areas of focus will be bot identification in pageview data (we know we underreport bots) and the migration of EventLogging off MariaDB as its backend storage; this last project benefits only the WMF.
July 2016 to July 2017
After last year's efforts to gather and organize data about readers, which culminated in the launch of the Pageview API and the computation of unique devices for Wikimedia projects, our top priority this year is to create a data depot for editing data. In other words: data first, tools second. It is worth noting that, although Wikimedia projects are driven by an editing ecosystem, we do not have a pool of edit data comparable to the one we have for pageview-based data: analytics-friendly and easily accessible. This is part of the project we call Wikistats 2.0, in which we will revamp the frontend and backend of http://stats.wikimedia.org. The new data depot for edit data will improve (and fundamentally change) the method, time and resources needed to calculate edit metrics, for both the WMF and the community. We will focus first on the data and second on the tools and UIs to consume it; we expect this project to take a bit longer than a year.
The next set of priorities concerns improving data display tools for the Foundation and the community. We will focus on creating a better visual interface for our pageview data that non-technical users can use, and we will gradually expand the data available visually through the year.
Our top priority is always operational: we need to keep the lights on and maintain the current level of service in the analytics stack, which includes Kafka, the Hadoop cluster, EventLogging, the Pageview API and others.
We maintain various datasets. You can browse them in two ways:
By access system
- Data Lake [Hadoop cluster]
- AQS - Analytics Query Service [TODO - Create new page instead of redirect with Systems]
- Druid and Turnilo (formerly Pivot)
- ReportUpdater reports
- Wikistats 2
- Ad hoc datasets published with documentation by researchers and analysts at WMF
By data type
- Webrequests [Traffic logs]
- Pageviews [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
- MediaWiki raw databases
- Edits history, Page history, User history
- Other reports
- Inter-language [Traffic between different languages of the same project family]
Systems - Analytics/Systems
We maintain various systems that allow our datasets to be queried in different ways.
|System name and link||Type||Accessibility|
|Cluster (Hadoop, Camus, Hive, Oozie, Spark...)||Hadoop||Private|
|AQS - Analytics Query Service||REST API||Public|
|Druid - Fast OLAP||API + User Interface||Private|
|EventLogging||Ad-hoc streaming pipeline||Private|
|EventStreams||MediaWiki event streams||Public|
|Wikimetrics||Ad-hoc tools (FRTech)||Private|
|Piwik||Web Analytics (small scale)||Private|
|Wikistats (1 and 2)||Community Dashboard with top level metrics||Public|
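As an illustration of the public REST access listed above, here is a minimal sketch of building a per-article request for the Pageview API served by AQS and summing the views in a response. The base URL, the path layout, the default parameter values, and the helper names `per_article_url` and `total_views` are assumptions made for this example, not taken from this page; check the AQS documentation before relying on them.

```python
from urllib.parse import quote

# Assumed base URL of the public Pageview API served by AQS (not stated on this page).
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    start and end are dates written as YYYYMMDD strings.
    """
    # Article titles use underscores instead of spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{AQS_BASE}/per-article/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded JSON response.

    The response shape ({"items": [{"views": ...}, ...]}) is an assumption.
    """
    return sum(item["views"] for item in payload.get("items", []))


url = per_article_url("en.wikipedia", "Albert Einstein", "20170701", "20170707")
print(url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request`, and the decoded JSON passed to `total_views`. Network access is deliberately left out of the sketch so it runs offline.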
Try it out! Analytics/Tutorials
We'd rather you have fun with our data :)
Please check the link above for something that might help you, and let us know if you don't find what you're looking for.
Table of Contents
Go to the Analytics/TOC page for a list of all the pages we have under Analytics.