Analytics

From Wikitech

The Wikimedia Foundation's Analytics Engineering team is part of the Technology department.

The Analytics Engineering team's primary responsibility is making Wikimedia-related data available for querying and analysis to the WMF, the different wiki communities, and other stakeholders.

We develop infrastructure so all our users, both within the Foundation and within the different communities, can access data in a self-service fashion that is consistent with the values of the movement.

About us - Analytics/Team

Contact

To reach the team, please email its public mailing list, analytics@lists.wikimedia.org (subscribe, archives).

Work organization

The analytics team uses Phabricator to track its projects.

Prioritization

July 2017 to July 2018

We will continue our work on the UI and backend of Wikistats 2.0. While we believe the community will benefit from this work immediately, we also want to make a rawer version of the new edit data lake available on Labs, so that it can be used as a data source for edit data and as a backend for tools such as Quarry.

We will also be working on the privacy of our data and resuming our efforts to anonymize pageview datasets. Other areas of focus will be bot identification in pageview data (we know we underreport bot traffic) and the migration away from MariaDB as backend storage for EventLogging; this last project benefits only the WMF.

July 2016 to July 2017

Last year's efforts around gathering and organizing data about readers culminated in the launch of the Pageview API and the computation of unique devices for Wikimedia projects. Our top priority this year is to create a data depot for editing data. In other words: data first, tools second. It is worth noting that, although Wikimedia projects are driven by an editing ecosystem, we do not have a pool of edit data similar to the one we have for pageview-based data: analytics-friendly and easily accessible. This work is part of the project we call Wikistats 2.0, in which we will be revamping the frontend and backend of http://stats.wikimedia.org. The new data depot for edit data will improve, and fundamentally change, how edit metrics are calculated and the time and resources that takes, for both the WMF and the community. We will focus first on the data and second on the tools and UIs that consume it. We expect this project to take a bit longer than a year.

The next set of priorities has to do with improving data display tools for the Foundation and the community. We will focus on creating a better visual interface for our pageview data that non-technical users can use, and we will gradually expand the data that is available visually through the year.

Our top priority is always operational: we need to keep the lights on and maintain the current level of service in the analytics stack, which includes Kafka, the Hadoop cluster, EventLogging, the Pageview API and others.

Datasets

We maintain various datasets, and we provide two ways to access them:

By access system

By data type

  • Webrequests [Traffic logs]
  • Pageviews [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
  • Mediawiki databases
  • EventLogging
  • Edits history
  • Other reports

Systems - Analytics/Systems

We maintain various systems that allow our datasets to be queried in different ways.

System name and link | Type | Accessibility
Cluster (Hadoop, Camus, Hive, Oozie, Spark...) | Hadoop | Private
AQS - Analytics Query Service | REST API | Public
Druid - Fast OLAP | API + User Interface | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | Mediawiki events streams | Public
ReportUpdater | Job Scheduler | Private
Wikimetrics | Ad-hoc tools (FRTech) | Private
Piwik | Web Analytics (small scale) | Private
Archiva | Jar repository | Private
Kafka | Distributed log | Private
Dashiki | Dashboarding | Public
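
As an illustration of what the public AQS REST API in the table above offers, here is a minimal Python sketch that builds a per-article request URL for the Pageview API and, optionally, fetches daily view counts. The endpoint layout follows the public Wikimedia REST API as we understand it; the helper names (pageviews_url, fetch_daily_views), the example article and the User-Agent string are assumptions made for this example and are not part of the team's tooling.

```python
import json
import urllib.request
from urllib.parse import quote

# Per-article endpoint of the public Pageview API (assumed to still be current).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article Pageview API URL (hypothetical helper).

    start and end are date strings in YYYYMMDD form.
    """
    # Titles use underscores for spaces and must be fully URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title, granularity, start, end])


def fetch_daily_views(project, article, start, end):
    """Return (timestamp, views) pairs; performs a real HTTP request."""
    req = urllib.request.Request(
        pageviews_url(project, article, start, end),
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "analytics-wiki-example/0.1 (you@example.org)"},
    )
    with urllib.request.urlopen(req) as resp:
        items = json.load(resp)["items"]
    return [(item["timestamp"], item["views"]) for item in items]
```

Calling fetch_daily_views("en.wikipedia", "Albert Einstein", "20170701", "20170707") should return one pair per day in that range, provided the service is reachable and the article had recorded views.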

Try it out! Analytics/Tutorials

We'd love for you to have fun with our data :)

Please check the link above for something that might help you, and let us know if you don't find what you're after.

Table of Contents

Go to the Analytics/TOC page for a list of all the pages we have under Analytics.