Analytics/Archive/2015 data warehouse experiments
This page contains historical information. It is probably no longer true.
This document describes our current use cases for the data warehouse. We need the data warehouse because the database schema that currently exists is not optimal for us to do intensive queries to calculate editing data, specially for large wikis. We found out that we simply can't calculate the data we need in a timely fashion.
Although we have some established initial use cases, we are signing up for a continuous effort to optimize and evolve with new use cases. New metrics and new ways of using the data warehouse are going to happen in the near future and we hope the schema we're building allows us to adapt easier than the vanilla mediawiki schema.
We made a rough draft of the schema so we can have something concrete to talk about.
Also, how to manage migrations in the warehouse via alembic: 
Our primary use cases deal with retrieval of editing and registration data for all wikis on a daily basis. Here are are three sample use cases, there are many more but roughly they all deal with the same data and do a slightly different set of computations.
Newly Registered Users
Thorough description here
Newly registered users for September 1st for enwiki:
select count(*) from user where user_registration between '2014-09-01 00:00:00' and '2014-09-01 23:59:59' and wiki='enwiki' and registration_type='create';
Rolling recurring old active editors
Thorough description here
Note that this metric looks back 60 days when it comes to registration data and 30 days for editing data.
Rolling recurring old active editors for enwiki as of September 1st
select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time between '2014-07-01' and '2014-09-01' and user.user_registration < '2014-07-01' and user.in_bot_user_group = 0 and edit.wiki = 'enwiki' group by user_id having sum(if(time <= '2014-08-01', 1, 0)) >= 5 and sum(if(time > '2014-08-01', 1, 0)) >= 5 order by null;
analytics-store warehouse: 2 min 45 sec
Rolling new active editors
Thorough description here
Rolling new active editors for enwiki as of September 1st Note that this metric looks back 30 days in registration data and edits
select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and user.in_bot_user_group = 0 and edit.wiki = 'enwiki' group by user_id having count(*) > 5 order by null;
analytics-store warehouse: 4 sec
Daily Pages Created
Note that this metric also counts deleted pages and pages in any namespace. See detail description here
Pages created on enwiki
select count(page_id) from page where wiki='enwiki' and page_is_redirect = 0 and valid_from between '2014-10-27 00:00:00' and '2014-10-27 23:59:59'; -- optional: and (@includeArchived = 1 or archived = 0) -- where @includeArchived would be passed in a parameterized query
Initial Data Verifications
Data verifications are moved to their own pages, for readability:
- December 2nd, 2014: Analytics/DataWarehouse/DataVerifications/2014-12-02
- January 14th, 2015: Analytics/DataWarehouse/DataVerifications/2015-01-14
- February 3rd, 2015: Analytics/DataWarehouse/DataVerifications/2015-02-03
Sean has used alembic changes to modify schema on warehouse and started the backfilling again this time from 2010 for enwiki and dewiki. He estimates this will take 24hrs to complete.
After backfilling is done analytics team is to -again- vet the data and see that the bugs we found prior are fixed.
If vetting is positive: Sean is to use repurposed hardware to setup the warehouse and we shall merge the alembic changes. There is no need to request hardware as there is hardware available.
- Analytics vets data, indexes are created
- Backfilling data to 2007
- Decide on usage of warehouse from labs? Can we run queries on prod and just "move" results to labs (so there is no need to optimize data gathering on labs)
- Found things that need correction while doing the data vetting. We agreed on exploring alembic as a way to manage migration of fields. We will exploring alembic as a tool that manages schema changes. If things work we will be adding alembic changes to the changeset un which we currently have the warehouse schema.
- Once changes are done data needs to be vetted again, can we have some automated way of doing this vetting? unit tests? (nuria: sounds more like monitoring than unit tests)
- Sean brought up that in the near term there is no way we could replicate the whole data set for the warehouse to labs in the short term, given labs constrains with space (Side note: replicating some of Event Logging data to labs might pave the way a bit on this process) He suggests that 1) we compute metrics in prod and move those to labs 2) we move wikimetrics to prod
- Sean to test alembic changes here to do alters of warehouse schema: https://gerrit.wikimedia.org/r/#/c/177739/
- Sean to reload datawarehouse data
- Nuria to vet data once loaded again
- Nuria and Dan to investigate testing/vetting of data on an automated fashion
- Nuria and Dan to research an automated way of loading data (kettle, pentanjo stack)
- Dan and Marcel will work on the ETL scripts by loading a small time slice from enwiki and dewiki into staging. When this is complete, they will merge the verified scripts into the data-warehouse repository, and Sean will use them to run over all the data.
- Dan will create a script that adds indices to the Edit schema's event logging table in the log database, to facilitate analysis for the VE team. Once merged in the data-warehouse repository, Sean will apply it to the table.
- Nuria is working on removing the auto-increment id field from Event Logging, work in progress. An interesting FYI is that different variants of UUIDs are possible, with different characteristics that may be useful in the future: http://en.wikipedia.org/wiki/Universally_unique_identifier#Variants_and_versions