Data Platform/Systems/Mediawiki history reduced algorithm

This page describes the algorithm we devised to generate the mediawiki_history_reduced dataset, a dataset optimized for Druid to serve Data Platform/AQS requests.

Reduction and filtering optimization

The denormalized mediawiki_history table contains a lot of fields, among which only a small number are useful for Wikistats 2 API. We therefore drop a lot of those fields, and also reorganise them to take advantage of columnar-indexing of druid (basically, better to have an array of tags than plenty of booleans):

- user_type can be one of anonymous, group_bot, name_bot (users having a name that looks like a bot) or user.

- page_type can be one oif content or non_content depending on the page namespace.

- other_tags can contain any of deleted, reverted or revert for revisions; self_created or user_first_24_hours for users (the later is whether a user has made an edit within 24 hours after registration)

Augmentation with pre-computed new events

From the metrics needed for Wikistas 2, one type in particular was causing problems for druid: getting the number of pages or users filtered by their edit activity (e.g. their number of edits). For instance an active editor is calssically a user having made 5 edits or more one a given month. This type of metric is problematic natively for druid because it involves grouping edits by user or page and counting their number of events, then filtering the users or pages not having an appropriate edit-count, and finally count how many users or pages are left. While this seems reasonably simple -- and it is for most query systems -- the dimension cardinality we are dealing with in terms of distinct number of pages or users per month, in addition o the relatively small size of the cluster, makes it a difficult task for druid to handle correctly. So we cheated :)

We created a new event type called digest. We generate digests daily and monthly that store precomputed number of edits per user or page (the field is named revisions), denormalized over the dimensions of interest (namely user_type and page_type).

Those represent a big number of new events, but since we're not interested in keeping the actual user_id or page_id of the user having made the edits, those events are well suited for druid to pre-aggregate them. Finaly, when looking at digest events, we know that individual events having the revisions count field matching 5 or more is our number of active editors.

Notice: Since we need to denormalize over user_type and page_type, we need to explicitly set user_type = 'all' or page_type = 'all' in case no filter along those dimensions is specified. This denormalization is the price to pay for precomputing non-additive metrics.