Data Platform/Systems/Mediawiki History Snapshot Check
This page describes the validation step used to check the mediawiki_user_history, mediawiki_page_history, mediawiki_history and mediawiki_history_reduced datasets.
Jobs organization
Almost the same process is applied to each of the four datasets, with some grouping:
- The jobs for mediawiki_user_history, mediawiki_page_history and mediawiki_history are co-located in the same oozie job. The code is in the refinery repository, in the oozie/mediawiki/history/check_denormalize folder. The validation step happens for the user, page and denormalized datasets in that order, and triggers adding the new snapshots to the hive tables if the data validates.
- The job for mediawiki_history_reduced is part of the dataset generation and druid indexation, preventing druid indexation if the generated data doesn't validate.
How do we check?
Main concepts
The main idea is to compare a newly computed snapshot with a previously computed one used as reference (and therefore assumed correct). Each of the four datasets to validate has one or more event-entities, one or more event-types and many wikis. Each dataset also has a set of dimensions that can be used for certain event-entities and event-types. For the user and revision event-entities, a new snapshot should always contain more events than the reference snapshot. This is enforced by the fact that no deletion occurs in those event spaces (revisions come from the revision and archive tables)[1]. The page event-entity is different, as pages can be deleted and we don't yet have deleted-page events present in the dataset. When a page is deleted, every event that occurred to this page disappears from the new snapshot, even if it happened in the past. This leads to some wikis having a decreasing number of page events from one snapshot to the next, particularly small wikis, where variability in terms of number of events is very high. To prevent this variability from affecting the result too much, we use only the top wikis in terms of event activity to validate the snapshots.
Algorithm
Parameters: new snapshot (dataset), previous snapshot (dataset), X (for top X wikis), minGrowth (minimum growth as a ratio), maxGrowth (maximum growth as a ratio), maxErrors (maximum ratio of rows in error).
- The new and previous snapshot datasets are aggregated, grouping by wiki_db, event_entity and event_type, and using COUNT(DISTINCT value), SUM(IF(value, 1, 0)) or SUM(value) depending on the type of the value to aggregate.
- The top X wikis by number of events are computed, and the aggregated datasets are filtered to only keep values for those top wikis.
- The aggregated-and-filtered datasets are joined on wiki_db, event_entity and event_type equality, and the growth of every metric is computed as (new_agg_value - prev_agg_value) / prev_agg_value (values are coalesced to prevent nulls and divisions by 0).
- The joined dataset is then filtered, keeping only error-rows, i.e. rows having values not comprised between minGrowth and maxGrowth[2].
- Finally, we write the errors (meaning we report them) if the ratio (number of error-rows) / (number of metric rows) is higher than maxErrors (a simplified code sketch of these steps follows this list).
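A minimal sketch of these steps, written with Spark DataFrames in Scala, is shown below. It is illustrative only: the object, function and column names (SnapshotCheckSketch, checkGrowth, events, etc.) are assumptions rather than the actual refinery code, and it uses a single event count per group where the real job computes several metrics depending on the column type.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Illustrative sketch only -- names and structure are assumptions,
// not the actual refinery implementation.
object SnapshotCheckSketch {

  // Step 1: aggregate a snapshot by (wiki_db, event_entity, event_type).
  // Here a single event count stands in for the per-column metrics.
  def aggregate(snapshot: DataFrame): DataFrame =
    snapshot
      .groupBy("wiki_db", "event_entity", "event_type")
      .agg(count(lit(1)).as("events"))

  // Returns true if the new snapshot validates against the previous one.
  def checkGrowth(
      newSnapshot: DataFrame,
      prevSnapshot: DataFrame,
      topXWikis: Int,
      minGrowth: Double,
      maxGrowth: Double,
      maxErrors: Double
  ): Boolean = {

    val newAgg  = aggregate(newSnapshot)
    val prevAgg = aggregate(prevSnapshot)

    // Step 2: keep only the top X wikis by number of events.
    val topWikis = newAgg
      .groupBy("wiki_db")
      .agg(sum("events").as("total_events"))
      .orderBy(desc("total_events"))
      .limit(topXWikis)
      .select("wiki_db")

    // Step 3: join both aggregations and compute the growth of each metric,
    // coalescing values and guarding against division by zero.
    val joined = newAgg.as("n")
      .join(prevAgg.as("p"), Seq("wiki_db", "event_entity", "event_type"))
      .join(topWikis, Seq("wiki_db"))
      .withColumn(
        "growth",
        (coalesce(col("n.events"), lit(0L)) - coalesce(col("p.events"), lit(0L))) /
          greatest(coalesce(col("p.events"), lit(0L)), lit(1L))
      )

    // Step 4: error-rows are those whose growth is outside [minGrowth, maxGrowth].
    val errorRows = joined.filter(col("growth") < minGrowth || col("growth") > maxGrowth)

    // Step 5: the check fails (errors are reported) if the error ratio exceeds maxErrors.
    val errorRatio = errorRows.count().toDouble / joined.count().toDouble
    errorRatio <= maxErrors
  }
}
```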
What do we check?
After some trial and error, we currently use the following values (a usage example follows the list):
- X = 50 -- We use the top 50 wikis in number of events computed for the given dataset. Smaller wikis might have too much variability for us to check events.
- minGrowth = -0.01 -- We accept metrics showing a very small loss (less than 1%). This covers the cases of data being deleted for PII reasons (see [1]).
- maxGrowth = 1.0 -- We accept up to 100% growth in metrics. This threshold is the one that is strongly correlated with X.
- maxErrors = 0.05 -- While we'd rather accept only a very small number of rows in error, the page event-entity problem described above prevents us from having this threshold lower. For X = 50 and 4 event-types (as for pages), we expect 200 metric rows, and we accept 10 being in error.
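As an illustration, invoking the hypothetical checkGrowth sketch above with these values could look like the snippet below. The table name and snapshot partitions are examples only, not the actual job configuration.

```scala
import org.apache.spark.sql.SparkSession

// spark-shell style usage; assumes SnapshotCheckSketch from the sketch above.
val spark = SparkSession.builder().getOrCreate()

val snapshotOk = SnapshotCheckSketch.checkGrowth(
  newSnapshot  = spark.read.table("wmf.mediawiki_history").where("snapshot = '2021-03'"),
  prevSnapshot = spark.read.table("wmf.mediawiki_history").where("snapshot = '2021-02'"),
  topXWikis = 50,      // X = 50
  minGrowth = -0.01,   // accept up to 1% loss
  maxGrowth = 1.0,     // accept up to 100% growth
  maxErrors = 0.05     // at most 5% of metric rows may be in error
)
```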
- ↑ There are special cases where deletions are manually enforced to prevent PII leaking. Those cases are seldom enough not to trigger errors based on the thresholds we use (see the What do we check? section above).
- ↑ The page_is_redirect dimension is a special case here, as it is not historified (we don't have historical values yet, since computing them involves parsing the whole revision-text history). Because it is not historified but is still reported back in time (we propagate it backwards), the measure is not growth but variability, and can be negative as well as positive. For this special case we use -maxGrowth as the lower threshold instead of minGrowth.
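In terms of the sketch above, this special case could be expressed as follows (again an illustrative assumption, not the actual implementation):

```scala
// Illustrative only: for non-historified dimensions such as page_is_redirect,
// the accepted range is symmetric, [-maxGrowth, maxGrowth], instead of
// [minGrowth, maxGrowth].
def isErrorRow(growth: Double, isVariabilityMetric: Boolean,
               minGrowth: Double, maxGrowth: Double): Boolean = {
  val lowerBound = if (isVariabilityMetric) -maxGrowth else minGrowth
  growth < lowerBound || growth > maxGrowth
}
```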