Analytics/Systems/Cluster/Mediawiki History Snapshot Check
Almost the same process is applied to each of the 4 datasets, with some grouping:
- Jobs for mediawiki_history are collocated in the same oozie job (the code is in the oozie/mediawiki/history/check_denormalize folder). The validation step happens for the user, page and denormalized datasets, in that order, and triggers adding the new snapshots to the hive tables if the data validates.
- The job for mediawiki_history_reduced is part of the dataset generation and druid indexation, preventing druid indexation if the generated data doesn't validate.
How do we check?
The main idea is to compare a newly computed snapshot with a previously computed one used as a reference (and therefore assumed correct). Each of the four datasets to validate has one or more event entities, one or more event types and many wikis. Each dataset also has a set of dimensions that can be used for certain event entities and types.
For the revision event entity, a new snapshot should always contain more events than the reference snapshot. This is guaranteed by the fact that no deletion occurs in that event space (revisions come from the revision and archive tables).
The page event entity is different, as pages can be deleted and we don't yet have deleted-page events in the dataset. When a page is deleted, every event that occurred to this page disappears from the new snapshot, even those in the past. This leads to some wikis having a decreasing number of page events from one snapshot to the next, particularly small wikis, where the variability in the number of events is very high. To keep this variability from affecting the result too much, we use only the top wikis in terms of event activity to validate the snapshots.
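To make the variability argument concrete, here is a small Python sketch with made-up event counts (they are not measurements from any real wiki). It assumes a single page deletion removes 5 page events and, for simplicity, ignores events added between the two snapshots.

```python
def growth(prev_count: int, new_count: int) -> float:
    """Relative growth of an event count between two snapshots."""
    return (new_count - prev_count) / prev_count


# Hypothetical counts: the same deletion removes 5 page events in both wikis.
small_wiki = growth(prev_count=20, new_count=15)
large_wiki = growth(prev_count=1_000_000, new_count=999_995)

print(f"small wiki: {small_wiki:+.2%}")
print(f"large wiki: {large_wiki:+.4%}")
```

The same deletion costs the small wiki a quarter of its page events, while the large wiki barely moves, which is why restricting the check to the most active wikis gives a more stable signal.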
The check takes the following parameters:
new snapshot (dataset),
previous snapshot (dataset),
X (for top X wikis),
minGrowth (minimum growth as a ratio),
maxGrowth (maximum growth as a ratio),
maxErrors (maximum ratio of rows in error).
- The new and previous snapshot datasets are aggregated, grouping by event_type, and using SUM(IF(value, 1, 0)) or SUM(value), depending on the value type, to aggregate.
- The top X wikis by number of events are computed, and the aggregated datasets are filtered to keep only values for those top wikis.
- The aggregated-and-filtered datasets are joined on event_type equality, and the growth of every metric is computed as (new_agg_value - prev_agg_value) / prev_agg_value (values are coalesced to prevent nulls and divisions by 0).
- The joined dataset is then filtered, keeping only error rows, meaning rows whose growth is not between minGrowth and maxGrowth.
- Finally, we write the errors (meaning we report them) if the ratio (number of error-rows) / (number of metric rows) is higher than maxErrors.
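The steps above can be sketched in plain Python as follows. This is an illustration only, not the code of the actual job. Several details are assumptions: rows are modelled as (wiki, event_type, value) tuples, the join key also includes the wiki, the top wikis are ranked on the new snapshot, missing values are coalesced to 0, and a zero previous value is replaced by 1 in the denominator. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate(rows):
    """Aggregate values per (wiki, event_type)."""
    agg = defaultdict(float)
    for wiki, event_type, value in rows:
        # Booleans count as 1/0, like SUM(IF(value, 1, 0)); numbers are summed, like SUM(value).
        agg[(wiki, event_type)] += (1 if value else 0) if isinstance(value, bool) else value
    return dict(agg)


def top_wikis(rows, x):
    """Return the x wikis with the most events (assumption: ranked on the given rows)."""
    events_per_wiki = Counter(wiki for wiki, _, _ in rows)
    return {wiki for wiki, _ in events_per_wiki.most_common(x)}


def growth_rows(prev_agg, new_agg):
    """Join both aggregates and compute the growth of every metric."""
    growths = {}
    for key in prev_agg.keys() | new_agg.keys():
        prev = prev_agg.get(key, 0)  # coalesce missing values to 0 (assumption)
        new = new_agg.get(key, 0)
        growths[key] = (new - prev) / (prev if prev else 1)  # avoid division by 0
    return growths


def check(prev_rows, new_rows, x, min_growth, max_growth, max_errors):
    """Return the error rows to report, or an empty dict if the snapshot validates."""
    wikis = top_wikis(new_rows, x)
    prev_agg = {k: v for k, v in aggregate(prev_rows).items() if k[0] in wikis}
    new_agg = {k: v for k, v in aggregate(new_rows).items() if k[0] in wikis}
    growths = growth_rows(prev_agg, new_agg)
    errors = {k: g for k, g in growths.items() if not (min_growth <= g <= max_growth)}
    error_ratio = len(errors) / len(growths) if growths else 0.0
    return errors if error_ratio > max_errors else {}


# Made-up sample data: "delete" events shrink by half between snapshots.
prev = [("enwiki", "create", True)] * 100 + [("enwiki", "delete", True)] * 10
new = [("enwiki", "create", True)] * 110 + [("enwiki", "delete", True)] * 5
reported = check(prev, new, x=1, min_growth=-0.01, max_growth=1.0, max_errors=0.05)
print(reported)
```

In this sample, the create metric grows by 10% and passes, while the delete metric falls outside the accepted range. One error row out of two metric rows exceeds the accepted error ratio, so the error is reported.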
What do we check?
After some trial and error, we currently use:
- X = 50 -- The top 50 wikis by number of events computed for the given dataset. Smaller wikis might have too much variability for us to check events.
- minGrowth = -0.01 -- We accept metrics showing a very small loss (less than 1%). This covers the cases of data being deleted for PII reasons.
- maxGrowth = 1.0 -- We accept up to 100% growth in metrics. This threshold is the one that is strongly correlated with
- maxErrors = 0.05 -- While we'd rather accept only a very small number of rows in error, the page event-entity problem described above prevents us from lowering this threshold. For X = 50 and 4 event types (as in pages), we expect 200 metric rows, and we accept 10 being in error.
- There are special cases where deletions are manually enforced to prevent PII leaking. Those cases are rare enough not to trigger errors with the thresholds we use (see the Thresholds definition section).
- The page_is_redirect dimension is a special case here, as it is not historified (we don't have historical values yet, as that involves parsing the whole revision-text history). Because it is not historified but still reported back in time (we propagate it backwards), the measure is not growth but variability, and it can be negative as well as positive. For this special case we use -maxGrowth as the lower threshold instead of minGrowth.
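As a rough Python illustration of the thresholds and of the page_is_redirect special case, the sketch below encodes the values listed above and reproduces the "200 metric rows, 10 in error" arithmetic. The function names, and the way the dimension is passed around, are hypothetical and not taken from the job's code.

```python
X = 50
MIN_GROWTH = -0.01
MAX_GROWTH = 1.0
MAX_ERRORS = 0.05


def growth_bounds(dimension=None):
    """Return the (lower, upper) accepted growth for a metric."""
    # page_is_redirect measures variability, so its lower bound mirrors the upper one.
    lower = -MAX_GROWTH if dimension == "page_is_redirect" else MIN_GROWTH
    return lower, MAX_GROWTH


def is_error(growth, dimension=None):
    """True if the growth falls outside the accepted range."""
    lower, upper = growth_bounds(dimension)
    return not (lower <= growth <= upper)


event_types = 4  # as for the page event entity
metric_rows = X * event_types
tolerated_error_rows = round(metric_rows * MAX_ERRORS)

print(metric_rows, tolerated_error_rows)
print(is_error(-0.5), is_error(-0.5, dimension="page_is_redirect"))
```

A 50% drop is an error for a regular metric but is tolerated for page_is_redirect, where only changes beyond 100% in either direction are flagged.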