This page describes the mediawiki history reduced dataset. It lives in Analytics' Hadoop cluster and in the public Druid cluster, and is accessible via the Hive/Beeline external table wmf.mediawiki_history_reduced. It is a transformation of the mediawiki history dataset that makes it smaller and reshapes it so that Druid can quickly answer AQS Wikistats 2 queries (see Analytics/Systems/Cluster/Mediawiki history reduced algorithm). Like its parent, this dataset is updated every month: a new snapshot=YYYY-MM partition is added to Hive, and a new datasource mediawiki_history_reduced_YYYY_MM is added to Druid. Note that snapshots are NOT incremental, so you should always specify one when querying the table. If you don't know how to access this dataset, see Analytics/Data access.
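The naming convention above (snapshot=YYYY-MM in Hive, mediawiki_history_reduced_YYYY_MM in Druid) can be sketched in Python as follows. This is only an illustration: the helper names druid_datasource and hive_partition_predicate are hypothetical and not part of any Analytics tooling.

```python
import re

# A snapshot is expected to look like YYYY-MM, e.g. "2018-06".
_SNAPSHOT_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def _check_snapshot(snapshot: str) -> str:
    """Raise ValueError unless the snapshot has the YYYY-MM shape."""
    if not _SNAPSHOT_RE.fullmatch(snapshot):
        raise ValueError(f"expected a YYYY-MM snapshot, got {snapshot!r}")
    return snapshot


def druid_datasource(snapshot: str) -> str:
    """Hypothetical helper: Druid datasource name for a given snapshot."""
    return "mediawiki_history_reduced_" + _check_snapshot(snapshot).replace("-", "_")


def hive_partition_predicate(snapshot: str) -> str:
    """Hypothetical helper: WHERE-clause fragment selecting one Hive partition."""
    return f"snapshot = '{_check_snapshot(snapshot)}'"


print(druid_datasource("2018-06"))
print(hive_partition_predicate("2018-06"))
```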


You can get the canonical version of the schema by running describe wmf.mediawiki_history_reduced; from the beeline command line.

Note that the snapshot field is a Hive partition that maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to its own snapshot date, you should always pick a single snapshot in the where clause of your query.
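As a sketch of how one might pick that single snapshot, the Python snippet below selects the most recent one from a list of partition strings in the key=value form that Hive's SHOW PARTITIONS prints. The function name latest_snapshot and the sample input are assumptions made for illustration.

```python
from typing import Iterable


def latest_snapshot(partition_lines: Iterable[str]) -> str:
    """Return the most recent YYYY-MM snapshot among 'snapshot=YYYY-MM' lines."""
    snapshots = []
    for line in partition_lines:
        key, sep, value = line.strip().partition("=")
        # Ignore anything that is not a snapshot partition line.
        if key == "snapshot" and sep and value:
            snapshots.append(value)
    if not snapshots:
        raise ValueError("no snapshot partitions found")
    # YYYY-MM strings sort chronologically when compared as plain strings.
    return max(snapshots)


# Made-up sample input, shaped like the output of SHOW PARTITIONS.
sample_partitions = ["snapshot=2018-04", "snapshot=2018-06", "snapshot=2018-05"]
print(latest_snapshot(sample_partitions))
```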

| col_name | data_type | comment |
| --- | --- | --- |
| project | string | The project this event belongs to (for instance en.wikipedia or wikidata) |
| event_entity | string | revision, user or page |
| event_type | string | create, move, delete, etc., plus specific digest types. See #Event_types in the docs for a detailed explanation |
| event_timestamp | string | When this event occurred |
| user_id | string | Identifier of the user performing the event: user_id if the user is registered, user_text (the IP) otherwise |
| user_type | string | anonymous, group_bot, name_bot or user |
| page_id | bigint | The page_id of the event |
| page_namespace | int | The page namespace of the event |
| page_type | string | content or non_content, depending on whether the namespace is in the content space |
| other_tags | array<string> | Can contain: deleted (and deleted_day, deleted_month, deleted_year if deleted within the given time period), reverted and revert (for revisions), self_created (for users), user_first_24_hours if a revision is made during the first 24 hours after a user's registration, redirect (for pages) |
| text_bytes_diff | bigint | The text-bytes difference of the event (or the sum in case of digests) |
| text_bytes_diff_abs | bigint | The absolute value of the text-bytes difference for the event (or the sum in case of digests) |
| revisions | bigint | 1 if the event entity is revision, or the sum of revisions in case of digests |
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
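For readers who process rows of this table in Python, the schema can be mirrored by a small dataclass. This is a minimal sketch: the class name ReducedEvent, the from_row helper, the choice of which fields may be null, and the sample values (including the timestamp format) are all assumptions, not something the dataset documentation specifies.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Sequence


@dataclass
class ReducedEvent:
    """One row of wmf.mediawiki_history_reduced, in schema column order."""
    project: str
    event_entity: str              # revision, user or page
    event_type: str
    event_timestamp: str
    user_id: str
    user_type: str                 # anonymous, group_bot, name_bot or user
    page_id: Optional[int]         # assumed nullable (e.g. for user events)
    page_namespace: Optional[int]  # assumed nullable
    page_type: Optional[str]       # assumed nullable
    other_tags: List[str]
    text_bytes_diff: Optional[int]
    text_bytes_diff_abs: Optional[int]
    revisions: Optional[int]
    snapshot: str

    def has_tag(self, tag: str) -> bool:
        """True if other_tags contains the given tag."""
        return tag in self.other_tags


COLUMNS = [f.name for f in fields(ReducedEvent)]


def from_row(row: Sequence) -> ReducedEvent:
    """Build a ReducedEvent from a tuple of values in schema column order."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    return ReducedEvent(**dict(zip(COLUMNS, row)))


# Made-up sample row, for illustration only.
sample = from_row((
    "en.wikipedia", "revision", "create", "2018-06-01 12:00:00.0",
    "12345", "user", 42, 0, "content", ["revert"], -120, 120, 1, "2018-06",
))
print(sample.project, sample.has_tag("revert"))
```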

Important Fields

Because the history data is denormalized, you must filter by event_entity to avoid mixing incompatible data.

Similarly, filtering by event_type may be useful or required, depending on the analysis.
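To make these filters concrete, here is a hedged Python sketch that builds a HiveQL query pinned to one snapshot, one event_entity and one event_type. The function name edits_per_project_query and the aggregation it performs (summing revisions per project) are illustrative assumptions; only the table and column names come from the schema.

```python
import re

TABLE = "wmf.mediawiki_history_reduced"
_SNAPSHOT_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")
_WORD_RE = re.compile(r"[a-z_]+")


def edits_per_project_query(snapshot: str,
                            event_entity: str = "revision",
                            event_type: str = "create") -> str:
    """Build a HiveQL string that always filters on the three key fields."""
    if not _SNAPSHOT_RE.fullmatch(snapshot):
        raise ValueError(f"expected a YYYY-MM snapshot, got {snapshot!r}")
    for value in (event_entity, event_type):
        # Only accept simple lowercase identifiers, so the values are safe to inline.
        if not _WORD_RE.fullmatch(value):
            raise ValueError(f"unexpected filter value: {value!r}")
    return (
        "SELECT project, SUM(revisions) AS revisions\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"
        f"  AND event_entity = '{event_entity}'\n"
        f"  AND event_type = '{event_type}'\n"
        "GROUP BY project"
    )


print(edits_per_project_query("2018-06"))
```

The resulting string could then be pasted into Beeline or passed to whichever Hive client you use.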

| Entity | Event type | Meaning |
| --- | --- | --- |
| revision | create | When a revision is created, that is, when an edit happens. |
| page | create | When the first edit to a page is done. |
| page | move | When moving a page, changing its title. |
| page | delete | When deleting a page (no occurrence as of now due to a bug in history reconstruction). |
| page | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-page activity level filtering. |
| page | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-page activity level filtering. |
| user | create | When a new user is registered. |
| user | rename | When the name of a user is changed. |
| user | altergroups | When the groups (rights) of a user are changed. |
| user | alterblocks | When the blocks of a user are changed. |
| user | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-user activity level filtering. |
| user | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-user activity level filtering. |
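The entity and event type combinations listed above can be captured in a small lookup, which is handy for validating filters before running a query. This is a sketch only: the names EVENT_TYPES, DIGEST_TYPES, is_valid_event and is_digest are hypothetical, and the mapping simply transcribes the table.

```python
# Event types per entity, transcribed from the table above.
EVENT_TYPES = {
    "revision": {"create"},
    "page": {"create", "move", "delete", "daily_digest", "monthly_digest"},
    "user": {"create", "rename", "altergroups", "alterblocks",
             "daily_digest", "monthly_digest"},
}

# Pre-computed digest rows, as opposed to individual events.
DIGEST_TYPES = {"daily_digest", "monthly_digest"}


def is_valid_event(event_entity: str, event_type: str) -> bool:
    """True if the table lists this event_type for this event_entity."""
    return event_type in EVENT_TYPES.get(event_entity, set())


def is_digest(event_type: str) -> bool:
    """True for the daily and monthly pre-computed digest types."""
    return event_type in DIGEST_TYPES


print(is_valid_event("page", "move"))
print(is_valid_event("revision", "move"))
```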

Changes and known problems

| Date | Task | Snapshot version | Details |
| --- | --- | --- | --- |
| To come | Task T200270 | 2018-06 | Update page_namespace field to be an int; previous snapshots updated |
| 2018-06-21 | Task 192483 | 2018-06 | Make table use parquet storage instead of json (made possible thanks to druid-parquet extension); previous snapshots backfilled |
| 2018-04-01 | Task T192482 | 2018-04 | Make the table permanent and available in Hive (was temporary before) |