Data Platform/Data Lake/Edits/Mediawiki history reduced

From Wikitech

This page describes the mediawiki history reduced dataset. It lives in the Analytics Hadoop cluster and in the Druid public cluster, and is accessible via the Hive/Beeline external table wmf.mediawiki_history_reduced. It is a transformation of the mediawiki history dataset that makes it smaller and reshapes it to allow fast Druid querying for AQS (see Analytics/Systems/Cluster/Mediawiki history reduced algorithm). Like its parent, this dataset is updated every month: a new snapshot=YYYY-MM partition is added to Hive, and a new datasource mediawiki_history_reduced_YYYY_MM is added to Druid. Note that snapshots are NOT incremental (each one contains the full history), so you should always specify exactly one when querying the table. See Analytics/Data access if you don't know how to access this dataset.
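The naming convention above (snapshot=YYYY-MM in Hive, mediawiki_history_reduced_YYYY_MM in Druid) can be sketched as a small helper. The function name druid_datasource is hypothetical and only illustrates the mapping described on this page; it is not part of any existing tooling.

```python
import re


def druid_datasource(snapshot: str) -> str:
    """Map a Hive snapshot value (YYYY-MM) to the Druid datasource name.

    Hypothetical helper: it only encodes the naming convention stated on
    this page, where the dash of the snapshot becomes an underscore.
    """
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    return "mediawiki_history_reduced_" + snapshot.replace("-", "_")


print(druid_datasource("2019-04"))  # mediawiki_history_reduced_2019_04
```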

Schema

You can get the canonical version of the schema by running describe wmf.mediawiki_history_reduced; from the beeline command line.

Note that the snapshot field is a Hive partition. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the latest snapshot date, you should always pick a single snapshot in the where clause of your query.
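To illustrate why a single snapshot must be selected, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the Hive table. The rows are invented toy data, and only a few of the real columns are reproduced. The later snapshot repeats everything from the earlier one, so a query without a snapshot filter counts the same events more than once.

```python
import sqlite3

# In-memory stand-in for wmf.mediawiki_history_reduced (toy data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mediawiki_history_reduced ("
    "project TEXT, event_entity TEXT, event_type TEXT, "
    "revisions INTEGER, snapshot TEXT)"
)

# Snapshot 2019-04 contains the two revisions of 2019-03 plus one new one.
rows = [
    ("en.wikipedia", "revision", "create", 1, "2019-03"),
    ("en.wikipedia", "revision", "create", 1, "2019-03"),
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
]
conn.executemany(
    "INSERT INTO mediawiki_history_reduced VALUES (?, ?, ?, ?, ?)", rows
)


def total_revisions(where: str, params: tuple = ()) -> int:
    """Sum the revisions column over the rows matching the WHERE clause."""
    sql = "SELECT SUM(revisions) FROM mediawiki_history_reduced WHERE " + where
    return conn.execute(sql, params).fetchone()[0]


base = "event_entity = 'revision' AND event_type = 'create'"
unfiltered = total_revisions(base)  # mixes both snapshots
single = total_revisions(base + " AND snapshot = ?", ("2019-04",))

print(unfiltered, single)  # 5 3
```

Only the second figure is meaningful: the unfiltered sum adds the 2019-03 copy of the history on top of the 2019-04 one.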

col_name data_type comment
project string The project this event belongs to (en.wikipedia or wikidata for instance)
event_entity string revision, user or page
event_type string create, move, delete, etc., plus specific digest types. Detailed explanation in the docs under #Event_types
event_timestamp string When this event occurred
user_id string The user_id of the user performing the event if the user is registered, the user_text (IP) otherwise
user_type string anonymous, group_bot, name_bot or user
page_id bigint The page_id of the event
page_namespace int The page namespace of the event
page_type string content or non_content based on namespace being in content space or not
other_tags array<string> Can contain: deleted (and deleted_day, deleted_month, deleted_year if deleted within the given time period), reverted and revert (for revisions), self_created (for users), user_first_24_hours if a revision is made during the first 24 hours after a user's registration, redirect (for pages)
text_bytes_diff bigint The text-bytes difference of the event (or sum in case of digests)
text_bytes_diff_abs bigint The absolute value of text-bytes difference for the event (or sum in case of digests)
revisions bigint 1 if the event entity is revision, or sum of revisions in case of digests
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

Important Fields

Because the history data is denormalized, filtering by event_entity is mandatory to avoid mixing incompatible data.

Similarly, filtering on event_type can or must be used, depending on the analysis.

Entity Event type Meaning
revision create When a revision is created, when an edit happens.
page create When the first edit to a page is done.
page move When moving a page, changing its title.
page delete When deleting a page (no occurrence as of now due to a bug in history reconstruction)
page daily_digest Daily pre-computation (with dimension explosion) facilitating by-page activity level filtering
page monthly_digest Monthly pre-computation (with dimension explosion) facilitating by-page activity level filtering
user create When a new user is registered.
user rename When the name of a user is changed.
user altergroups When the groups (rights) of a user are changed.
user alterblocks When the blocks of a user are changed.
user daily_digest Daily pre-computation (with dimension explosion) facilitating by-user activity level filtering
user monthly_digest Monthly pre-computation (with dimension explosion) facilitating by-user activity level filtering
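As a sketch of how the entity and event type filters combine with the snapshot filter, the helper below builds a HiveQL query string and rejects entity/type pairs that do not appear in the table above. The function name build_count_query and the VALID_EVENT_TYPES mapping are hypothetical; the mapping is transcribed from this page and may not match the live data exactly.

```python
import re

# Entity -> event types, as listed in the table on this page (assumption:
# the digest rows belong to the page and user entities respectively).
VALID_EVENT_TYPES = {
    "revision": {"create"},
    "page": {"create", "move", "delete", "daily_digest", "monthly_digest"},
    "user": {"create", "rename", "altergroups", "alterblocks",
             "daily_digest", "monthly_digest"},
}


def build_count_query(snapshot: str, event_entity: str, event_type: str) -> str:
    """Return a HiveQL query counting events per project.

    The query always pins one snapshot, one event_entity and one event_type.
    """
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    if event_type not in VALID_EVENT_TYPES.get(event_entity, set()):
        raise ValueError(
            f"unknown combination: {event_entity!r} / {event_type!r}"
        )
    return (
        "SELECT project, COUNT(*) AS events\n"
        "FROM wmf.mediawiki_history_reduced\n"
        f"WHERE snapshot = '{snapshot}'\n"
        f"  AND event_entity = '{event_entity}'\n"
        f"  AND event_type = '{event_type}'\n"
        "GROUP BY project"
    )


print(build_count_query("2019-04", "page", "move"))
```

The printed statement can be pasted into beeline. Validating the inputs before interpolation keeps the three filters explicit and avoids building a query that silently mixes entities.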

Changes and known problems

Date Phab Task Snapshot version Details
2019-04 task T221824 2019-04 No schema update. Change in the way data is computed: all events belonging to deleted pages are removed from the dataset (page events before deletion, revisions). This mostly impacts the editors metric and the top metrics, as those were not filtering out deleted events. Also, a change in the way anonymous information is gathered in the original dataset means the new-pages metric now has anonymous values.
To come task T200270 2018-06 Update page_namespace field to be an int - Previous snapshots updated
2018-06-21 task T192483 2018-06 Make table use parquet storage instead of json (made possible thanks to druid-parquet extension) - previous snapshots backfilled
2018-04-01 task T192482 2018-04 Make the table permanent and available in Hive (was temporary before)