Data Platform/Data Lake/Edits/Mediawiki history reduced
This page describes the mediawiki history reduced dataset. It lives in the Analytics Hadoop cluster and in the public Druid cluster, and is accessible via the Hive/Beeline external table wmf.mediawiki_history_reduced. It is a transformation of the mediawiki history dataset that makes it smaller and reshapes it so that Druid can serve AQS queries quickly (see Analytics/Systems/Cluster/Mediawiki history reduced algorithm). Like its parent, this dataset is updated every month, with a new snapshot=YYYY-MM partition added to Hive and a new datasource mediawiki_history_reduced_YYYY_MM added to Druid. Note that snapshots are NOT incremental, so you should always specify exactly one when querying the table. See Analytics/Data access if you don't know how to access this dataset.
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_history_reduced; from the Beeline command line.
Note that the snapshot field is a Hive partition. It maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to its snapshot date, you should always pick a single snapshot in the where clause of your query.
col_name | data_type | comment |
---|---|---|
project | string | The project this event belongs to (en.wikipedia or wikidata for instance) |
event_entity | string | revision, user or page |
event_type | string | create, move, delete, etc with specific digest types. Detailed explanation in the docs under #Event_types |
event_timestamp | string | When this event occurred |
user_id | string | The user_id of the user performing the event if the user is registered, user_text (IP) otherwise |
user_type | string | anonymous, group_bot, name_bot or user |
page_id | bigint | The page_id of the event |
page_namespace | int | The page namespace of the event |
page_type | string | content or non_content based on namespace being in content space or not |
other_tags | array<string> | Can contain: deleted (and deleted_day, deleted_month, deleted_year if deleted within the given time period), reverted and revert (for revisions), self_created (for users), user_first_24_hours if a revision is made during the first 24 hours after a user's registration, redirect (for pages) |
text_bytes_diff | bigint | The text-bytes difference of the event (or sum in case of digests) |
text_bytes_diff_abs | bigint | The absolute value of text-bytes difference for the event (or sum in case of digests) |
revisions | bigint | 1 if the event entity is revision, or sum of revisions in case of digests |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
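As noted above the schema table, every query should read exactly one snapshot partition. The sketch below shows one way to enforce that when building a query string in Python. The helper name revisions_per_project_query is hypothetical and not part of any Wikimedia tooling; only the table and column names come from the schema above.

```python
import re


def revisions_per_project_query(snapshot: str) -> str:
    """Build a HiveQL query that reads a single snapshot partition.

    Hypothetical helper for illustration only; it just returns a string
    that you would then run from Beeline or another Hive client.
    """
    # Snapshots are labelled YYYY-MM; reject anything else early so that a
    # query never runs without a valid partition filter.
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    return (
        "SELECT project, SUM(revisions) AS revisions\n"
        "FROM wmf.mediawiki_history_reduced\n"
        f"WHERE snapshot = '{snapshot}'\n"
        "  AND event_entity = 'revision'\n"
        "  AND event_type = 'create'\n"
        "GROUP BY project"
    )


print(revisions_per_project_query("2019-04"))
```

The validation step is a design choice rather than a requirement of Hive: it simply makes a missing or malformed snapshot fail before any query is sent.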
Important Fields
Due to the denormalization of the history data, filtering by event_entity is mandatory to avoid mixing incompatible data.
Similarly, filtering by event_type can or must be used, depending on the analysis.
Entity | Event type | Meaning |
---|---|---|
revision | create | When a revision is created, when an edit happens. |
page | create | When the first edit to a page is done. |
page | move | When moving a page, changing its title. |
page | delete | When deleting a page (no occurrence as of now due to a bug in history reconstruction) |
page | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-page activity level filtering |
page | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-page activity level filtering |
user | create | When a new user is registered. |
user | rename | When the name of a user is changed. |
user | altergroups | When the groups (rights) of a user are changed. |
user | alterblocks | When the blocks of a user are changed. |
user | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-user activity level filtering |
user | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-user activity level filtering |
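To illustrate why the event_entity and event_type filters matter, here is a toy sketch using an in-memory SQLite table in place of Hive. The rows are invented for illustration, and real digest rows are shaped by the reduction algorithm and will look different; the only point is that raw events and digests both carry a revisions value, so summing across them counts the same edits more than once. The same where clause applies unchanged to wmf.mediawiki_history_reduced.

```python
import sqlite3

# Toy stand-in for the Hive table, with only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mediawiki_history_reduced ("
    "project TEXT, event_entity TEXT, event_type TEXT, "
    "revisions INTEGER, snapshot TEXT)"
)

# Invented rows: three edits, plus one page digest and one user digest
# that each summarise those same three edits.
rows = [
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
    ("en.wikipedia", "revision", "create", 1, "2019-04"),
    ("en.wikipedia", "page", "daily_digest", 3, "2019-04"),
    ("en.wikipedia", "user", "daily_digest", 3, "2019-04"),
]
conn.executemany(
    "INSERT INTO mediawiki_history_reduced VALUES (?, ?, ?, ?, ?)", rows
)


def total_revisions(where: str) -> int:
    """Sum the revisions column for the rows matching the given predicate."""
    sql = f"SELECT SUM(revisions) FROM mediawiki_history_reduced WHERE {where}"
    return conn.execute(sql).fetchone()[0]


# Snapshot filter only: raw events and digests are mixed together.
unfiltered = total_revisions("snapshot = '2019-04'")

# Snapshot plus entity and event type: only the raw revision events.
filtered = total_revisions(
    "snapshot = '2019-04' "
    "AND event_entity = 'revision' AND event_type = 'create'"
)

print(unfiltered, filtered)  # prints: 9 3
```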
Changes and known problems
Date | Phab task | Snapshot version | Details |
---|---|---|---|
2019-04 | task T221824 | 2019-04 | No schema update. Change in the way data is computed: all events belonging to deleted pages are removed from the dataset (page events before deletion, revisions). This mostly impacts the editors metric and the top metrics, as those were not filtering out deleted events. Also, a change in the way anonymous information is gathered in the original dataset means the new-pages metric now has anonymous values. |
To come | task T200270 | 2018-06 | Update page_namespace field to be an int - Previous snapshots updated |
2018-06-21 | task T192483 | 2018-06 | Make table use parquet storage instead of json (made possible thanks to druid-parquet extension) - previous snapshots backfilled |
2018-04-01 | task T192482 | 2018-04 | Make the table permanent and available in Hive (was temporary before) |