Analytics/Data Lake/Edits/MediaWiki history
This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_history
. It also can be downloaded as a dump from dumps.wikimedia.org. For more details on how the dataset is built, see Analytics/Systems/Data Lake/Edits/Pipeline, and more precisely Analytics/Systems/Data Lake/Edits/Pipeline/Revision augmentation and denormaliztion. A new monthly snapshot is being produced at the beginning of each month; to check whether it is ready to be queried, one can view the status of the mediawiki-history-denormalize-coord Oozie job (jobs for specific tables can be found via https://hue.wikimedia.org/oozie/list_oozie_coordinators/ ). Also visit Analytics/Data access if you don't know how to access this data set.
Sample
To help understand and discover the data, we made a sample table and loaded it into mysql on analytics-store, in the staging database. It has 2001-2017 for two wikis and 16 hours of data for all wikis, to enable testing of queries and getting a feel for the data. The column types are different because of differences between hive and mysql. For example there are no array types in mysql, so fields like event_user_groups are just a comma separated list in this sample table. You can find out more with:
mysql:research@analytics-store.eqiad.wmnet [staging]> describe mediawiki_history_sample;
Public version
This data is published as a collection of files on our dumps infrastructure: Analytics/Data_Lake/Edits/Mediawiki_history_dumps.
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_history
from the beeline command line.
Note that the snapshot
field is a Hive partition. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the latest snapshot date, you should always pick a single snapshot in the where
clause of your query. The snapshot
field is the only field that is not present in the dump version of this data set, because it's implied in the file URL.
In the documentation below, "current" refers to the time of the snapshot, and "historical" to the time of the event.
More precise information on the notion of identity-revert can be found on this page.
col_name | data_type | comment |
---|---|---|
wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
event_entity | string | revision, user or page |
event_type | string | create, move, delete, etc. Detailed explanation in the docs under #Event_types |
event_timestamp | string | When this event occurred. Formatted like "2020-01-03 08:31:12.0". |
event_comment | string | Comment related to this event, sourced from log_comment, rev_comment, etc. |
event_user_id | bigint | ID of the user that caused the event. Null if the user is anonymous or if from a revision where the user has been revision deleted. |
event_user_text_historical | string | Historical username (IP address for anonymous user) of the user that caused the event. Null for revisions where the user has been revision deleted. |
event_user_text | string | Current username of the user that caused the event. Null for anonymous users (the IP is stored in event_user_text_historical). Null for revisions where the user has been revision deleted. |
event_user_blocks_historical | array<string> | Historical blocks of the user that caused the event |
event_user_blocks | array<string> | Current blocks of the user that caused the event |
event_user_groups_historical | array<string> | Historical groups of the user that caused the event |
event_user_groups | array<string> | Current groups of the user that caused the event |
event_user_is_bot_by_historical | array<string> | Historical bot information of the user that caused the event, can contain values name or group |
event_user_is_bot_by | array<string> | Bot information of the user that caused the event, can contain values name or group |
event_user_is_created_by_self | boolean | Whether the event_user created their own account |
event_user_is_created_by_system | boolean | Whether the event_user account was created by mediawiki (eg. centralauth) |
event_user_is_created_by_peer | boolean | Whether the event_user account was created by another user |
event_user_is_anonymous | boolean | Whether the event_user is not registered. Null for revisions where the user has been revision deleted. |
event_user_registration_timestamp | string | Registration timestamp of the user that caused the event (from user table) |
event_user_creation_timestamp | string | Creation timestamp of the user that caused the event (from logging table) |
event_user_first_edit_timestamp | string | Timestamp of the first edit of the user that caused the event |
event_user_revision_count | bigint | Number of revisions made by the event_user up to the historical time in this wiki_db (only available in revision-create events so far). For revision-create events, this includes the event itself.
|
event_user_seconds_since_previous_revision | bigint | In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far) |
page_id | bigint | In revision/page events: id of the page |
page_title_historical | string | In revision/page events: historical title of the page |
page_title | string | In revision/page events: current title of the page |
page_namespace_historical | int | In revision/page events: historical namespace of the page. |
page_namespace_is_content_historical | boolean | In revision/page events: historical namespace of the page is categorized as content |
page_namespace | int | In revision/page events: current namespace of the page |
page_namespace_is_content | boolean | In revision/page events: current namespace of the page is categorized as content |
page_is_redirect | boolean | In revision/page events: whether the page is currently a redirect |
page_is_deleted | boolean | In revision/page events: Whether the page is rebuilt from a delete event |
page_creation_timestamp | string | In revision/page events: creation timestamp of the page |
page_first_edit_timestamp | string | In revision/page events: timestamp of the page's first revision. Can be before the page_creation in some restore/merge cases (see revision_is_from_before_page_creation). |
page_revision_count | bigint | In revision/page events: Cumulative revision count per page for the current page_id (only available in revision-create events so far) |
page_seconds_since_previous_revision | bigint | In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far) |
user_id | bigint | In user events: id of the user |
user_text_historical | string | In user events: historical username or IP address of the user |
user_text | string | In user events: current username or IP address of the user |
user_blocks_historical | array<string> | In user events: historical user blocks |
user_blocks | array<string> | In user events: current user blocks |
user_groups_historical | array<string> | In user events: historical user groups |
user_groups | array<string> | In user events: current user groups |
user_is_bot_by_historical | array<string> | In user events: Historical bot information of the user, can contain values name or group |
user_is_bot_by | array<string> | In user events: Bot information of the user, can contain values name or group |
user_is_created_by_self | boolean | In user events: whether the user created their own account |
user_is_created_by_system | boolean | In user events: whether the user account was created by mediawiki |
user_is_created_by_peer | boolean | In user events: whether the user account was created by another user |
user_is_anonymous | boolean | In user events: whether the user is not registered |
user_registration_timestamp | string | In user events: registration timestamp of the user. |
user_creation_timestamp | string | In user events: Creation timestamp of the user (from logging table) |
user_first_edit_timestamp | string | In user events: Timestamp of the first edit of the user |
revision_id | bigint | In revision events: id of the revision |
revision_parent_id | bigint | In revision events: id of the parent revision |
revision_minor_edit | boolean | In revision events: whether it is a minor edit or not |
revision_deleted_parts | array<string> | In revision events: Deleted parts of the revision, can contain values text, comment and user |
revision_deleted_parts_are_suppressed | boolean | In revision events: Whether the deleted parts are deleted to admin as well (visibleonly by stewarts) |
revision_text_bytes | bigint | In revision events: number of bytes of revision |
revision_text_bytes_diff | bigint | In revision events: change in bytes relative to parent revision (can be negative). |
revision_text_sha1 | string | In revision events: sha1 hash of the revision |
revision_content_model | string | In revision events: content model of revision |
revision_content_format | string | In revision events: content format of revision |
revision_is_deleted_by_page_deletion | boolean | In revision events: whether this revision has been deleted (moved to archive table) |
revision_deleted_by_page_deletion_timestamp | string | In revision events: the timestamp when the revision was deleted |
revision_is_identity_reverted | boolean | In revision events: whether this revision was reverted by another future revision |
revision_first_identity_reverting_revision_id | bigint | In revision events: id of the revision that reverted this revision |
revision_seconds_to_identity_revert | bigint | In revision events: seconds elapsed between revision posting and its revert (if there was one) |
revision_is_identity_revert | boolean | In revision events: whether this revision reverts other revisions |
revision_is_from_before_page_creation | boolean | In revision events: True if the revision timestamp is before the page creation (can happen with restore events) |
revision_tags | array<string> | In revision events: Tags associated with the revision. NULL if the revision has no tags. |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports). This field is the only one that is not present in the dumps version of the data set, because it is implied in the file URL. |
Important Fields
Due to the denormalization of the history data, filtering by event_entity
is mandatory not to mix incompatible data.
Similarly, event_types
filtering can/must be used depending of the analysis.
Entity | Event type | Meaning |
---|---|---|
revision | create | Making an edit |
page | create | Making the first edit to a page |
delete | Deleting a page | |
move | Changing a page's title | |
restore | Undeleting a page | |
user | create | The registration of a new account |
rename | Changing the name of a user | |
altergroups | The groups (rights) of a user were changed | |
alterblocks | The user was blocked or unblocked |
Changes and known problems
Date | Phab
Task |
Snapshot version | Details |
---|---|---|---|
2022-06-01 | task T309987 | 2022-05 | Changes in the production database caused sqoop to break, delays in the mw history job, and delays for all dependent datasets. |
2020-08 | task T259823 | 2020-06 | Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations |
2019-07 | task T221825 | 2019-05 | Schema changes:
Improvements in linking more user and page events into full histories, that we were not able to put together before. Dataset should in general be more consistent and accurate. |
2019-05 | task T221824 | 2019-04 | Schema changes:
Thanks to improvement made on user-history-reconstruction, linking between user-states and page/revision states is now a lot more accurate (see Task T218463). |
2018-10 | task T209031 | 2018-10 and 2018-11 | due to the refactor of mediawiki-comments into a separate table, the revision-comments are not available in the table for the two snapshots listed here. |
2017-12 | 2017-11 | For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.
Revisions happening before page-creation date (due to restore over existing page) are now correctly linked. History of pages with complex delete/restore patterns is on purpose not yet orretly worked. Will happen after Wikistats-2 release. | |
2017-06 | task T161147 | 2017-06 | Provide cumulative edit count |
2017-06 | task T170493 | 2017-06 | Use native timestamps (java.sql.Timestamp, but stillsaves them as JDBC compliant strings) |
2016-10-06 | n/a | The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. | |
2017-03-01 | n/a | Add the snapshot partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.
|