Analytics/Data Lake/Edits/MediaWiki history

From Wikitech
Jump to navigation Jump to search

This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_history. It also can be downloaded as a dump from dumps.wikimedia.org. For more details on how the dataset is built, see Analytics/Systems/Data Lake/Edits/Pipeline, and more precisely Analytics/Systems/Data Lake/Edits/Pipeline/Revision augmentation and denormaliztion. A new monthly snapshot is being produced at the beginning of each month; to check whether it is ready to be queried, one can view the status of the mediawiki-history-denormalize-coord Oozie job (jobs for specific tables can be found via https://hue.wikimedia.org/oozie/list_oozie_coordinators/ ). Also visit Analytics/Data access if you don't know how to access this data set.

Sample

To help understand and discover the data, we made a sample table and loaded it into mysql on analytics-store, in the staging database. It has 2001-2017 for two wikis and 16 hours of data for all wikis, to enable testing of queries and getting a feel for the data. The column types are different because of differences between hive and mysql. For example there are no array types in mysql, so fields like event_user_groups are just a comma separated list in this sample table. You can find out more with:

mysql:research@analytics-store.eqiad.wmnet [staging]> describe mediawiki_history_sample;

Public version

This data is published as a collection of files on our dumps infrastructure: Analytics/Data_Lake/Edits/Mediawiki_history_dumps.

Schema

You can get the canonical version of the schema by running describe wmf.mediawiki_history from the beeline command line.

Note that the snapshot field is a Hive partition. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the latest snapshot date, you should always pick a single snapshot in the where clause of your query. The snapshot field is the only field that is not present in the dump version of this data set, because it's implied in the file URL.

In the documentation below, "current" refers to the time of the snapshot, and "historical" to the time of the event.

More precise information on the notion of identity-revert can be found on this page.

col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
event_entity string revision, user or page
event_type string create, move, delete, etc. Detailed explanation in the docs under #Event_types
event_timestamp string When this event occurred. Formatted like "2020-01-03 08:31:12.0".
event_comment string Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id bigint ID of the user that caused the event. Null if the user is anonymous or if from a revision where the user has been revision deleted.
event_user_text_historical string Historical username (IP address for anonymous user) of the user that caused the event. Null for revisions where the user has been revision deleted.
event_user_text string Current username of the user that caused the event. Null for anonymous users (the IP is stored in event_user_text_historical). Null for revisions where the user has been revision deleted.
event_user_blocks_historical array<string> Historical blocks of the user that caused the event
event_user_blocks array<string> Current blocks of the user that caused the event
event_user_groups_historical array<string> Historical groups of the user that caused the event
event_user_groups array<string> Current groups of the user that caused the event
event_user_is_bot_by_historical array<string> Historical bot information of the user that caused the event, can contain values name or group
event_user_is_bot_by array<string> Bot information of the user that caused the event, can contain values name or group
event_user_is_created_by_self boolean Whether the event_user created their own account
event_user_is_created_by_system boolean Whether the event_user account was created by mediawiki (eg. centralauth)
event_user_is_created_by_peer boolean Whether the event_user account was created by another user
event_user_is_anonymous boolean Whether the event_user is not registered. Null for revisions where the user has been revision deleted.
event_user_registration_timestamp string Registration timestamp of the user that caused the event (from user table)
event_user_creation_timestamp string Creation timestamp of the user that caused the event (from logging table)
event_user_first_edit_timestamp string Timestamp of the first edit of the user that caused the event
event_user_revision_count bigint Number of revisions made by the event_user up to the historical time in this wiki_db (only available in revision-create events so far). For revision-create events, this includes the event itself.
event_user_seconds_since_previous_revision bigint In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far)
page_id bigint In revision/page events: id of the page
page_title_historical string In revision/page events: historical title of the page
page_title string In revision/page events: current title of the page
page_namespace_historical int In revision/page events: historical namespace of the page.
page_namespace_is_content_historical boolean In revision/page events: historical namespace of the page is categorized as content
page_namespace int In revision/page events: current namespace of the page
page_namespace_is_content boolean In revision/page events: current namespace of the page is categorized as content
page_is_redirect boolean In revision/page events: whether the page is currently a redirect
page_is_deleted boolean In revision/page events: Whether the page is rebuilt from a delete event
page_creation_timestamp string In revision/page events: creation timestamp of the page
page_first_edit_timestamp string In revision/page events: timestamp of the page's first revision. Can be before the page_creation in some restore/merge cases (see revision_is_from_before_page_creation).
page_revision_count bigint In revision/page events: Cumulative revision count per page for the current page_id (only available in revision-create events so far)
page_seconds_since_previous_revision bigint In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far)
user_id bigint In user events: id of the user
user_text_historical string In user events: historical username or IP address of the user
user_text string In user events: current username or IP address of the user
user_blocks_historical array<string> In user events: historical user blocks
user_blocks array<string> In user events: current user blocks
user_groups_historical array<string> In user events: historical user groups
user_groups array<string> In user events: current user groups
user_is_bot_by_historical array<string> In user events: Historical bot information of the user, can contain values name or group
user_is_bot_by array<string> In user events: Bot information of the user, can contain values name or group
user_is_created_by_self boolean In user events: whether the user created their own account
user_is_created_by_system boolean In user events: whether the user account was created by mediawiki
user_is_created_by_peer boolean In user events: whether the user account was created by another user
user_is_anonymous boolean In user events: whether the user is not registered
user_registration_timestamp string In user events: registration timestamp of the user.
user_creation_timestamp string In user events: Creation timestamp of the user (from logging table)
user_first_edit_timestamp string In user events: Timestamp of the first edit of the user
revision_id bigint In revision events: id of the revision
revision_parent_id bigint In revision events: id of the parent revision
revision_minor_edit boolean In revision events: whether it is a minor edit or not
revision_deleted_parts array<string> In revision events: Deleted parts of the revision, can contain values text, comment and user
revision_deleted_parts_are_suppressed boolean In revision events: Whether the deleted parts are deleted to admin as well (visibleonly by stewarts)
revision_text_bytes bigint In revision events: number of bytes of revision
revision_text_bytes_diff bigint In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1 string In revision events: sha1 hash of the revision
revision_content_model string In revision events: content model of revision
revision_content_format string In revision events: content format of revision
revision_is_deleted_by_page_deletion boolean In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_by_page_deletion_timestamp string In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted boolean In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id bigint In revision events: id of the revision that reverted this revision
revision_seconds_to_identity_revert bigint In revision events: seconds elapsed between revision posting and its revert (if there was one)
revision_is_identity_revert boolean In revision events: whether this revision reverts other revisions
revision_is_from_before_page_creation boolean In revision events: True if the revision timestamp is before the page creation (can happen with restore events)
revision_tags array<string> In revision events: Tags associated with the revision. NULL if the revision has no tags.
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports). This field is the only one that is not present in the dumps version of the data set, because it is implied in the file URL.

Important Fields

Due to the denormalization of the history data, filtering by event_entity is mandatory not to mix incompatible data.

Similarly, event_types filtering can/must be used depending of the analysis.

Entity Event type Meaning
revision create Making an edit
page create Making the first edit to a page
delete Deleting a page
move Changing a page's title
restore Undeleting a page
user create The registration of a new account
rename Changing the name of a user
altergroups The groups (rights) of a user were changed
alterblocks The user was blocked or unblocked

Changes and known problems

Date Phab

Task

Snapshot version Details
2023-11-01 task T350489 2023-10 The mediawiki_project_namespace_map table schema was updated. The update was backwards-compatible but the code used the raw data, superimposing its own schema. This was the right decision for performance when we created the job, but latest Spark makes this unnecessary. The job should be updated to use a select statement and future-proof itself. This has not been prioritized.
2023-09-01 task T344632 2023-08 A system user, "Global_rename_script", was given an id and caused a sizeable shift in data. The checker errors were ignored as false alarms.
2023-08-03 task T345208 2023-07 Fixes to how redacted actor ids show up on Cloud replicas caused downstream problems in MW history. Skew-join helper logic was updated and jobs were rerun. The checker still flagged a sizable difference, probably due to deleted users no longer being seen as valid actors. It was decided that we should ignore this difference and not vet the data further.
2022-06-01 task T309987 2022-05 Changes in the production database caused sqoop to break, delays in the mw history job, and delays for all dependent datasets.
2020-08 task T259823 2020-06 Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations
2019-07 task T221825 2019-05 Schema changes:
  • Addition of page_first_edit_timestamp
  • Addition of revision_is_from_before_page_creation

Improvements in linking more user and page events into full histories, that we were not able to put together before. Dataset should in general be more consistent and accurate.

2019-05 task T221824 2019-04 Schema changes:
  • Addition of event_user_is_bot_by_historical and event_user_is_bot_by as well as user_is_bot_by_historical and user_is_bot_by
  • Addition of event_user_creation_timestamp, event_user_first_timestamp as well as user_creation_timestamp, user_first_timestamp. The user registration is the one stored in the user table, the user creation one is retrieved from the logging table (user creation event), and the first-edit is the date of the user first edit, whether deleted or not.
  • Removal (BREAKING) of event_user_is_bot_by_name and user_is_bot_name (replaced by is_bot_by above)
  • Addition of page_is_deleted
  • Addition of revision_deleted_parts and revision_deleted_parts_are_suppressed
  • Rename of revision_is_deleted to revision_is_deleted_by_page_deletion, and revision_deleted_timestamp to revision_deleted_by_page_deletion_timestamp.
  • Addition of revision_tags

Thanks to improvement made on user-history-reconstruction, linking between user-states and page/revision states is now a lot more accurate (see Task T218463).

2018-10 task T209031 2018-10 and 2018-11 due to the refactor of mediawiki-comments into a separate table, the revision-comments are not available in the table for the two snapshots listed here.
2017-12 2017-11 For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.

Revisions happening before page-creation date (due to restore over existing page) are now correctly linked.

History of pages with complex delete/restore patterns is on purpose not yet orretly worked. Will happen after Wikistats-2 release.

2017-06 task T161147 2017-06 Provide cumulative edit count
2017-06 task T170493 2017-06 Use native timestamps (java.sql.Timestamp, but stillsaves them as JDBC compliant strings)
2016-10-06 n/a The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
2017-03-01 n/a Add the snapshot partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.