Data Platform/Data Lake/Edits/MediaWiki history dumps

From Wikitech

This page describes the dump of the denormalized revision, user and page history of all WMF wikis. It is computed from the MediaWiki History dataset, lives in the Analytics Hadoop cluster, and is downloadable from MediaWiki Dumps. A new snapshot containing the full history is produced at the beginning of each month.

General Information

Content

This data set contains a historical record of revision (without text), user and page events of Wikimedia wikis since 2001. The data is denormalized, meaning that all events for user, page and revision are stored in the same schema. As a result, some fields are always null for some events (for instance, fields about pages are null in events about users). Events about users and pages have been processed to rebuild a history that is as coherent as possible in terms of user renames and page moves (see Page and user history reconstruction). Some data has also been precomputed to facilitate analyses, such as edit counts per user and per page, reverting and reverted revisions, and more.

Updates

The data set is updated monthly, around the end of the month's first week. Each update contains a full dump from 2001 (the beginning of MediaWiki time) up to the current month. The reason for this is the underlying data, the MediaWiki databases. Every time a user is renamed, a revision is reverted, a page is moved, etc., the existing related records in the logging table are updated accordingly, so an event triggered today may change what that table says about events from 10 years ago. Because the logging table is the basis of the MediaWiki history reconstruction process, incremental downloads of these dumps may generate inconsistent data. Consider using [[1]] for real-time updates on MediaWiki changes (API docs).

Versioning

Each update is named after the last month it covers, in YYYY-MM format. For example, if the dump spans from 2001 to August 2019 (included), it is named 2019-08 even though it is released in the first days of September 2019. There is a folder for each available month at the root of the download URL, and for storage reasons only the last two versions are available. This should not be a problem, as every version contains the whole historical dataset.
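
As a small illustration of the naming rule, the sketch below derives a version label from a release date. It assumes the release always happens during the month that follows the last covered month; the function name snapshot_version is hypothetical and not part of any dump tooling.

```python
from datetime import date


def snapshot_version(release_date: date) -> str:
    """Return the YYYY-MM label of the month preceding release_date.

    Assumption: the dump is released during the month that follows
    the last month it covers.
    """
    year, month = release_date.year, release_date.month
    if month == 1:
        # A January release covers data up to December of the previous year.
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"


print(snapshot_version(date(2019, 9, 3)))
```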

Partitioning

The data is organized by wiki and time range, so it can be downloaded for a single wiki (or set of wikis). The time split is necessary to keep file sizes manageable. There are 3 different time range splits: monthly, yearly and all-time. Very big wikis are partitioned monthly, medium wikis are partitioned yearly, and small wikis are dumped in one single file. This keeps files no larger than ~2GB while avoiding a very large number of files.

  • Wikis partitioned monthly: wikidatawiki, commonswiki, enwiki.
  • Wikis partitioned yearly: dewiki, frwiki, eswiki, itwiki, ruwiki, jawiki, viwiki, zhwiki, ptwiki, enwiktionary, plwiki, nlwiki, svwiki, metawiki, arwiki, shwiki, cebwiki, mgwiktionary, fawiki, frwiktionary, ukwiki, hewiki, kowiki, srwiki, trwiki, loginwiki, huwiki, cawiki, nowiki, mediawikiwiki, fiwiki, cswiki, idwiki, rowiki, enwikisource, frwikisource, ruwiktionary, dawiki, bgwiki, incubatorwiki, enwikinews, specieswiki, thwiki.
  • Wikis in one single file: all the others.
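
The partitioning rule above can be sketched as a function that lists the time-range labels to expect for a wiki in a given snapshot. This is only an illustration: the names MONTHLY_WIKIS, YEARLY_WIKIS and time_ranges are hypothetical, the yearly set is deliberately truncated, and the sketch assumes every range since 2001 exists, which may not hold for wikis created later.

```python
MONTHLY_WIKIS = {"wikidatawiki", "commonswiki", "enwiki"}
# Only a few of the yearly wikis are listed here to keep the sketch short.
YEARLY_WIKIS = {"dewiki", "frwiki", "eswiki", "itwiki", "ruwiki", "ptwiki"}


def time_ranges(wiki: str, version: str, first_year: int = 2001) -> list[str]:
    """List the time-range labels expected for a wiki in a YYYY-MM snapshot."""
    last_year, last_month = (int(part) for part in version.split("-"))
    if wiki in MONTHLY_WIKIS:
        # One label per month, stopping at the snapshot month.
        return [
            f"{year:04d}-{month:02d}"
            for year in range(first_year, last_year + 1)
            for month in range(1, 13)
            if (year, month) <= (last_year, last_month)
        ]
    if wiki in YEARLY_WIKIS:
        # One label per year, including the (partial) snapshot year.
        return [f"{year:04d}" for year in range(first_year, last_year + 1)]
    # Every other wiki is dumped in a single file.
    return ["all-time"]


print(time_ranges("cawikinews", "2019-12"))
print(time_ranges("ptwiki", "2019-12")[-3:])
print(time_ranges("enwiki", "2019-08")[-3:])
```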

File format

The file format is tab-separated-value (TSV) instead of JSON in order to reduce the file sizes (JSON repeats field names for every record). Most fields of the schema are atomic (integer, string, boolean...), and a few are arrays of strings.

Some details:

  • Undefined or null values are represented as empty fields, again to make the data lighter.
  • String arrays are encoded as value1,value2,...,valueN, with commas escaped inside values.
  • In text fields, carriage returns, line feeds and tabs are escaped with a \ to keep a valid TSV format.
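
A minimal sketch of how these conventions might be decoded follows. The exact escape sequences are an assumption: the sketch supposes that tabs, line feeds and carriage returns appear as the two-character sequences \t, \n and \r, and that commas inside array values appear as \,. The helper names decode_field and decode_array are hypothetical. Note that an empty field cannot distinguish a null array from an empty one; the sketch returns an empty list in both cases.

```python
from typing import Optional


def decode_field(raw: str) -> Optional[str]:
    """Return None for an empty field, else the text with escapes undone."""
    if raw == "":
        return None
    # Assumed escape sequences; adjust if the files use a different scheme.
    replacements = {"n": "\n", "r": "\r", "t": "\t"}
    out = []
    i = 0
    while i < len(raw):
        char = raw[i]
        if char == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Unknown escapes (e.g. an escaped comma) keep the escaped character.
            out.append(replacements.get(nxt, nxt))
            i += 2
        else:
            out.append(char)
            i += 1
    return "".join(out)


def decode_array(raw: str) -> list[str]:
    """Split a string-array field on commas that are not backslash-escaped."""
    if raw == "":
        return []
    items, current = [], []
    i = 0
    while i < len(raw):
        char = raw[i]
        if char == "\\" and i + 1 < len(raw):
            current.append(raw[i + 1])  # keep the escaped character literally
            i += 2
        elif char == ",":
            items.append("".join(current))
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    items.append("".join(current))
    return items


line = "enwiki\trevision\tcreate\t\tfixed typo\\, again"
print([decode_field(field) for field in line.split("\t")])
print(decode_array("sysop,bot"))
```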

The files are compressed with Bzip2 because it is widely used, is free software, and has a high compression rate. Note that with Bzip2, you can concatenate several compressed files and treat the result as a single Bzip2 file.
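
The concatenation property can be checked with Python's standard bz2 module, whose decompress function accepts multi-stream input. The two rows below are made-up sample data, not real dump content.

```python
import bz2

# Two independently compressed pieces, standing in for two dump files.
part_1 = bz2.compress(b"enwiki\trevision\tcreate\n")
part_2 = bz2.compress(b"enwiki\tpage\tmove\n")

# Byte-level concatenation of the two Bzip2 streams.
combined = part_1 + part_2

# bz2.decompress reads every stream in the input and returns all the data.
text = bz2.decompress(combined).decode("utf-8")
print(text.splitlines())
```

The same holds when reading from disk: bz2.open also handles files made of several concatenated streams.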

Directory structure

When choosing a file (or set of files) to download, the URL should look like this:

/<version>/<wiki>/<version>.<wiki>.<time range>.tsv.bz2

Where

  • <version> is the YYYY-MM formatted snapshot, e.g. 2019-08;
  • <wiki> is the wiki database name, e.g. enwiki or commonswiki;
  • <time range> is either YYYY-MM for big wikis, YYYY for medium wikis, or all-time for the rest (see partitioning above).

Examples of dump files:

  • /2019-12/wikidatawiki/2019-12.wikidatawiki.2019-05.tsv.bz2
  • /2019-12/ptwiki/2019-12.ptwiki.2018.tsv.bz2
  • /2019-12/cawikinews/2019-12.cawikinews.all-time.tsv.bz2
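
The path template can be turned into a small helper, sketched below. The function name dump_path and the validation patterns are illustrative assumptions; the returned path is relative to the root of the download URL, which is not hard-coded here.

```python
import re

VERSION_RE = re.compile(r"^\d{4}-\d{2}$")
TIME_RANGE_RE = re.compile(r"^(\d{4}-\d{2}|\d{4}|all-time)$")


def dump_path(version: str, wiki: str, time_range: str) -> str:
    """Build the path of one dump file, relative to the download root."""
    if not VERSION_RE.match(version):
        raise ValueError(f"version must be YYYY-MM, got {version!r}")
    if not TIME_RANGE_RE.match(time_range):
        raise ValueError(f"unexpected time range {time_range!r}")
    return f"/{version}/{wiki}/{version}.{wiki}.{time_range}.tsv.bz2"


print(dump_path("2019-12", "ptwiki", "2018"))
```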

Technical Documentation

Note: In the documentation below, "current" refers to the time of the snapshot, and "historical" to the time of the event. A subpage lists answers to some frequently asked questions: Analytics/Data Lake/Edits/Mediawiki history dumps/FAQ.

Access

The easiest way to play with dumps is to use PAWS. See these example notebooks.

You can access the dumps through Toolforge. If you have a Cloud VPS instance, you can add the mount_nfs role to get the /public/dumps/public mount. But you also need to enable the mount server-side, see this patch for example. See Portal:Data_Services/Admin/Runbooks/Enable_NFS_for_a_project for full details.

Schema overview

The dataset contains many fields (74 in the schema details below), but they follow a structure that helps make sense of them. The fields can be divided into 5 classes:

  1. event global fields -- They are used on every event of the dataset (wiki_db, event_entity, event_type, event_timestamp, event_comment).
  2. event user fields -- They provide information on the user having performed the event. They are set for all events in the dataset except when denormalizing user data has failed.
  3. page fields -- They provide information about the page the event applies to. They are set for page events (event_entity = 'page') and revision events (event_entity = 'revision').
  4. user fields -- They provide information about the user the event applies to. They are set for user events only (event_entity = 'user').
  5. revision fields -- They provide information about the revision the event applies to. They are set for revision events only (event_entity = 'revision').

Note: Except for the event global fields, whose prefix is not consistent, all fields have names prefixed with their field class.

Important fields: event_entity and event_type

Because user, page and revision events share the same dataset (it is denormalized), you need to filter by event_entity, and possibly also by event_type, to avoid mixing incompatible data.

| Entity | Event type | Meaning |
|---|---|---|
| revision | create | Editing a page |
| page | create | Creating a page |
| page | create-page | Page creation according to the logging table [note 1 below] |
| page | delete | Deleting a page |
| page | move | Changing a page's title |
| page | restore | Undeleting a page |
| page | merge | Merging revisions from another page [note 2 below] |
| user | create | Registering of a new account |
| user | rename | Changing the name of a user |
| user | altergroups | Changing the groups (rights) of a user |
| user | alterblocks | Blocking/unblocking a user |

  • note 1: Establishing exactly when a page was created is not simple. The logging table has a record for page creation, and we expose this in our datasets as a "create-page" event. However, the first revision of some pages is dated *before* this logging table entry. Therefore, we decided to use the first revision as the "create" event. You can follow our logic at PageHistoryBuilder #L778 and at PageEventBuilder #L180.
  • note 2: We don't process merges much, and the documentation is sparse: https://www.mediawiki.org/wiki/Manual:Log_actions
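
A minimal sketch of such filtering on raw TSV lines is shown below. It assumes that the columns appear in the order of the schema table, so that wiki_db, event_entity and event_type are the first three fields; this should be verified against the actual files. The sample rows and the helper names count_events and edits_only are made up for illustration.

```python
from collections import Counter

# Assumed column positions (schema order); verify against the actual files.
WIKI_DB, EVENT_ENTITY, EVENT_TYPE = 0, 1, 2

sample_lines = [
    "enwiki\trevision\tcreate\n",
    "enwiki\tpage\tmove\n",
    "enwiki\tuser\trename\n",
    "enwiki\trevision\tcreate\n",
]


def count_events(lines):
    """Count events per (event_entity, event_type) pair."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[(fields[EVENT_ENTITY], fields[EVENT_TYPE])] += 1
    return counts


def edits_only(lines):
    """Yield the split fields of revision-create events, i.e. edits."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[EVENT_ENTITY] == "revision" and fields[EVENT_TYPE] == "create":
            yield fields


print(count_events(sample_lines))
print(sum(1 for _ in edits_only(sample_lines)))
```

Splitting on tabs directly, rather than using a CSV parser with quoting enabled, avoids misreading quote characters that may appear in comments or titles.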

Schema details

| Field class | Field name | Data type | Comment |
|---|---|---|---|
| event global | wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
| event global | event_entity | string | revision, user or page |
| event global | event_type | string | create, move, delete, etc. Detailed explanation in the docs under #Event_types |
| event global | event_timestamp | string | When this event occurred |
| event global | event_comment | string | Comment related to this event, sourced from log_comment, rev_comment, etc. |
| event user | event_user_id | bigint | ID of the user that caused the event. Null if the user is anonymous or if from a revision where the user has been revision deleted. |
| event user | event_user_text_historical | string | Historical username (IP address for anonymous user) of the user that caused the event. Null for revisions where the user has been revision deleted. |
| event user | event_user_text | string | Current username of the user that caused the event. Null for anonymous users (the IP is stored in event_user_text_historical). Null for revisions where the user has been revision deleted. |
| event user | event_user_blocks_historical | array<string> | Historical blocks of the user that caused the event |
| event user | event_user_blocks | array<string> | Current blocks of the user that caused the event |
| event user | event_user_groups_historical | array<string> | Historical groups of the user that caused the event |
| event user | event_user_groups | array<string> | Current groups of the user that caused the event |
| event user | event_user_is_bot_by_historical | array<string> | Historical bot information of the user that caused the event, can contain values name or group |
| event user | event_user_is_bot_by | array<string> | Bot information of the user that caused the event, can contain values name or group |
| event user | event_user_is_created_by_self | boolean | Whether the event_user created their own account |
| event user | event_user_is_created_by_system | boolean | Whether the event_user account was created by MediaWiki (e.g. centralauth) |
| event user | event_user_is_created_by_peer | boolean | Whether the event_user account was created by another user |
| event user | event_user_is_anonymous | boolean | Whether the event_user is not registered, using the old way that surfaced the IP publicly. True for revisions where the user has been revision deleted, even if the user was actually registered. |
| event user | event_user_is_temporary | boolean | Whether the event_user is not registered, using the new temporary account way. True for revisions where the user has been revision deleted, even if the user was actually registered. |
| event user | event_user_is_permanent | boolean | Whether the event_user is registered. |
| event user | event_user_registration_timestamp | string | Registration timestamp of the user that caused the event (from user table) |
| event user | event_user_creation_timestamp | string | Creation timestamp of the user that caused the event (from logging table) |
| event user | event_user_first_edit_timestamp | string | Timestamp of the first edit of the user that caused the event |
| event user | event_user_revision_count | bigint | Number of revisions made by the event_user up to the historical time in this wiki_db (only available in revision-create events so far). For revision-create events, this includes the event itself. |
| event user | event_user_seconds_since_previous_revision | bigint | In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far) |
| page | page_id | bigint | In revision/page events: id of the page |
| page | page_title_historical | string | In revision/page events: historical title of the page |
| page | page_title | string | In revision/page events: current title of the page |
| page | page_namespace_historical | int | In revision/page events: historical namespace of the page |
| page | page_namespace_is_content_historical | boolean | In revision/page events: historical namespace of the page is categorized as content |
| page | page_namespace | int | In revision/page events: current namespace of the page |
| page | page_namespace_is_content | boolean | In revision/page events: current namespace of the page is categorized as content |
| page | page_is_redirect | boolean | In revision/page events: whether the page is currently a redirect |
| page | page_is_deleted | boolean | In revision/page events: whether the page is rebuilt from a delete event |
| page | page_creation_timestamp | string | In revision/page events: creation timestamp of the page |
| page | page_first_edit_timestamp | string | In revision/page events: timestamp of the page's first revision. Can be before the page_creation in some restore/merge cases (see revision_is_from_before_page_creation). |
| page | page_revision_count | bigint | In revision/page events: cumulative revision count per page for the current page_id (only available in revision-create events so far) |
| page | page_seconds_since_previous_revision | bigint | In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far) |
| user | user_id | bigint | In user events: id of the user |
| user | user_text_historical | string | In user events: historical username or IP address of the user |
| user | user_text | string | In user events: current username or IP address of the user |
| user | user_blocks_historical | array<string> | In user events: historical user blocks |
| user | user_blocks | array<string> | In user events: current user blocks |
| user | user_groups_historical | array<string> | In user events: historical user groups |
| user | user_groups | array<string> | In user events: current user groups |
| user | user_is_bot_by_historical | array<string> | In user events: historical bot information of the user, can contain values name or group |
| user | user_is_bot_by | array<string> | In user events: bot information of the user, can contain values name or group |
| user | user_is_created_by_self | boolean | In user events: whether the user created their own account |
| user | user_is_created_by_system | boolean | In user events: whether the user account was created by MediaWiki |
| user | user_is_created_by_peer | boolean | In user events: whether the user account was created by another user |
| user | user_is_anonymous | boolean | In user events: whether the user is not registered, using the old way that surfaced the IP publicly |
| user | user_is_temporary | boolean | In user events: whether the user is not registered, using the new temporary account way |
| user | user_is_permanent | boolean | In user events: whether the user is registered |
| user | user_registration_timestamp | string | In user events: registration timestamp of the user |
| user | user_creation_timestamp | string | In user events: creation timestamp of the user (from logging table) |
| user | user_first_edit_timestamp | string | In user events: timestamp of the first edit of the user |
| revision | revision_id | bigint | In revision events: id of the revision |
| revision | revision_parent_id | bigint | In revision events: id of the parent revision |
| revision | revision_minor_edit | boolean | In revision events: whether it is a minor edit or not |
| revision | revision_deleted_parts | array<string> | In revision events: deleted parts of the revision, can contain values text, comment and user |
| revision | revision_deleted_parts_are_suppressed | boolean | In revision events: whether the deleted parts are hidden from admins as well (visible only to stewards) |
| revision | revision_text_bytes | bigint | In revision events: number of bytes of revision |
| revision | revision_text_bytes_diff | bigint | In revision events: change in bytes relative to parent revision (can be negative) |
| revision | revision_text_sha1 | string | In revision events: sha1 hash of the revision |
| revision | revision_content_model | string | In revision events: content model of revision |
| revision | revision_content_format | string | In revision events: content format of revision |
| revision | revision_is_deleted_by_page_deletion | boolean | In revision events: whether this revision has been deleted (moved to archive table) |
| revision | revision_deleted_by_page_deletion_timestamp | string | In revision events: the timestamp when the revision was deleted |
| revision | revision_is_identity_reverted | boolean | In revision events: whether this revision was reverted by another future revision |
| revision | revision_first_identity_reverting_revision_id | bigint | In revision events: id of the revision that reverted this revision |
| revision | revision_seconds_to_identity_revert | bigint | In revision events: seconds elapsed between revision posting and its revert (if there was one) |
| revision | revision_is_identity_revert | boolean | In revision events: whether this revision reverts other revisions |
| revision | revision_is_from_before_page_creation | boolean | In revision events: true if the revision timestamp is before the page creation (can happen with restore events) |
| revision | revision_tags | array<string> | In revision events: tags associated to the revision |

Code examples
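
As one illustration of the "historical" versus "current" field pairs in the schema, the sketch below flags events whose page has been moved since the event happened, by comparing the title and namespace at event time with their values at snapshot time. It assumes each row has already been parsed into a dict keyed by field name; the function name moved_since_event and the sample rows are hypothetical.

```python
def moved_since_event(row: dict) -> bool:
    """Return True if the page title or namespace changed after the event."""
    if row.get("event_entity") not in ("revision", "page"):
        # Page fields are only set for page and revision events.
        return False
    return (
        row.get("page_title_historical") != row.get("page_title")
        or row.get("page_namespace_historical") != row.get("page_namespace")
    )


# Made-up rows, reduced to the fields used by this sketch.
rows = [
    {
        "event_entity": "revision",
        "page_title_historical": "Old_title",
        "page_title": "New_title",
        "page_namespace_historical": 0,
        "page_namespace": 0,
    },
    {
        "event_entity": "revision",
        "page_title_historical": "Same_title",
        "page_title": "Same_title",
        "page_namespace_historical": 0,
        "page_namespace": 0,
    },
    {"event_entity": "user", "user_text": "Example"},
]

print([moved_since_event(row) for row in rows])
```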

Changes and known problems

| Date | PhabTask | Snapshot version | Details |
|---|---|---|---|
| 2020-01 | | 2019-12 | Initial release |