Analytics/Data Lake/Edits/MediaWiki history

From Wikitech

This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive table wmf.mediawiki_history.

A new snapshot covering all of history is generated from the source data each month. For more details on the process, see Analytics/Systems/Data Lake/Edits/Pipeline, and more precisely Analytics/Systems/Data Lake/Edits/Pipeline/Revision augmentation and denormalization.
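Because each monthly snapshot covers all of history, queries should always pin a single snapshot partition so they do not scan or mix multiple versions. The following is a minimal sketch (not from this page) of building such a query in Python; the column names `snapshot`, `wiki_db`, `event_entity`, `event_type`, and `event_timestamp` are assumed from the dataset's documented schema, and `enwiki`/`2023-10` are illustrative values.

```python
def history_query(snapshot: str, wiki_db: str) -> str:
    """Return a Hive/Spark SQL query for one wiki in one snapshot.

    `snapshot` is a monthly partition value such as '2023-10';
    `wiki_db` is a wiki database name such as 'enwiki'.
    """
    return (
        "SELECT event_entity, event_type, event_timestamp "
        "FROM wmf.mediawiki_history "
        f"WHERE snapshot = '{snapshot}' "  # always pin one snapshot
        f"AND wiki_db = '{wiki_db}' "
        "AND event_entity = 'revision' "
        "LIMIT 10"
    )

print(history_query("2023-10", "enwiki"))
```

The query string can then be run with any Hive or Spark SQL client available on the Analytics cluster.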

Public version

This data is published as a collection of files on our dumps infrastructure: Analytics/Data_Lake/Edits/Mediawiki_history_dumps.

Schema

For schema documentation, see the entry in DataHub.

Changes and known problems

Date | Phab Task | Snapshot version | Details
2023-11-01 task T350489 2023-10 The mediawiki_project_namespace_map table schema was updated. The update was backwards-compatible, but the code used the raw data, superimposing its own schema. This was the right decision for performance when we created the job, but the latest Spark makes this unnecessary. The job should be updated to use a select statement and future-proof itself. This has not been prioritized.
2023-09-01 task T344632 2023-08 A system user, "Global_rename_script", was given an id and caused a sizeable shift in data. The checker errors were ignored as false alarms.
2023-08-03 task T345208 2023-07 Fixes to how redacted actor ids show up on Cloud replicas caused downstream problems in MW history. Skew-join helper logic was updated and jobs were rerun. The checker still flagged a sizable difference, probably due to deleted users no longer being seen as valid actors. It was decided that we should ignore this difference and not vet the data further.
2022-06-01 task T309987 2022-05 Changes in the production database caused sqoop to break, delaying the MediaWiki history job and all dependent datasets.
2020-08 task T259823 2020-06 Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations.
2019-07 task T221825 2019-05 Schema changes:
  • Addition of page_first_edit_timestamp
  • Addition of revision_is_from_before_page_creation

Improvements in linking more user and page events into full histories that we were not able to put together before. The dataset should in general be more consistent and accurate.

2019-05 task T221824 2019-04 Schema changes:
  • Addition of event_user_is_bot_by_historical and event_user_is_bot_by as well as user_is_bot_by_historical and user_is_bot_by
  • Addition of event_user_creation_timestamp and event_user_first_timestamp, as well as user_creation_timestamp and user_first_timestamp. The user registration timestamp is the one stored in the user table, the user creation timestamp is retrieved from the logging table (user-creation event), and the first-edit timestamp is the date of the user's first edit, whether deleted or not.
  • Removal (BREAKING) of event_user_is_bot_by_name and user_is_bot_name (replaced by is_bot_by above)
  • Addition of page_is_deleted
  • Addition of revision_deleted_parts and revision_deleted_parts_are_suppressed
  • Rename of revision_is_deleted to revision_is_deleted_by_page_deletion, and revision_deleted_timestamp to revision_deleted_by_page_deletion_timestamp.
  • Addition of revision_tags

Thanks to improvements made to user-history reconstruction, linking between user states and page/revision states is now a lot more accurate (see task T218463).

2018-10 task T209031 2018-10 and 2018-11 Due to the refactoring of MediaWiki comments into a separate table, revision comments are not available in the table for the two snapshots listed here.
2017-12 2017-11 For pairs of fields that give current and historical versions of a value, the fields were renamed so that _historical is appended to the historical field rather than _latest to the current one.

Revisions happening before the page-creation date (due to a restore over an existing page) are now correctly linked.

The history of pages with complex delete/restore patterns is intentionally not yet handled correctly. This will happen after the Wikistats 2 release.

2017-06 task T161147 2017-06 Provide cumulative edit count
2017-06 task T170493 2017-06 Use native timestamps (java.sql.Timestamp, but still saves them as JDBC-compliant strings)
2016-10-06 n/a The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.
2017-03-01 n/a Add the snapshot partition, allowing multiple versions of the history to be kept. Data starts flowing regularly (every month) from labs.