Data Platform/Data Lake/Edits/MediaWiki history
This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive table wmf.mediawiki_history.
A new snapshot covering all of history is generated from the source data each month. For more details on the process, see Analytics/Systems/Data Lake/Edits/Pipeline, and more specifically Analytics/Systems/Data Lake/Edits/Pipeline/Revision augmentation and denormalization.
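Because every monthly snapshot contains all of history, queries normally restrict themselves to a single snapshot. The sketch below shows one way to build such a HiveQL query string in Python. It assumes the partition column is named snapshot and that snapshot labels use the YYYY-MM form seen in the changelog on this page. The wiki_db filter column and the helper name build_history_query are assumptions made for the example; check the schema documentation before relying on them.

```python
import re

# Assumption: snapshot labels look like "2023-10" (YYYY-MM).
SNAPSHOT_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")
# Assumption: wiki database names are lowercase letters, digits and underscores.
WIKI_DB_RE = re.compile(r"[a-z0-9_]+")


def build_history_query(snapshot: str, wiki_db: str, limit: int = 10) -> str:
    """Return a HiveQL string that reads one wiki from one snapshot.

    Hypothetical helper: `snapshot` is assumed to be the partition column
    and `wiki_db` is an assumed column name for the wiki database.
    """
    if not SNAPSHOT_RE.fullmatch(snapshot):
        raise ValueError(f"expected a YYYY-MM snapshot label, got {snapshot!r}")
    if not WIKI_DB_RE.fullmatch(wiki_db):
        raise ValueError(f"unexpected wiki database name: {wiki_db!r}")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Filtering on the partition column keeps the query from scanning
    # every stored snapshot of the full history.
    return (
        "SELECT *\n"
        "FROM wmf.mediawiki_history\n"
        f"WHERE snapshot = '{snapshot}'\n"
        f"  AND wiki_db = '{wiki_db}'\n"
        f"LIMIT {limit}"
    )
```

The resulting string can be passed to whichever Hive or Spark SQL client is in use. The inputs are validated before interpolation so that a malformed label fails early instead of producing an empty or unintended result.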
Public version
This data is published as a collection of files on our dumps infrastructure: Analytics/Data_Lake/Edits/Mediawiki_history_dumps.
Schema
For schema documentation, see the entry in DataHub.
Changes and known problems
Date | Phab Task | Snapshot version | Details |
---|---|---|---|
2023-11-01 | task T350489 | 2023-10 | The mediawiki_project_namespace_map table schema was updated. The update was backwards-compatible, but the job read the raw data and superimposed its own schema. This was the right decision for performance when we created the job, but the latest Spark makes it unnecessary. The job should be updated to use a select statement and future-proof itself. This has not been prioritized. |
2023-09-01 | task T344632 | 2023-08 | A system user, "Global_rename_script", was given an id and caused a sizeable shift in data. The checker errors were ignored as false alarms. |
2023-08-03 | task T345208 | 2023-07 | Fixes to how redacted actor ids show up on Cloud replicas caused downstream problems in MW history. Skew-join helper logic was updated and jobs were rerun. The checker still flagged a sizable difference, probably due to deleted users no longer being seen as valid actors. It was decided that we should ignore this difference and not vet the data further. |
2022-06-01 | task T309987 | 2022-05 | Changes in the production database caused sqoop to break, delays in the mw history job, and delays for all dependent datasets. |
2020-08 | task T259823 | 2020-06 | Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations |
2019-07 | task T221825 | 2019-05 | Schema changes: Improvements in linking more user and page events into full histories that we were not able to put together before. The dataset should in general be more consistent and accurate. |
2019-05 | task T221824 | 2019-04 | Schema changes: Thanks to improvements made to user-history reconstruction, linking between user states and page/revision states is now a lot more accurate (see task T218463). |
2018-10 | task T209031 | 2018-10 and 2018-11 | Due to the refactor of mediawiki-comments into a separate table, revision comments are not available in the table for the two snapshots listed here. |
2017-12 | | 2017-11 | For pairs of fields that give current and historical versions of a value, the fields were renamed so that _historical is appended to the historical field rather than _latest to the current one. Revisions happening before the page-creation date (due to a restore over an existing page) are now correctly linked. The history of pages with complex delete/restore patterns is intentionally not yet handled correctly; this will happen after the Wikistats-2 release. |
2017-06 | task T161147 | 2017-06 | Provide cumulative edit count |
2017-06 | task T170493 | 2017-06 | Use native timestamps (java.sql.Timestamp, but still saves them as JDBC-compliant strings) |
2016-10-06 | n/a | | The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis. |
2017-03-01 | n/a | | Add the snapshot partition, making it possible to keep multiple versions of the history. Data starts to flow regularly (every month) from labs. |
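The 2020-06 entry above notes that some page ids are null or zero and that some seemingly unique column combinations contain duplicates. A minimal sketch of how one might check a sample of rows for both problems follows. It assumes rows have already been fetched as Python dictionaries; the column names page_id, wiki_db and revision_id, the sample data, and the helper name find_key_problems are all hypothetical and only serve the illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Mapping, Sequence, Tuple


def find_key_problems(
    rows: Iterable[Mapping[str, Any]],
    key_columns: Sequence[str],
) -> Tuple[List[Mapping[str, Any]], Dict[Tuple[Any, ...], int]]:
    """Return (rows whose page id is null or zero, key tuples seen more than once).

    Hypothetical helper: `page_id` is an assumed column name.
    """
    rows = list(rows)
    # Rows whose page id is missing (None) or zero.
    missing_page_id = [r for r in rows if r.get("page_id") in (None, 0)]
    # Count how often each candidate key occurs; a true key occurs once.
    counts = Counter(tuple(r.get(c) for c in key_columns) for r in rows)
    duplicated_keys = {key: n for key, n in counts.items() if n > 1}
    return missing_page_id, duplicated_keys


# Made-up sample rows, not real data from the table.
SAMPLE_ROWS = [
    {"wiki_db": "simplewiki", "page_id": 1, "revision_id": 10},
    {"wiki_db": "simplewiki", "page_id": 1, "revision_id": 10},
    {"wiki_db": "simplewiki", "page_id": None, "revision_id": 11},
    {"wiki_db": "simplewiki", "page_id": 0, "revision_id": 12},
]

missing, duplicates = find_key_problems(SAMPLE_ROWS, ["wiki_db", "revision_id"])
print(len(missing), duplicates)
```

Running a check like this on a sample before joining or aggregating on a presumed key helps avoid double counting.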