Data Platform/Data Lake/Edits/MediaWiki history/Revision identity reverts
This page summarizes an interaction on Slack about the definition of the notion of revision identity-revert in the mediawiki_history
dataset. This notion is at heart of 4 fields of the dataset: revision_is_identity_revert
, revision_is_identity_reverted
, revision_seconds_to_identity_revert
and revision_first_identity_reverting_revision_id
.
Definition
A revision RR is an identity-revert of a previous revision if they have the same content-hash as computed by mediawiki. This content-hash is available in the mediawiki revision
table, field rev_sha1
(see https://www.mediawiki.org/wiki/Manual:Revision_table). In this case in the mediawiki_history
dataset the revision RR will have the revision_is_identity_revert
field set to true. It is interesting to notice that multiple revisions can be reverts to a single original revision, the first one in time for the revision page to have the content-hash reused by following revisions.
The second aspect covered by the identity-revert is the one of reverted revisions. A revision RD is flagged as reverted (revision_is_identity_reverted
set to true) if it happens between two revisions (RO, RR) where RR is an identity-revert to RO (they share the same content-hash). In term of content, a reverted revision is a revision whose content has been removed in favor of a previously existing content. It is classical to see multiple revisions being reverted by a a single revert. To help with tracking analyzing revert patterns, we have added two fields referencing the first reverting revision of a reverted revision: revision_first_identity_reverting_revision_id
and revision_seconds_to_identity_revert.
Relation to Mediawiki Revert Types
Mediawiki defines 3 revert types: https://www.mediawiki.org/wiki/Manual:Reverts#Revert_types. Those revert types can be identified in the mediawiki_history
dataset using the revision_tags
field.
On July 11th 2022 Isaac Johnson (Thank you so much Isaac :) ran an analysis on May 2022 data of English Wikipedia (including namespace 0 only revisions, and excluding redirects):
+----------+----+--------+------+--------+ |mw_history|undo|rollback|manual|# edits | +----------+----+--------+------+--------+ |false |0 |0 |0 |3100382 | # edits that weren't reverts by any definition # reverts captured in mediawiki history # vast majority also show up as edit tags # but some small proportion were manual and # missed by mediawiki for some reason |true |1 |0 |0 |109548 | # straightforward undos |true |0 |1 |0 |54032 | # straightforward rollbacks |true |0 |0 |1 |49750 | # straightforward manual reverts |true |0 |0 |0 |20827 | # manual reverts found by mediawiki_history table but not identified by mediawiki # reverts not captured in mediawiki history # vast majority are undos where presumably the editor # changed something too and so generated a unique hash |false |1 |0 |0 |7711 | # undo |false |0 |0 |1 |8 | # manual |false |0 |1 |0 |2 | # rollback +----------+----+--------+------+--------+
For reference, here is the code of the analysis:
WITH reverts AS (
SELECT
revision_is_identity_revert as mw_history,
IF(ARRAY_CONTAINS(revision_tags, 'mw-undo'), 1, 0) AS undo,
IF(ARRAY_CONTAINS(revision_tags, 'mw-rollback'), 1, 0) AS rollback,
IF(ARRAY_CONTAINS(revision_tags, 'mw-manual-revert'), 1, 0) as manual
FROM wmf.mediawiki_history
WHERE
snapshot = '2022-05'
AND wiki_db = 'enwiki'
AND page_namespace = 0
AND NOT page_is_redirect
AND event_type = 'create'
AND event_entity = 'revision'
AND event_timestamp >= '2022-05-01'
AND event_timestamp < '2022-06-01'
)
SELECT
mw_history,
undo,
rollback,
manual,
COUNT(1)
FROM reverts
GROUP BY
mw_history,
undo,
rollback,
manual