Analytics/Data Lake/Edits/MediaWiki history/Revision identity reverts

From Wikitech

This page summarizes an interaction on Slack about the definition of the notion of revision identity-revert in the mediawiki_history dataset. This notion is at heart of 4 fields of the dataset: revision_is_identity_revert, revision_is_identity_reverted, revision_seconds_to_identity_revert and revision_first_identity_reverting_revision_id.

Definition

A revision RR is an identity-revert of a previous revision if they have the same content-hash as computed by mediawiki. This content-hash is available in the mediawiki revision table, field rev_sha1 (see https://www.mediawiki.org/wiki/Manual:Revision_table). In this case in the mediawiki_history dataset the revision RR will have the revision_is_identity_revert field set to true. It is interesting to notice that multiple revisions can be reverts to a single original revision, the first one in time for the revision page to have the content-hash reused by following revisions.

The second aspect covered by the identity-revert is the one of reverted revisions. A revision RD is flagged as reverted (revision_is_identity_reverted set to true) if it happens between two revisions (RO, RR) where RR is an identity-revert to RO (they share the same content-hash). In term of content, a reverted revision is a revision whose content has been removed in favor of a previously existing content. It is classical to see multiple revisions being reverted by a a single revert. To help with tracking analyzing revert patterns, we have added two fields referencing the first reverting revision of a reverted revision: revision_first_identity_reverting_revision_id and revision_seconds_to_identity_revert.

Relation to Mediawiki Revert Types

Mediawiki defines 3 revert types: https://www.mediawiki.org/wiki/Manual:Reverts#Revert_types. Those revert types can be identified in the mediawiki_history dataset using the revision_tags field.

On July 11th 2022 Isaac Johnson (Thank you so much Isaac :) ran an analysis on May 2022 data of English Wikipedia (including namespace 0 only revisions, and excluding redirects):

+----------+----+--------+------+--------+
|mw_history|undo|rollback|manual|# edits |
+----------+----+--------+------+--------+
|false     |0   |0       |0     |3100382 |  # edits that weren't reverts by any definition

# reverts captured in mediawiki history
# vast majority also show up as edit tags
# but some small proportion were manual and
# missed by mediawiki for some reason
|true      |1   |0       |0     |109548  |  # straightforward undos
|true      |0   |1       |0     |54032   |  # straightforward rollbacks
|true      |0   |0       |1     |49750   |  # straightforward manual reverts
|true      |0   |0       |0     |20827   |  # manual reverts found by mediawiki_history table but not identified by mediawiki

# reverts not captured in mediawiki history
# vast majority are undos where presumably the editor
# changed something too and so generated a unique hash
|false     |1   |0       |0     |7711    |  # undo
|false     |0   |0       |1     |8       |  # manual
|false     |0   |1       |0     |2       |  # rollback
+----------+----+--------+------+--------+

For reference, here is the code of the analysis:

WITH reverts AS (
    SELECT
      revision_is_identity_revert as mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-undo'), 1, 0) AS undo,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-rollback'), 1, 0) AS rollback,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-manual-revert'), 1, 0) as manual
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  mw_history,
  undo,
  rollback,
  manual,
  COUNT(1)
FROM reverts
GROUP BY
  mw_history,
  undo,
  rollback,
  manual