User:Milimetric/Notebook/MediaWiki History


Overview

Dataset Fact Sheet (fields: Last Updated, Latest Row Count, Update Frequency, Expected date of next update, Trusted Dataset)

Description

This dataset was built to make it easy to analyse editing activity on Wikimedia projects.

It transforms public data from the MediaWiki databases and makes sense of it by:

  • Rebuilding history for certain properties, like page titles and user groups
  • Normalising the fields that were stored differently across time
  • Computing valuable fields while processing
  • Gathering similar concepts in one place

Performing these aggregations on the raw databases can take anywhere from several hours to multiple days, because the source tables are not optimised for analytics. This dataset addresses that shortcoming by providing the data in a shape that is optimal for analysis, giving an immediate monthly view of what used to take days to compute.

High Level Process

The next sections will go into detail about the overall process and each type of transformation.

First, the history of each user is reconstructed: all the different usernames they had, how their user groups changed, the periods of time they were blocked, and so on. The results are stored in the MediaWiki User History dataset, where each user has a set of rows with (start, end) timestamps; the properties in a row are valid for that user id within that time period.
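
To make the (start, end) idea concrete, here is a minimal PySpark sketch of looking up the state a user was in at a particular moment. The table and column names (wmf.mediawiki_user_history, wiki_db, user_id, start_timestamp, end_timestamp) are assumptions based on the description above, so treat this as a sketch rather than a ready-made query.

  # Sketch: find the user-history row whose (start, end) interval covers a
  # given moment. Table and column names are assumptions for illustration.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  moment = "2019-06-01 00:00:00"
  state_at_moment = spark.sql(f"""
      SELECT *
      FROM wmf.mediawiki_user_history
      WHERE wiki_db = 'enwiki'
        AND user_id = 12345
        AND start_timestamp <= '{moment}'
        AND (end_timestamp IS NULL OR end_timestamp > '{moment}')
  """)
  state_at_moment.show()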

Next, the same is done for pages: for each page we record page moves, deletions, restorations, and so on. The results are stored in the MediaWiki Page History dataset, with the same (start, end) approach we use for users.

Finally, for each revision we join to the historically correct slice of user and page state from above (a sketch of this join follows the list below). This gives us a full picture of the history of that edit at the time it took place. We also add as much value as possible to these records, described below in "Compute Fields". The resulting rows have many columns, which can be understood by their prefixes:

  • event_ columns detail the event itself and the performer of the event
  • page_ columns detail the page that the event acted upon (blank in revision or user events)
  • user_ columns detail the user that the event acted upon (blank in revision or page events)
  • revision_ columns detail the edit being performed (blank in user or page events)
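
For illustration, the join described above might look roughly like the following PySpark sketch. The source table names and columns (a revision table with rev_page and rev_timestamp, a page history table with page_id, start_timestamp, end_timestamp) are assumptions, not the actual job's code.

  # Sketch: attach to each revision the page state that was valid when the
  # edit happened. Table and column names are assumptions for illustration.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  revisions = spark.table("wmf_raw.mediawiki_revision")     # assumed name
  page_history = spark.table("wmf.mediawiki_page_history")  # assumed name

  rev_with_page_state = revisions.join(
      page_history,
      on=[
          revisions.rev_page == page_history.page_id,
          revisions.rev_timestamp >= page_history.start_timestamp,
          (page_history.end_timestamp.isNull()
           | (revisions.rev_timestamp < page_history.end_timestamp)),
      ],
      how="left",
  )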

Rebuild History (the _historical fields)

The MediaWiki databases store only the latest state of users, pages, links, and so on. To find out what groups a user belonged to a few months ago, you'd have to backtrack through the Logging table; the same goes for the title a page had, a username, a user's blocked status, and so on. This is fine if you're looking for details on a single edit, user, or page, but it can be prohibitive in aggregate. In this dataset, we rebuild the history of some fields and make it available as columns with the _historical suffix (see the schema). These fields store the value as it was at the time of the event (event_timestamp). This blog post goes into more detail.
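
As a small, hedged example of what the _historical columns enable, the following PySpark query counts edits under each title a page has carried over time. The column names (page_title_historical, page_id, event_entity, wiki_db) follow the schema description above but should be checked against the actual schema.

  # Sketch: how many edits did a page receive under each title it has had?
  # Column names are taken from the schema description and may need checking.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  edits_per_title = spark.sql("""
      SELECT page_title_historical, COUNT(*) AS edit_count
      FROM wmf.mediawiki_history
      WHERE wiki_db = 'enwiki'
        AND event_entity = 'revision'
        AND page_id = 12345
      GROUP BY page_title_historical
      ORDER BY edit_count DESC
  """)
  edits_per_title.show()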

Normalize and Repair

As people edit and interact with Wikimedia projects, MediaWiki records these changes in the Logging table. The format it uses varies across time, and old records are not updated to the most recent format. We extract events out of the logging table, like user renames, user group changes, page deletes, and so on. For each of these events we create a stable schema and extract whatever data we can out of each different logging table row format. In cases where a record exists but has incomplete data, we label the values we fill in as inferred. And in cases where we think an event should exist but none does, we create one: say a page has title A at one point in time and title B later on; we infer that a "page move" event must have happened and fill it in with the most probable details it would have contained.
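
The page-move inference can be pictured with a rough Python sketch; the event shape and field names below are invented for the example and do not reflect the real job's schema.

  # Sketch: given consecutive recorded states of a page that disagree on
  # title and have no explicit "move" event between them, synthesize one.
  # Field names and the event shape are invented for this example.
  def infer_missing_moves(page_states):
      """page_states: dicts with 'title' and 'timestamp', ordered by time."""
      inferred = []
      for before, after in zip(page_states, page_states[1:]):
          if before["title"] != after["title"]:
              inferred.append({
                  "event_type": "move",
                  "old_title": before["title"],
                  "new_title": after["title"],
                  # best guess: the move happened no later than the first
                  # time we saw the new title
                  "timestamp": after["timestamp"],
                  "is_inferred": True,
              })
      return inferred

  states = [
      {"title": "A", "timestamp": "2018-01-01"},
      {"title": "B", "timestamp": "2018-06-01"},
  ]
  print(infer_missing_moves(states))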

Compute Fields

Some seemingly simple questions are very hard for MediaWiki to answer with its current schema. For example, how many bytes did an edit change? You would have to join the Revision table to itself, which is infeasible on large wikis. By running this large job on a distributed compute cluster, we can answer these kinds of questions across all revisions. We put the results directly next to each revision to make many new types of analysis possible. Each computed field is documented in the column documentation in the schema, but let's go into detail on a few of them here (a sketch of one such computation follows the list):

  • event_user_creation_timestamp is simple and doesn't require much computation: we just look up the user creation timestamp in the Logging table. This value often differs from the one in the user table, and it is hard to find with a join on large databases.
  • event_user_revision_count is an example of a field that we could easily adapt to your needs. Right now it simply counts all the edits this user was responsible for, similar to user_editcount, but we could easily add a filter there to exclude reverted edits.
  • revision_text_bytes_diff and revision_is_identity_reverted are more complicated to compute, which means they are often not just pieces of *other* analyses but the analysis itself. With these kinds of properties readily available, the careful user can ask more sophisticated questions.
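
To make the harder computations concrete, here is a hedged PySpark sketch of the idea behind revision_text_bytes_diff: the size delta of each edit relative to the previous revision of the same page, computed with a window function instead of a self-join. The source table and columns (rev_page, rev_timestamp, rev_len) are assumptions, and the real job may compute this differently (for example via rev_parent_id).

  # Sketch: bytes changed by each edit, relative to the previous revision of
  # the same page. Table and column names are assumptions for illustration.
  from pyspark.sql import SparkSession, Window, functions as F

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  revisions = spark.table("wmf_raw.mediawiki_revision")  # assumed name

  per_page = Window.partitionBy("rev_page").orderBy("rev_timestamp")
  with_diff = revisions.withColumn(
      "text_bytes_diff",
      F.col("rev_len") - F.coalesce(F.lag("rev_len").over(per_page), F.lit(0)),
  )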

Gather

For performance reasons, MediaWiki sometimes splits up, or normalizes, pieces of the same concept into different tables. Some user details go in the user table while others go in the logging table, revisions from deleted pages go in the archive table, and so on. This works well for MediaWiki and its job of rendering pages for everyday use, but it can hinder analysis. We counteract this by de-normalizing concepts (we normalize across time and denormalize across concepts). For example, rows in the archive and revision tables are both revisions; the archive table just holds those from deleted pages. So we add an is_deleted flag and keep all revisions in the same place; this way you can include or exclude them more easily.
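
A minimal PySpark sketch of this gathering step, with assumed table and column names, could look like the following; only the flag-and-union idea matters here.

  # Sketch: keep live and archived (deleted-page) revisions in one dataset,
  # flagged so they can be included or excluded easily. Names are assumptions.
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  live = (spark.table("wmf_raw.mediawiki_revision")        # assumed name
          .select(F.col("rev_id"), F.col("rev_timestamp"))
          .withColumn("is_deleted", F.lit(False)))

  archived = (spark.table("wmf_raw.mediawiki_archive")     # assumed name
              .select(F.col("ar_rev_id").alias("rev_id"),
                      F.col("ar_timestamp").alias("rev_timestamp"))
              .withColumn("is_deleted", F.lit(True)))

  all_revisions = live.unionByName(archived)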

Examples
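
As one hedged example, the query below counts monthly edits on a wiki while excluding edits that were later reverted, using the event_entity, event_timestamp, and revision_is_identity_reverted columns described above; treat it as a sketch and check the column types against the schema.

  # Sketch: monthly edit counts on one wiki, excluding reverted edits.
  # Column names follow the schema described above; verify before relying on it.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  monthly_edits = spark.sql("""
      SELECT substr(CAST(event_timestamp AS STRING), 1, 7) AS month,
             COUNT(*) AS edits
      FROM wmf.mediawiki_history
      WHERE wiki_db = 'enwiki'
        AND event_entity = 'revision'
        AND NOT revision_is_identity_reverted
      GROUP BY substr(CAST(event_timestamp AS STRING), 1, 7)
      ORDER BY month
  """)
  monthly_edits.show()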

Other Questions

If you have a question that this dataset does not answer, we probably need to add more fields or improve it so that your analysis is as simple as possible. The MediaWiki database schema is useful background information.

See Also