Analytics/Systems/Data Lake/Edits/Pipeline/Page and user history reconstruction

From Wikitech

This process is the second step of the Data Lake pipeline.

WMF's MediaWiki databases store the full history of revisions (and archives) for each project. However, they store only the current state of other entities such as pages and users. For example, you can retrieve the current title and namespace of a given page, but not the title and namespace that page had at some earlier point in time. Similarly, you can see the blocks a given user has right now, but not the blocks that user received over time. Having the full history of pages and users is valuable for various reasons (see the 'Use cases' section below); this page explains why and how we reconstructed it.

Why

Use cases

  • Query across all wikis on the same data set
  • Query the history of a given page or user
  • Know the namespace and title of a page at a given point in time
  • Get a "snapshot" of a wiki's pages and users at a given point in time
  • Back-fill metrics about wiki pages / users back to the beginning of wiki-time
  • Intermediate data source for the creation and population of other refined data sets
  • ...

How

The blessing and the curse: The logging table

The logging table is the reason we are able to reconstruct page and user histories. Thanks a lot, logging table! It stores lots of important information about page events (move, delete and restore) and user events (rename, block, rights), both going back to (almost) the beginning of wiki-time. However, this information is stored in a way that makes it very difficult, if not impossible, to query for analytical or research purposes:

  • The page and user IDs are missing in the majority of events. Those events carry only the page title or user name, which is not identifying by itself because pages and users get renamed all the time.
  • The information is stored in many different formats, depending on the MediaWiki version that generated the events. Namespace, for instance, is stored sometimes as a number and sometimes as localized text that varies with the project's language. The duration of a user block is stored in 7 different date formats. And many properties of pages and users are stored in different fields, depending on the age of the event.
  • Some fields are stored as a PHP-encoded blob, for which we had to write a decoder.
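To give a feel for what decoding a PHP-encoded blob involves, here is a minimal sketch in Python. It is not the decoder used by the pipeline. It assumes the blob follows the format produced by PHP's serialize() function and handles only null, boolean, integer, float, string and array values. The function names and the sample blob in the usage note are hypothetical.

```python
def php_unserialize(data: bytes):
    """Decode a subset of PHP serialize() output: N, b, i, d, s and a."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after the serialized value")
    return value


def _expect(data: bytes, pos: int, token: bytes) -> None:
    # Fail loudly on malformed input instead of guessing.
    if data[pos:pos + len(token)] != token:
        raise ValueError(f"expected {token!r} at offset {pos}")


def _parse(data: bytes, pos: int):
    """Parse one value starting at pos; return (value, next position)."""
    tag = data[pos:pos + 1]
    if tag == b"N":                          # N;
        _expect(data, pos + 1, b";")
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):            # b:1;  i:42;  d:1.5;
        _expect(data, pos + 1, b":")
        end = data.index(b";", pos + 2)
        raw = data[pos + 2:end].decode("ascii")
        if tag == b"b":
            return raw == "1", end + 1
        return (int(raw) if tag == b"i" else float(raw)), end + 1
    if tag == b"s":                          # s:<length in bytes>:"<bytes>";
        _expect(data, pos + 1, b":")
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        _expect(data, colon + 1, b'"')
        start = colon + 2
        end = start + length
        _expect(data, end, b'";')
        return data[start:end].decode("utf-8"), end + 2
    if tag == b"a":                          # a:<count>:{<key><value>...}
        _expect(data, pos + 1, b":")
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        _expect(data, colon + 1, b"{")
        pos = colon + 2
        result = {}
        for _ in range(count):
            key, pos = _parse(data, pos)
            val, pos = _parse(data, pos)
            result[key] = val
        _expect(data, pos, b"}")
        return result, pos + 1
    raise ValueError(f"unsupported type tag {tag!r} at offset {pos}")
```

For example, `php_unserialize(b'a:2:{s:6:"target";s:9:"New title";s:10:"noredirect";b:1;}')` yields a dictionary with the keys `target` and `noredirect`. String lengths are counted in bytes rather than characters, which is why the sketch works on bytes and decodes to text only once a string has been sliced out.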

What does the algorithm do?

The algorithm is a Spark-Scala job that runs on the Analytics Hadoop cluster.

  1. Read data from all wikis and process it together. The gathering of MediaWiki data is not done by this algorithm; the data is copied onto the Analytics Hadoop cluster from our databases (see the Data loading page). The algorithm can process any number of wikis together, and the resulting page and user histories are each stored as a single data set.
  2. Parse and normalize the data. Parse the data in the different formats used by different MediaWiki versions, and unify them into a normalized form. For example, parse a localized namespace name, determine which canonical namespace it corresponds to, and store its integer identifier; or parse the groups a user belongs to from a text field and store them as an array of strings.
  3. Rebuild page and user history using page title and user name. Starting from the information retrieved from the current page table and user table, the algorithm iterates over the logging events (page rename, page delete, page restore, user rename, user rights, user blocks) backwards in time and reconstructs the history chain. It takes advantage of the fact that at any point in time there can exist only one page with a given title and namespace, and only one user with a given user name. After the reconstruction, all events/states carry the corresponding page or user IDs. For more details on the algorithm and the optimization tricks we needed to find, see the Page and user history reconstruction algorithm and optimizations page.
  4. Resolve conflicts. Events are often missing from the logging table, and some may be corrupt. This causes conflicts in the history reconstruction that would violate the invariant of unique title/user name. The algorithm handles them by inferring the most likely scenario, prioritizing correctness over completeness: it is better for records to be correct and incomplete than complete and incorrect.
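The backward walk in step 3 can be illustrated with a small sketch. This is not the production Spark-Scala job. It is a single-machine Python simplification that handles only page moves, ignores namespaces, deletes and restores, and assumes timestamps are ISO-8601 strings so that string order matches time order. All names (Move, State, reconstruct) are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Move(NamedTuple):
    timestamp: str   # assumed ISO-8601, so lexical order is time order
    old_title: str
    new_title: str


class State(NamedTuple):
    page_id: int
    title: str
    start: Optional[str]   # inclusive; None means unknown (no earlier event)
    end: Optional[str]     # exclusive; None means the state is still current


def reconstruct(current_pages: Dict[str, int], moves: List[Move]) -> List[State]:
    # title -> (page_id, end of the state currently being built for that title)
    open_states: Dict[str, Tuple[int, Optional[str]]] = {
        title: (page_id, None) for title, page_id in current_pages.items()
    }
    states: List[State] = []
    for mv in sorted(moves, key=lambda m: m.timestamp, reverse=True):
        holder = open_states.pop(mv.new_title, None)
        if holder is None:
            # No known page holds the new title at this time: the event cannot
            # be linked to a page ID, so it is skipped rather than guessed.
            continue
        page_id, end = holder
        states.append(State(page_id, mv.new_title, mv.timestamp, end))
        occupant = open_states.pop(mv.old_title, None)
        if occupant is not None:
            # Titles are unique, so a page that holds the old title later on
            # (for example a redirect left behind) cannot have held it before
            # this move. Simplification: use the move time as its start.
            states.append(State(occupant[0], mv.old_title, mv.timestamp, occupant[1]))
        # Before the move, the page was known under its old title.
        open_states[mv.old_title] = (page_id, mv.timestamp)
    # States still open have no earlier event, so their start is unknown.
    for title, (page_id, end) in open_states.items():
        states.append(State(page_id, title, None, end))
    return sorted(states, key=lambda s: (s.page_id, s.start or ""))
```

Given a current page titled "C" with ID 1 and two moves (A to B, then B to C), the sketch returns three states for page 1 with titles A, B and C, each carrying the page ID that the original move events lacked. The skipped-event branch reflects the "correct and incomplete" preference of step 4 in the simplest possible way; the real conflict resolution is more involved.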

Resulting data sets

The resulting data sets are expressed as states. Each state record has a start timestamp and an end timestamp, and the values of the record apply to the time range between them (the start timestamp is inclusive and the end timestamp is exclusive). For more detailed information about both data sets, see Analytics/Data Lake/Schemas/Mediawiki page history and Analytics/Data Lake/Schemas/Mediawiki user history.
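The inclusive-start, exclusive-end convention means that at most one state of a given page or user applies at any instant, which is what makes point-in-time lookups unambiguous. The following Python sketch illustrates this. The record type and field names are hypothetical and do not reflect the actual schemas linked above, and representing an open-ended bound as None is an assumption of the sketch.

```python
from typing import List, NamedTuple, Optional


class PageState(NamedTuple):
    page_id: int
    title: str
    start_timestamp: Optional[str]   # inclusive; None = no known lower bound
    end_timestamp: Optional[str]     # exclusive; None = still current


def states_at(states: List[PageState], when: str) -> List[PageState]:
    """Return the states whose half-open range [start, end) contains `when`."""
    return [
        s for s in states
        if (s.start_timestamp is None or s.start_timestamp <= when)
        and (s.end_timestamp is None or when < s.end_timestamp)
    ]


history = [
    PageState(1, "Old title", None, "2015-01-01T00:00:00Z"),
    PageState(1, "New title", "2015-01-01T00:00:00Z", None),
]
```

With this sample history, a lookup at exactly `2015-01-01T00:00:00Z` returns only the "New title" state, because the earlier state's end timestamp is exclusive. Timestamps are compared as strings here, which works only if they share one sortable format such as ISO-8601.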