Data Platform/Data Lake/Edits/MediaWiki history dumps/FAQ

From Wikitech

How are rows ordered?

This dataset is centered on edit activity. Each row with event_entity == 'revision' represents an edit to a page, but it also carries extra information that can make the dataset confusing at first. As you get familiar with it, remember that each such row is an edit, and every column stores additional context about that edit. For example, a user's edit count goes up when they make an edit. The fields that start with event_user store context about the user making the edit, so the event_user_revision_count field stores that user's updated edit count, incremented by one since their previous edit. If you need page-centric measures, look at the page_ fields. For example, page_revision_count is incremented by one by each edit to the page. To get the total number of edits on a single page, find the row with the greatest event_timestamp for that page and read its page_revision_count.
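The "latest row per page" lookup described above could be sketched as follows. This is a minimal illustration, not an official recipe: it assumes the rows have already been parsed into dictionaries keyed by field name, that a page_id field identifies the page, and that event_timestamp values are strings that sort chronologically. The function name and sample data are hypothetical.

```python
def total_edits_per_page(rows):
    """Return {page_id: page_revision_count of the latest revision row}."""
    latest = {}  # page_id -> (event_timestamp, page_revision_count)
    for row in rows:
        # Only revision rows represent edits; skip page and user events.
        if row["event_entity"] != "revision":
            continue
        page = row["page_id"]  # assumed field name
        stamp = row["event_timestamp"]
        # Keep the row with the greatest timestamp seen so far for this page.
        if page not in latest or stamp > latest[page][0]:
            latest[page] = (stamp, row["page_revision_count"])
    return {page: count for page, (_, count) in latest.items()}


# Hypothetical sample rows, deliberately out of chronological order.
sample_rows = [
    {"event_entity": "revision", "page_id": 10,
     "event_timestamp": "2020-03-05 12:00:00.0", "page_revision_count": 2},
    {"event_entity": "revision", "page_id": 10,
     "event_timestamp": "2020-01-01 00:00:00.0", "page_revision_count": 1},
    {"event_entity": "revision", "page_id": 11,
     "event_timestamp": "2020-02-01 08:30:00.0", "page_revision_count": 1},
    {"event_entity": "page", "page_id": 10,
     "event_timestamp": "2020-04-01 00:00:00.0", "page_revision_count": None},
]

edit_totals = total_edits_per_page(sample_rows)
```

Because the function tracks the maximum timestamp itself, it does not depend on the rows arriving in any particular order.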

How do I get the number of unique editors of a given page?

None of the fields will help you do this right now. We're always looking to improve the usefulness of this dataset, though, so you're welcome to submit a request like this on our Phabricator board.
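Although no field stores this count, one possible workaround is to scan the revision rows and collect the distinct editors seen for each page. The sketch below assumes rows parsed into dictionaries, a page_id field for the page, and an event_user_text field holding the editor's name. Both field names are assumptions to check against the schema, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def unique_editors_per_page(rows):
    """Return {page_id: number of distinct editors across revision rows}."""
    editors = defaultdict(set)  # page_id -> set of editor names
    for row in rows:
        # Only revision rows represent edits.
        if row["event_entity"] != "revision":
            continue
        # event_user_text is an assumed field name for the editor's name.
        editors[row["page_id"]].add(row["event_user_text"])
    return {page: len(names) for page, names in editors.items()}


# Hypothetical sample rows: Alice edits page 10 twice, Bob once.
editor_rows = [
    {"event_entity": "revision", "page_id": 10, "event_user_text": "Alice"},
    {"event_entity": "revision", "page_id": 10, "event_user_text": "Bob"},
    {"event_entity": "revision", "page_id": 10, "event_user_text": "Alice"},
    {"event_entity": "revision", "page_id": 11, "event_user_text": "Alice"},
]

editor_counts = unique_editors_per_page(editor_rows)
```

This keeps one set of names per page in memory, so on a full dump it is exactly the kind of expensive scan that a dedicated field would avoid.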

Doing certain kinds of analysis on this data is very expensive. Are there any plans to make it queryable?

We're trying to find ways to publish this data in a queryable form, because we understand that the current format makes some questions easy to answer while others remain expensive.