Analytics/Data Lake/Edits/Mediawiki history dumps/FAQ
How are rows ordered?
This dataset is centered around edit activity. Each row with an event_entity == 'revision' represents an edit to a page, but carries with it information that can make the dataset confusing. As you get familiar with it, remember that each row is an edit, and every column stores additional context about that edit. So, for example, a user's edit count goes up when they make an edit. The fields that start with event_user store context about the user making the edit. So the event_user_revision_count field stores this user's updated edit count, incremented by one since their previous edit. If you need page-centric measures, look at the page_ fields. For example, the page_revision_count field will be incremented by one by this edit. To get the total number of edits on a single page, look for the row with the greatest event_timestamp for that page, and read the
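As a rough illustration of the lookup described above, the sketch below picks the latest revision row for a page and reads a count from it. It rests on several assumptions that this FAQ does not confirm: that the value to read is page_revision_count, that rows have been loaded as dictionaries, that a page_id field identifies the page, and that event_timestamp values are strings that sort chronologically. The sample rows are made up for the example.

```python
from typing import Iterable, Mapping, Optional


def total_edits_on_page(rows: Iterable[Mapping], page_id: int) -> Optional[int]:
    """Return page_revision_count from the latest revision row of a page.

    Assumptions (not confirmed by the FAQ): rows are dicts, `page_id`
    identifies the page, and `event_timestamp` strings sort chronologically.
    """
    latest = None
    for row in rows:
        # Only revision rows represent edits.
        if row["event_entity"] != "revision" or row["page_id"] != page_id:
            continue
        if latest is None or row["event_timestamp"] > latest["event_timestamp"]:
            latest = row
    return None if latest is None else latest["page_revision_count"]


# Hypothetical sample rows, deliberately out of chronological order.
rows = [
    {"event_entity": "revision", "page_id": 1,
     "event_timestamp": "2020-01-03 09:00:00.0", "page_revision_count": 2},
    {"event_entity": "revision", "page_id": 1,
     "event_timestamp": "2020-01-01 10:00:00.0", "page_revision_count": 1},
    {"event_entity": "revision", "page_id": 2,
     "event_timestamp": "2020-01-02 12:00:00.0", "page_revision_count": 1},
    {"event_entity": "revision", "page_id": 1,
     "event_timestamp": "2020-01-05 08:30:00.0", "page_revision_count": 3},
    # A non-revision row, which the function ignores.
    {"event_entity": "page", "page_id": 1,
     "event_timestamp": "2020-01-06 00:00:00.0", "page_revision_count": None},
]

print(total_edits_on_page(rows, 1))  # 3
```

Because the function scans every row rather than relying on file order, it makes no assumption about how rows are sorted in the dump.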
How do I get the number of unique editors of a given page?
None of the fields gives you this directly at the moment. We're always looking to improve the usefulness of this dataset, so you're welcome to submit a request like this on our Phabricator board.
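In the meantime, one possible workaround is to count distinct editors yourself from the revision rows of a page. The sketch below is only an illustration under assumptions this FAQ does not state: rows are loaded as dictionaries, a page_id field identifies the page, and an event_user_id field identifies the editor. Rows with no user id are skipped, so editors without an id would not be counted. The sample rows are made up.

```python
from typing import Iterable, Mapping


def unique_editor_count(rows: Iterable[Mapping], page_id: int) -> int:
    """Count distinct editors of a page from its revision rows.

    Assumptions (not confirmed by the FAQ): `page_id` identifies the page
    and `event_user_id` identifies the editor; rows lacking a user id
    are skipped.
    """
    editors = set()
    for row in rows:
        if row["event_entity"] != "revision" or row["page_id"] != page_id:
            continue
        user_id = row.get("event_user_id")
        if user_id is not None:
            editors.add(user_id)
    return len(editors)


# Hypothetical sample rows: users 7 and 8 edit page 1, user 7 edits page 2.
rows = [
    {"event_entity": "revision", "page_id": 1, "event_user_id": 7},
    {"event_entity": "revision", "page_id": 1, "event_user_id": 8},
    {"event_entity": "revision", "page_id": 2, "event_user_id": 7},
    {"event_entity": "revision", "page_id": 1, "event_user_id": 7},
    {"event_entity": "revision", "page_id": 1, "event_user_id": None},
]

print(unique_editor_count(rows, 1))  # 2
```

This requires a full pass over the page's rows, which is part of why a precomputed field would be a useful addition.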
Doing certain kinds of analysis on this data is very expensive. Are there any plans to make it queryable?
We're trying to find ways to publish this in a queryable form, because we understand that the current format makes some questions easy to answer while others remain expensive.