Analytics/Data Lake/Edits/Mediawiki page history

This page describes the data set that stores the page history of WMF's wikis. It lives in Analytics' Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_page_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.


col_name                              data_type           comment
wiki_db                               string              enwiki, dewiki, eswiktionary, etc.
page_id                               bigint              ID of the page, as in the page table.
page_artificial_id                    string              Generated ID for deleted pages without a real ID.
page_creation_timestamp               string              Creation timestamp of the page.
page_title_historical                 string              Historical page title, with spaces replaced by underscores.
page_title                            string              Page title as of today, with spaces replaced by underscores.
page_namespace_historical             int                 Historical namespace.
page_namespace_is_content_historical  boolean             Whether the historical namespace is categorized as content.
page_namespace                        int                 Namespace as of today.
page_namespace_is_content             boolean             Whether the current namespace is categorized as content.
page_is_redirect                      boolean             In revision/page events: whether the page is currently a redirect.
page_is_deleted                       boolean             Whether the page is rebuilt from a delete event.
start_timestamp                       string              Timestamp from where this state applies (inclusive).
end_timestamp                         string              Timestamp to where this state applies (exclusive).
caused_by_event_type                  string              Event that caused this state (create, move, delete or restore).
caused_by_user_id                     bigint              ID of the user that caused this state.
caused_by_user_text                   string              Name of the user that caused this state.
inferred_from                         string              If non-NULL, some fields have been inferred from an inconsistency in the source data.
source_log_id                         bigint              ID of the logging table row that caused this state.
source_log_comment                    string              Comment of the logging table row that caused this state.
source_log_params                     map<string,string>  Parameters of the logging table row that caused this state, parsed as a map.
snapshot                              string              Versioning information to keep multiple datasets (YYYY-MM for regular labs imports).

Note the snapshot field: it is a Hive partition and maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries. Without it, a query reads every snapshot and sees the same history several times.
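
As a minimal sketch of why the snapshot predicate matters, the Python example below uses an in-memory SQLite table as a stand-in for wmf.mediawiki_page_history. The rows are invented, only a few columns are reproduced, and the helper title_at is hypothetical. It also assumes that timestamps are strings that sort chronologically and that the most recent state of a page has a NULL end_timestamp; neither is stated above, so check the real data before relying on them.

```python
import sqlite3

# Toy stand-in for wmf.mediawiki_page_history (subset of columns, invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mediawiki_page_history (
        wiki_db TEXT, page_id INTEGER, page_title_historical TEXT,
        start_timestamp TEXT, end_timestamp TEXT,
        caused_by_event_type TEXT, snapshot TEXT
    )
""")
rows = [
    # Snapshot 2019-03: the page exists and has not been moved yet.
    ("simplewiki", 1, "Old_title", "2019-01-01 00:00:00", None, "create", "2019-03"),
    # Snapshot 2019-04: the full history again, now including a move.
    ("simplewiki", 1, "Old_title", "2019-01-01 00:00:00", "2019-04-10 00:00:00", "create", "2019-04"),
    ("simplewiki", 1, "New_title", "2019-04-10 00:00:00", None, "move", "2019-04"),
]
conn.executemany("INSERT INTO mediawiki_page_history VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def title_at(snapshot, wiki_db, page_id, ts):
    """Titles of the page states covering ts: start inclusive, end exclusive."""
    query = """
        SELECT page_title_historical
        FROM mediawiki_page_history
        WHERE snapshot = ?
          AND wiki_db = ?
          AND page_id = ?
          AND start_timestamp <= ?
          AND (end_timestamp IS NULL OR end_timestamp > ?)
    """
    return [r[0] for r in conn.execute(query, (snapshot, wiki_db, page_id, ts, ts))]


# Without a snapshot predicate, states from every snapshot are mixed together.
all_states = conn.execute(
    "SELECT COUNT(*) FROM mediawiki_page_history WHERE page_id = 1"
).fetchone()[0]

# With the predicate, only one consistent copy of the history is read.
one_snapshot = conn.execute(
    "SELECT COUNT(*) FROM mediawiki_page_history "
    "WHERE page_id = 1 AND snapshot = '2019-04'"
).fetchone()[0]

print(all_states, one_snapshot)
print(title_at("2019-04", "simplewiki", 1, "2019-02-01 00:00:00"))
print(title_at("2019-04", "simplewiki", 1, "2019-04-10 00:00:00"))
```

On the cluster, the same WHERE clause shape (a snapshot predicate plus the start_timestamp/end_timestamp bounds) would be used in a Hive/Beeline query against wmf.mediawiki_page_history. The SQLite table here only makes the sketch runnable without cluster access.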


Changes and known problems


Date        Details        Phab
2019-04     Schema changes (non-breaking, only new fields): addition of page_is_deleted, caused_by_user_text, source_log_id, source_log_comment, source_log_params. Change in how delete/restore events are handled: a restore was previously assumed to always create a new page_id, which it does not. It either restores the deleted page, if no page currently exists with the given title, or does nothing, if a page already exists with that title (restore-into, i.e. revisions from a previously deleted page with the given title are merged into the existing page).  Task T221824
2017-11     For pairs of fields that give the current and historical versions of a value, the fields were renamed so that _historical is appended to the historical field rather than _latest to the current one.
2016/10/06  The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.
2017/03/01  Added the snapshot partition, which allows keeping multiple versions of the page history. Data starts to flow regularly (every month) from labs.
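
The restore rule described in the 2019-04 entry can be illustrated with a small Python sketch. This is not the code of the actual reconstruction job: the function apply_restore and the two dictionaries (live pages and deleted pages, each keyed by title) are hypothetical names chosen for illustration only.

```python
def apply_restore(live_by_title, deleted_by_title, title):
    """Apply a restore event for `title`.

    Returns the page_id that becomes live again, or None when the
    restore changes no page state.
    """
    if title in live_by_title:
        # restore-into: a page with this title already exists, so the
        # restored revisions are merged into it and no page state changes.
        return None
    page_id = deleted_by_title.pop(title, None)
    if page_id is not None:
        # The deleted page comes back with its original page_id;
        # no new page_id is created.
        live_by_title[title] = page_id
    return page_id


live = {"A": 1}
deleted = {"A": 7, "B": 2}
print(apply_restore(live, deleted, "A"))
print(apply_restore(live, deleted, "B"))
print(live)
```

In this sketch the restore of "A" is a no-op because a live page already holds that title, while the restore of "B" brings the deleted page back under its original ID.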