Data Platform/Data Lake/Edits/Mediawiki page history
This page describes the data set that stores the page history of WMF's wikis. It lives in Analytics' Hadoop cluster and is accessible via the Hive table wmf.mediawiki_page_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.
Schema
| col_name | data_type | comment |
|---|---|---|
| wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
| page_id | bigint | Id of the page, as in the page table. |
| page_artificial_id | string | Generated Id for deleted pages without real Id. |
| page_creation_timestamp | string | Creation timestamp of the page. |
| page_first_edit_timestamp | string | Timestamp of the page's first revision (can be before page_creation in restore/merge cases). |
| page_title_historical | string | Historical page title, with spaces replaced by underscores. |
| page_title | string | Page title as of today, with spaces replaced by underscores. |
| page_namespace_historical | int | Historical namespace. |
| page_namespace_is_content_historical | boolean | Whether the historical namespace is categorized as content |
| page_namespace | int | Namespace as of today. |
| page_namespace_is_content | boolean | Whether the current namespace is categorized as content |
| page_is_redirect | boolean | In revision/page events: whether the page is currently a redirect |
| page_is_deleted | boolean | Whether the page is rebuilt from a delete event |
| start_timestamp | string | Timestamp from where this state applies (inclusive). |
| end_timestamp | string | Timestamp to where this state applies (exclusive). |
| caused_by_event_type | string | Event that caused this state (create, move, delete or restore). |
| caused_by_user_id | bigint | ID of the user that caused this state. |
| caused_by_user_text | string | Name of the user that caused this state |
| caused_by_anonymous_user | boolean | Whether the user that caused this state was anonymous |
| inferred_from | string | If non-NULL, some fields have been inferred from an inconsistency in the source data. |
| source_log_id | bigint | ID of the logging table row that caused this state |
| source_log_comment | string | Comment of the logging table row that caused this state |
| source_log_params | map<string,string> | Parameters of the logging table row that caused this state, parsed as a map |
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
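The start_timestamp (inclusive) and end_timestamp (exclusive) columns define a half-open interval for each page state. The following minimal sketch shows how that interval could be used to find the state of a page at a given moment. It makes several assumptions that are not stated on this page: timestamps are strings that sort chronologically when compared as text, a NULL (None) bound means the interval is open on that side, and the sample rows are invented for illustration.

```python
from typing import Mapping, Optional, Sequence

State = Mapping[str, Optional[str]]


def state_at(states: Sequence[State], ts: str) -> Optional[State]:
    """Return the state whose [start_timestamp, end_timestamp) interval contains ts."""
    for state in states:
        start = state["start_timestamp"]
        end = state["end_timestamp"]
        # Assumption: None means "no bound on this side".
        starts_before = start is None or start <= ts  # start is inclusive
        ends_after = end is None or ts < end          # end is exclusive
        if starts_before and ends_after:
            return state
    return None


# Hypothetical rows for a single page that was moved once.
states = [
    {"page_title_historical": "Old_title",
     "start_timestamp": "2015-01-01 00:00:00",
     "end_timestamp": "2016-06-01 12:00:00"},
    {"page_title_historical": "New_title",
     "start_timestamp": "2016-06-01 12:00:00",
     "end_timestamp": None},
]

# The boundary timestamp belongs to the later state, because end is exclusive.
match = state_at(states, "2016-06-01 12:00:00")
print(match["page_title_historical"] if match else None)
```

Because the end bound is exclusive, two consecutive states never both match the same timestamp.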
Note the snapshot field: it is a Hive partition and maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
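As an illustration of the snapshot predicate, here is a small sketch that builds a HiveQL query restricted to a single snapshot. The helper name build_page_history_query, the selected columns, and the example page_id are hypothetical choices made for this example. The validation patterns are also assumptions (YYYY-MM snapshots, lowercase wiki_db names). How the query is then executed (Hive, Spark, or another client) is outside this sketch.

```python
import re

TABLE = "wmf.mediawiki_page_history"  # Hive table described on this page


def build_page_history_query(snapshot: str, wiki_db: str, page_id: int) -> str:
    """Return a HiveQL query for one page's states within a single snapshot."""
    # Assumption: regular snapshots are named YYYY-MM, as in the schema comment.
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    # Assumption: wiki_db values are lowercase names such as enwiki.
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    return (
        "SELECT page_title_historical, start_timestamp, end_timestamp,\n"
        "       caused_by_event_type\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"  # partition predicate: one snapshot only
        f"  AND wiki_db = '{wiki_db}'\n"
        f"  AND page_id = {int(page_id)}\n"
        "ORDER BY start_timestamp"
    )


# Example: the page_id value here is arbitrary.
print(build_page_history_query("2019-07", "enwiki", 12345))
```

Without the snapshot predicate, the same page states would be returned once per available snapshot, since each snapshot holds the full history up to its date.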
FAQ
Changes and known problems
| Snapshot or date | Details | Phab Task |
|---|---|---|
| 2019-07 | Schema changes: addition of caused_by_anonymous_user and page_first_edit_timestamp. | task T221825 |
| 2019-04 | Schema changes (no breaking change, only new fields): addition of page_is_deleted, caused_by_user_text, source_log_id, source_log_comment, source_log_params. Change in how delete/restore are handled: restore was supposed to always create a new page_id, when it actually doesn't. It either restores a page that was deleted if no page is present with the given title, or does nothing if a page already exists with the given title (restore-into: merge revisions from a previously deleted page with the given title into an existing page). | task T221824 |
| 2017-11 | For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one. | |
| 2016/10/06 | The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis. | |
| 2017/03/01 | Add the snapshot partition, which allows keeping multiple versions of the page history. Data starts to flow regularly (every month) from labs. | |