Analytics/Data Lake/Content/XMLDumps/Mediawiki wikitext history

This page describes the dataset on HDFS and Hive that stores the full revision history of wikitext for WMF's wikis, as provided through the monthly XML dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_history. A new monthly snapshot is produced around the 20th of each month (the last XML dump is made available on the 16th); to check whether it is ready to be queried, view the status of the mediawiki-history-wikitext-coord Oozie job. See Analytics/Data access if you don't know how to access this dataset.

Since this table is very large, increase the Hive client's heap size to avoid out-of-memory errors, for example: export HADOOP_HEAPSIZE=4096 && hive

Schema

You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_history from the Hive/Beeline/Spark command line.

Note that the snapshot and wiki_db fields are Hive partitions. They map explicitly to snapshot folders in HDFS. Since each snapshot contains the full data up to its own snapshot date, you should always pick a single snapshot in the where clause of your query; otherwise the same revisions are read once per snapshot.
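The single-snapshot rule above can be sketched as a small helper that builds a query filtered on both partition fields. This is only an illustration: the function name build_revision_query, the selected columns, and the example values 2018-09 and enwiki are assumptions chosen for the sketch, and the resulting string would still have to be run through Hive, Beeline, or Spark.

```python
import re

# Table name as given on this page.
TABLE = "wmf.mediawiki_wikitext_history"


def build_revision_query(snapshot: str, wiki_db: str, limit: int = 10) -> str:
    """Return a HiveQL query restricted to one snapshot and one wiki.

    Hypothetical helper for illustration; not part of any WMF tooling.
    """
    # Regular imports use YYYY-MM snapshot names (see the schema comments).
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    # Assumption: wiki_db values are lowercase identifiers such as 'enwiki'.
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Filtering on both partition fields lets Hive read only the matching folders.
    return (
        "SELECT page_id, page_title, revision_id, revision_timestamp\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"
        f"  AND wiki_db = '{wiki_db}'\n"
        f"LIMIT {limit}"
    )


print(build_revision_query("2018-09", "enwiki"))
```

Validating the two values before interpolating them keeps the sketch from producing a query that silently scans every snapshot.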

col_name                 data_type      comment
page_id                  bigint         id of the page
page_namespace           int            namespace of the page
page_title               string         title of the page
page_redirect_title      string         title of the redirected-to page
page_restrictions        array<string>  restrictions of the page
user_id                  bigint         id of the user that made the revision (or null/0 if anonymous)
user_text                string         text of the user that made the revision (either username or IP)
revision_id              bigint         id of the revision
revision_parent_id       bigint         id of the parent revision
revision_timestamp       string         timestamp of the revision (ISO8601 format)
revision_minor_edit      boolean        whether this revision is a minor edit or not
revision_comment         string         Comment made with revision
revision_text_bytes      bigint         bytes number of the revision text
revision_text_sha1       string         sha1 hash of the revision text
revision_text            string         text of the revision
revision_content_model   string         content model of the revision
revision_content_format  string         content format of the revision
snapshot                 string         Versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db                  string         The wiki_db project
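
Two of the field comments above carry conventions that are easy to get wrong when processing rows: user_id is null or 0 for anonymous edits, and revision_timestamp is a string in ISO8601 format rather than a timestamp type. The following is a minimal sketch of how a client might interpret them; the helper names are hypothetical, and the exact timestamp layout (for example, whether a trailing Z is present) is an assumption that should be checked against real rows.

```python
from datetime import datetime
from typing import Optional


def is_anonymous(user_id: Optional[int]) -> bool:
    """True when the revision author is anonymous (user_id is null or 0)."""
    return user_id is None or user_id == 0


def parse_revision_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 revision_timestamp string into a datetime."""
    # Assumption: a trailing 'Z' (UTC) may be present. Older Python versions
    # of datetime.fromisoformat() reject it, so normalize it to an offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


# Example values are made up for illustration.
print(is_anonymous(None), is_anonymous(0), is_anonymous(42))
print(parse_revision_timestamp("2018-09-01T12:34:56Z").isoformat())
```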

Changes and known problems

Date        Phab Task     Snapshot version  Details
2018-09-01  Task T202490  2018-09           Creation of the table. Data starts to flow regularly (every month).


See also