Analytics/Data Lake/Content/Mediawiki wikitext history

This page describes the dataset on HDFS and Hive that stores the full revision history of wikitext for WMF's wikis, as provided through the monthly XML dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_history. A new monthly snapshot is produced around the 20th of each month (the last XML dump is made available on the 16th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-history-coord Oozie job. See Analytics/Data access if you don't know how to access this dataset.

Since the 2019-10 snapshot, the underlying data is stored in Avro instead of Parquet. The change has almost no effect on data size or processing time, and it prevents memory errors caused by vectorized columnar reading in Parquet. Data is stored on HDFS at paths following the pattern: hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=YYYY-MM/wiki_db=WIKI_DB
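As an illustration of the path pattern above, here is a minimal Python sketch that builds the HDFS location of one partition. The function name wikitext_history_path and the example wiki simplewiki are hypothetical and chosen only for illustration; the path prefix is the one given on this page.

```python
import re

# Path prefix as documented on this page.
BASE_PATH = "hdfs:///wmf/data/wmf/mediawiki/wikitext/history"


def wikitext_history_path(snapshot: str, wiki_db: str) -> str:
    """Return the HDFS folder for one snapshot/wiki_db partition.

    Hypothetical helper: only the path pattern comes from the page.
    """
    # Snapshots are named YYYY-MM for regular imports.
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    return f"{BASE_PATH}/snapshot={snapshot}/wiki_db={wiki_db}"


print(wikitext_history_path("2019-10", "simplewiki"))
```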


You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_history from the hive/beeline/spark command line.

Note: The snapshot and wiki_db fields are Hive partitions. They map directly to the snapshot and wiki_db folders in HDFS. Each snapshot contains the full history up to its snapshot date, so you should always select a single snapshot in the WHERE clause of your query; otherwise the same revisions are read once per snapshot.
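The single-snapshot rule can be sketched as a small Python helper that generates a query pinned to one snapshot and one wiki. The helper name build_single_snapshot_query, the selected columns, and the wiki_db validation pattern are assumptions made for this example; the table name and the partition fields come from this page.

```python
import re

TABLE = "wmf.mediawiki_wikitext_history"


def build_single_snapshot_query(snapshot: str, wiki_db: str, limit: int = 10) -> str:
    """Build a query restricted to exactly one snapshot and one wiki.

    Hypothetical helper: inputs are validated before being interpolated
    into the SQL text. The wiki_db pattern is an assumption.
    """
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    return (
        "SELECT page_id, page_title, revision_id, revision_timestamp\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"  # one partition, not the whole table
        f"  AND wiki_db = '{wiki_db}'\n"
        f"LIMIT {int(limit)}"
    )


print(build_single_snapshot_query("2019-10", "simplewiki", limit=5))
```

The resulting string can be pasted into a Beeline session or, in a PySpark session, passed to spark.sql(...).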

col_name data_type comment
page_id bigint id of the page
page_namespace int namespace of the page
page_title string title of the page
page_redirect_title string title of the redirected-to page
page_restrictions array<string> restrictions of the page
user_id bigint id of the user that made the revision (or -1 if anonymous)
user_text string text of the user that made the revision (either username or IP)
revision_id bigint id of the revision
revision_parent_id bigint id of the parent revision
revision_timestamp string timestamp of the revision (ISO8601 format)
revision_minor_edit boolean whether this revision is a minor edit or not
revision_comment string Comment made with revision
revision_text_bytes bigint bytes number of the revision text
revision_text_sha1 string sha1 hash of the revision text
revision_text string text of the revision
revision_content_model string content model of the revision
revision_content_format string content format of the revision
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db string The wiki_db project

Changes and known problems

Date Phab Snapshot version Details
2019-11-01 task T236687 2019-10 Change underlying file format from parquet to avro to prevent memory issues at read time.
2018-09-01 task T202490 2018-09 Creation of the table. Data starts to flow regularly (every month).

XML Dumps Raw Data

The mediawiki_wikitext_history dataset is computed from the pages_meta_history XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/YYYYMMDD/WIKI_DB

Note: There is a one-month difference between the snapshot value of the Avro-converted data and the date of the raw data. By convention, Hive snapshots are named after the latest month of data they contain (for instance, 2019-11 means that November 2019 data is present), whereas dump folders are named after the date on which dump generation started (20191201 means generation started on 2019-12-01, so the dump contains 2019-11 data but not 2019-12).
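The one-month offset can be sketched in Python as follows. The function name dump_date_to_snapshot is hypothetical, and the rule "snapshot is the calendar month before the dump start date" is an assumption generalized from the single example given in the note.

```python
from datetime import datetime


def dump_date_to_snapshot(dump_date: str) -> str:
    """Map a raw dump folder date (YYYYMMDD) to a Hive snapshot (YYYY-MM).

    Assumption: the snapshot is the calendar month preceding the month
    in which dump generation started.
    """
    started = datetime.strptime(dump_date, "%Y%m%d")
    year, month = started.year, started.month - 1
    if month == 0:  # a January dump holds data up to the previous December
        year, month = year - 1, 12
    return f"{year:04d}-{month:02d}"


print(dump_date_to_snapshot("20191201"))  # 2019-11
print(dump_date_to_snapshot("20200101"))  # 2019-12
```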

See also