Analytics/Data Lake/Content/Mediawiki wikitext history

wmf.mediawiki_wikitext_history is a dataset available in the Data Lake that provides the full content of all revisions, past and present, from Wikimedia wikis (except Wikidata).

The content is stored as unparsed Wikitext. Each monthly snapshot should arrive between the 10th and 12th of the following month.

Wikidata is excluded to reduce the total latency of the dataset from about 23 days to about 11. This should not be a problem in practice, since using Wikidata's XML dumps is strongly discouraged anyway.

Schema

Note: The snapshot and wiki_db fields are Hive partitions; they map directly to folders in HDFS. Since each snapshot contains the full history up to its snapshot date, you should always restrict your query to a single snapshot in the WHERE clause (see the example query after the schema table below).

col_name                 data_type      comment
page_id                  bigint         id of the page
page_namespace           int            namespace of the page
page_title               string         title of the page
page_redirect_title      string         title of the redirected-to page
page_restrictions        array<string>  restrictions of the page
user_id                  bigint         id of the user that made the revision (or -1 if anonymous)
user_text                string         username or IP address of the user that made the revision
revision_id              bigint         id of the revision
revision_parent_id       bigint         id of the parent revision
revision_timestamp       string         timestamp of the revision (ISO 8601 format)
revision_minor_edit      boolean        whether the revision is a minor edit or not
revision_comment         string         comment made with the revision
revision_text_bytes      bigint         number of bytes of the revision text
revision_text_sha1       string         sha1 hash of the revision text
revision_text            string         text of the revision
revision_content_model   string         content model of the revision
revision_content_format  string         content format of the revision
snapshot                 string         versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db                  string         the wiki project (database name, e.g. enwiki)
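
For example, here is a minimal PySpark sketch (not an official query; the snapshot value 2024-02 and the wiki simplewiki are placeholders) that counts revisions per namespace while restricting the query to a single snapshot and wiki, as recommended above:

  from pyspark.sql import SparkSession

  # Sketch only: the snapshot "2024-02" and wiki_db "simplewiki" are placeholder values.
  spark = (SparkSession.builder
           .appName("wikitext-history-example")
           .enableHiveSupport()
           .getOrCreate())

  revisions_per_namespace = spark.sql("""
      SELECT page_namespace,
             COUNT(*) AS revision_count
      FROM wmf.mediawiki_wikitext_history
      WHERE snapshot = '2024-02'       -- always restrict to a single snapshot
        AND wiki_db = 'simplewiki'     -- filtering on wiki_db prunes partitions as well
      GROUP BY page_namespace
      ORDER BY revision_count DESC
  """)

  revisions_per_namespace.show()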

Changes and known problems

Date        Phab task     Snapshot version  Details
2024-03-01  task T357859  2024-02           Wikidata is now excluded in order to dramatically speed up the pipeline.
2019-11-01  task T236687  2019-10           Change underlying file format from parquet to avro to prevent memory issues at read time.
2018-09-01  task T202490  2018-09           Creation of the table. Data starts to flow regularly (every month).

Pipeline

  1. The pages-meta-history public XML data dumps are generated. The bottleneck is the English Wikipedia dump, which finishes between the 7th and the 9th of the month (Wikidata generally takes until the 19th, which is why it is excluded).
  2. A Puppet-managed systemd timer runs a Python script that imports the XML dump files into HDFS, in folders following the pattern hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/YYYYMMDD/WIKI_DB. Wikidata is excluded from this step.
  3. An Airflow job refines the XML dumps into Avro data, stored in folders following the pattern hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=YYYY-MM/wiki_db=WIKI_DB (both path patterns are sketched below).
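
For illustration only, a small Python sketch of the two path patterns above (the helper functions and the example wiki are hypothetical, not part of the actual pipeline code):

  # Hypothetical helpers illustrating the path patterns from steps 2 and 3.
  def raw_dump_path(dump_date: str, wiki_db: str) -> str:
      """XML dumps imported into HDFS (step 2); dump_date is YYYYMMDD."""
      return f"hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/{dump_date}/{wiki_db}"

  def refined_path(snapshot: str, wiki_db: str) -> str:
      """Avro output of the Airflow refine job (step 3); snapshot is YYYY-MM."""
      return f"hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot={snapshot}/wiki_db={wiki_db}"

  print(raw_dump_path("20240301", "simplewiki"))
  # hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/20240301/simplewiki
  print(refined_path("2024-02", "simplewiki"))
  # hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=2024-02/wiki_db=simplewiki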

Note that there is a one-month difference between the snapshot value of the Avro-converted data and the date of the raw XML dumps. This is because the convention in the Data Lake is that the date tells which data is available (for instance, snapshot 2019-11 means that data for 2019-11 is present), while for the dumps, the date tells when the dump process started (for instance, 20191201 means the dump started on 2019-12-01, so it contains data for 2019-11 but not 2019-12).
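
A hypothetical helper makes the convention concrete:

  from datetime import date, timedelta

  def snapshot_for_dump(dump_date: str) -> str:
      """Map a dump start date (YYYYMMDD) to the snapshot it feeds (YYYY-MM):
      the snapshot covers the month *before* the month in which the dump started."""
      first_of_dump_month = date(int(dump_date[:4]), int(dump_date[4:6]), 1)
      last_of_previous_month = first_of_dump_month - timedelta(days=1)
      return last_of_previous_month.strftime("%Y-%m")

  assert snapshot_for_dump("20191201") == "2019-11"  # dump started 2019-12-01, contains data for 2019-11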

The data is stored in Avro format rather than Parquet to prevent memory errors due to vectorized columnar reading in Parquet.
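
If you need to read the Avro files directly instead of going through the Hive table, a PySpark sketch along these lines should work (assuming the spark-avro package is available on the cluster; the snapshot and wiki values are placeholders):

  from pyspark.sql import SparkSession

  # Sketch only: read a single snapshot/wiki partition of the Avro files.
  spark = SparkSession.builder.appName("read-wikitext-avro").getOrCreate()

  path = "hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=2024-02/wiki_db=simplewiki"
  revisions = spark.read.format("avro").load(path)
  revisions.select("page_title", "revision_id", "revision_timestamp").show(5, truncate=False)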