Analytics/Data Lake/Content/Mediawiki wikitext history

wmf.mediawiki_wikitext_history is a dataset available in the Data Lake that provides the full content of all revisions, past and present, from Wikimedia wikis (except Wikidata).

The content is stored as unparsed Wikitext. Each monthly snapshot should arrive between the 10th and 12th of the following month.

Wikidata is excluded to reduce the total latency of the dataset from about 23 days to about 11. This should not be a problem in practice, since using Wikidata's XML dumps is strongly discouraged anyway.

Schema

Note: The snapshot and wiki_db fields are Hive partitions; they map directly to folders in HDFS. Since each snapshot contains the full history up to its snapshot date, you should always restrict your query to a single snapshot in the WHERE clause (see the example query after the schema table below).

col_name                 data_type      comment
page_id                  bigint         id of the page
page_namespace           int            namespace of the page
page_title               string         title of the page
page_redirect_title      string         title of the redirected-to page
page_restrictions        array<string>  restrictions of the page
user_id                  bigint         id of the user that made the revision (or -1 if anonymous)
user_text                string         username or IP address of the user that made the revision
revision_id              bigint         id of the revision
revision_parent_id       bigint         id of the parent revision
revision_timestamp       string         timestamp of the revision (ISO 8601 format)
revision_minor_edit      boolean        whether the revision is a minor edit or not
revision_comment         string         comment made with the revision
revision_text_bytes      bigint         number of bytes of the revision text
revision_text_sha1       string         sha1 hash of the revision text
revision_text            string         text of the revision
revision_content_model   string         content model of the revision
revision_content_format  string         content format of the revision
snapshot                 string         versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db                  string         the wiki project (database name, e.g. enwiki)
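
For example, here is a minimal PySpark sketch (not an official query; the snapshot value 2024-02 and the wiki simplewiki are placeholders) that counts revisions per namespace while restricting the query to a single snapshot and wiki, as recommended above:

  from pyspark.sql import SparkSession

  # Sketch only: the snapshot "2024-02" and wiki_db "simplewiki" are placeholder values.
  spark = (SparkSession.builder
           .appName("wikitext-history-example")
           .enableHiveSupport()
           .getOrCreate())

  revisions_per_namespace = spark.sql("""
      SELECT page_namespace,
             COUNT(*) AS revision_count
      FROM wmf.mediawiki_wikitext_history
      WHERE snapshot = '2024-02'       -- always restrict to a single snapshot
        AND wiki_db = 'simplewiki'     -- filtering on wiki_db prunes partitions as well
      GROUP BY page_namespace
      ORDER BY revision_count DESC
  """)

  revisions_per_namespace.show()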

Changes and known problems

Date        Phab task     Snapshot version  Details
2024-03-01  task T357859  2024-02           Wikidata is now excluded in order to dramatically speed up the pipeline.
2019-11-01  task T236687  2019-10           Change underlying file format from parquet to avro to prevent memory issues at read time.
2018-09-01  task T202490  2018-09           Creation of the table. Data starts to flow regularly (every month).

Pipeline

  1. The pages-meta-history public XML data dumps are generated. The bottleneck is the English Wikipedia dump, which finishes between the 7th and the 9th of the month (Wikidata generally takes until the 19th, which is why it is excluded).
  2. A Puppet-managed systemd timer runs a Python script that imports the XML dump files into HDFS, in folders following the pattern hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/YYYYMMDD/WIKI_DB. Wikidata is excluded from this step.
  3. An Airflow job refines the XML dumps into Avro data, stored in folders following the pattern hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=YYYY-MM/wiki_db=WIKI_DB (both path patterns are sketched below).
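
For illustration only, a small Python sketch of the two path patterns above (the helper functions and the example wiki are hypothetical, not part of the actual pipeline code):

  # Hypothetical helpers illustrating the path patterns from steps 2 and 3.
  def raw_dump_path(dump_date: str, wiki_db: str) -> str:
      """XML dumps imported into HDFS (step 2); dump_date is YYYYMMDD."""
      return f"hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/{dump_date}/{wiki_db}"

  def refined_path(snapshot: str, wiki_db: str) -> str:
      """Avro output of the Airflow refine job (step 3); snapshot is YYYY-MM."""
      return f"hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot={snapshot}/wiki_db={wiki_db}"

  print(raw_dump_path("20240301", "simplewiki"))
  # hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/20240301/simplewiki
  print(refined_path("2024-02", "simplewiki"))
  # hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=2024-02/wiki_db=simplewiki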

Note that there is a one-month difference between the snapshot value of the Avro-converted data and the date of the raw XML dumps. This is because the convention in the Data Lake is that the date tells which data is available (for instance, snapshot 2019-11 means that data for 2019-11 is present), while for the dumps, the date tells when the dump process started (for instance, 20191201 means the dump started on 2019-12-01, so it contains data for 2019-11 but not 2019-12).
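
A hypothetical helper makes the convention concrete:

  from datetime import date, timedelta

  def snapshot_for_dump(dump_date: str) -> str:
      """Map a dump start date (YYYYMMDD) to the snapshot it feeds (YYYY-MM):
      the snapshot covers the month *before* the month in which the dump started."""
      first_of_dump_month = date(int(dump_date[:4]), int(dump_date[4:6]), 1)
      last_of_previous_month = first_of_dump_month - timedelta(days=1)
      return last_of_previous_month.strftime("%Y-%m")

  assert snapshot_for_dump("20191201") == "2019-11"  # dump started 2019-12-01, contains data for 2019-11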

The data is stored in Avro format rather than Parquet to prevent memory errors due to vectorized columnar reading in Parquet.
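
If you need to read the Avro files directly instead of going through the Hive table, a PySpark sketch along these lines should work (assuming the spark-avro package is available on the cluster; the snapshot and wiki values are placeholders):

  from pyspark.sql import SparkSession

  # Sketch only: read a single snapshot/wiki partition of the Avro files.
  spark = SparkSession.builder.appName("read-wikitext-avro").getOrCreate()

  path = "hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=2024-02/wiki_db=simplewiki"
  revisions = spark.read.format("avro").load(path)
  revisions.select("page_title", "revision_id", "revision_timestamp").show(5, truncate=False)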