Data Platform/Data Lake/Content/Mediawiki content current v1

wmf_content.mediawiki_content_current_v1 is a dataset available in the Data Lake that provides the full content of the latest revisions from all Wikimedia wikis. It is updated daily.

The schema of this table is similar to that of Mediawiki wikitext current; however, this table's source data is the daily updated wmf_content.mediawiki_content_history_v1. The content under revision_content_slots['main'] is typically unparsed wikitext. Multi-Content Revisions content is also included, which as of this writing is only used by commonswiki.

If you are instead interested in all revisions, past and present, then the Mediawiki content history v1 dataset is more appropriate.

Consuming this table is different from consuming snapshot-based tables like Mediawiki wikitext current; see the FAQ below for details.
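For example, the following is a minimal PySpark sketch of fetching the current wikitext of a single page. It assumes an existing SparkSession named spark (as in a typical PySpark notebook on the analytics clients); the wiki and page title are arbitrary examples.

# Minimal sketch, assuming an existing SparkSession named `spark`.
current_wikitext = spark.sql("""
    SELECT
        page_title,
        revision_id,
        revision_dt,
        revision_content_slots['main'].content_body AS wikitext
    FROM wmf_content.mediawiki_content_current_v1
    WHERE wiki_id = 'enwiki'
      AND page_namespace_id = 0
      AND page_title = 'Main_Page'  -- example page; adjust as needed
""")

current_wikitext.show(1, truncate=80)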

Schema

Note: This is an Iceberg dataset.

col_name data_type comment
page_id bigint The (database) page ID of the page.
page_namespace_id int The id of the namespace this page belongs to.
page_title string The normalized title of the page. If page_namespace_id = 0, then this is the non-namespaced title. If page_namespace_id != 0, then the title is prepended with the localized namespace. Examples for "enwiki": "Main_Page" and "Talk:Main_Page".
page_redirect_target string The title of the page this page redirects to, if any. Same rules as page_title.
user_id bigint The ID of the user that made the revision; null if anonymous, zero if an old system user, and -1 when deleted or malformed XML was imported.
user_text string The text of the user that made the revision (either username or IP address).
user_is_visible boolean Whether the user that made the revision is visible. If this is false, then the user should be redacted when shown publicly. See RevisionRecord->DELETED_USER.
revision_id bigint The (database) revision ID.
revision_parent_id bigint The (database) revision ID of the parent revision.
revision_dt timestamp The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.
revision_is_minor_edit boolean True if the editor marked this revision as a minor edit.
revision_comment string The comment left by the user when this revision was made.
revision_comment_is_visible boolean Whether the comment of the revision is visible. If this is false, then the comment should be redacted when shown publicly. See RevisionRecord->DELETED_COMMENT.
revision_sha1 string Nested SHA1 hash of hashes of all content slots. See https://www.mediawiki.org/wiki/Manual:Revision_table#rev_sha1
revision_size bigint The sum of the content_size of all content slots.
revision_content_slots
MAP<
  STRING,
  STRUCT<
    content_body:   STRING,
    content_format: STRING,
    content_model:  STRING,
    content_sha1:   STRING,
    content_size:   BIGINT
  >
>
A MAP containing all the content slots associated with this revision. Typically just the "main" slot, but also "mediainfo" for commonswiki.
revision_content_is_visible boolean Whether revision_content_slots is visible. If this is false, then any content should be redacted when shown publicly. See RevisionRecord->DELETED_TEXT.
wiki_id string The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.
row_update_dt timestamp Control column used to efficiently update this table from its source, wmf_content.mediawiki_content_history_v1.
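The *_is_visible flags indicate suppressed (revision-deleted) data. As a hedged sketch of how they might be applied when preparing output for public consumption (again assuming an existing SparkSession named spark):

# Sketch only: null out suppressed fields before publishing results.
public_view = spark.sql("""
    SELECT
        wiki_id,
        page_title,
        revision_id,
        IF(user_is_visible, user_text, NULL) AS user_text,
        IF(revision_comment_is_visible, revision_comment, NULL) AS revision_comment,
        IF(revision_content_is_visible,
           revision_content_slots['main'].content_body, NULL) AS wikitext
    FROM wmf_content.mediawiki_content_current_v1
    WHERE wiki_id = 'simplewiki'  -- example wiki; adjust as needed
""")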

Changes and known problems

Date Phab Task Details
2025-05-19 task T392494 Added data quality metrics. Officially calling the table production quality.
2025-04-30 task T391279 Table available in the data lake, but not production yet. We still need to put Data Quality pipelines in place.

FAQ

All of the same FAQs as in Data Platform/Data Lake/Content/Mediawiki content history v1#FAQ apply.

This table doesn't seem to have Hive partitions, such as the usual 'snapshot' column. How do I consume it?

This table indeed does not use a snapshot partition like other tables such as Mediawiki wikitext current. It uses a table format called Iceberg, which lets us update the table's content in place rather than rewriting all the data as we have done before. The main benefit is that the table's content can be updated on a daily cadence.
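In practice this means you simply query the table directly, with no snapshot predicate. If you want to check how fresh the table is, Iceberg also exposes metadata tables; the sketch below assumes an existing SparkSession named spark with the cluster's Iceberg integration enabled.

# No snapshot partition: query the table directly.
freshness = spark.sql("""
    SELECT MAX(revision_dt) AS most_recent_revision
    FROM wmf_content.mediawiki_content_current_v1
""")

# Optional: Iceberg's `history` metadata table shows recent commits
# (syntax assumes Spark's Iceberg integration is available).
commits = spark.sql("""
    SELECT made_current_at, snapshot_id, is_current_ancestor
    FROM wmf_content.mediawiki_content_current_v1.history
    ORDER BY made_current_at DESC
""")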

If you are building a data pipeline and need an Airflow sensor to wait on this table's updates, then instead of waiting on Hive partitions you should use our datasets.yaml configuration and the appropriate helper functions to construct a sensor.

In your instance's datasets.yaml file:

iceberg_wmf_content_mediawiki_content_current_v1:
  datastore: iceberg
  table_name: wmf_content.mediawiki_content_current_v1
  produced_by:
    airflow:
      instance: main
      dag_id: mw_content_merge_changes_to_mw_content_current_daily

In your Airflow Python DAG code:

# here we use platform_eng, but you should use the config
# that applies to your Airflow instance
from platform_eng.config.dag_config import dataset

dataset_id = "iceberg_wmf_content_mediawiki_content_current_v1"

wait_for_sensor = dataset(dataset_id).get_sensor_for(dag)
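The resulting sensor waits on the producing Airflow DAG declared in datasets.yaml (mw_content_merge_changes_to_mw_content_current_daily on the main instance) rather than on a Hive partition.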

Pipeline

  1. Every day, we wait for wmf_content.mediawiki_content_history_v1 to finish its daily ingestion.
  2. We then run a PySpark job that:
    1. Uses SQL to calculate all the per-page changes (inserts, updates, or deletes) that happened upstream.
    2. Applies these changes to wmf_content.mediawiki_content_current_v1 via a MERGE INTO (a sketch of this pattern is shown after this list).
    3. Because the changes are calculated via a full table scan of the upstream table and applied with a MERGE INTO command, this pipeline is self-healing: if it fails on a given day, the next day's run picks up the changes from both days, and so on.
  3. Data quality checks specific to this table are done via a separate PySpark job. Data Quality checks from the upstream table also apply via transitivity.
  4. This pipeline is coordinated via an Airflow DAG.
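For intuition, the MERGE INTO step behaves roughly like the Spark SQL sketch below. This is illustrative only and is not the actual job's SQL; the per_page_changes view and its change_type column are hypothetical.

# Illustrative sketch only; `per_page_changes` and `change_type` are hypothetical.
spark.sql("""
    MERGE INTO wmf_content.mediawiki_content_current_v1 AS target
    USING per_page_changes AS changes
        ON  target.wiki_id = changes.wiki_id
        AND target.page_id = changes.page_id
    WHEN MATCHED AND changes.change_type = 'delete' THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET *
    WHEN NOT MATCHED AND changes.change_type != 'delete' THEN
        INSERT *
""")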