Data Platform/Data Lake/Content/Mediawiki content current v1
wmf_content.mediawiki_content_current_v1 is a dataset available in the Data Lake that provides the full content of the latest revisions from all Wikimedia wikis. It is updated daily.
The schema of this table is similar to that of Mediawiki wikitext current. However, this table's source data is the daily updated wmf_content.mediawiki_content_history_v1. The content under this table's revision_content_slots['main'] is typically unparsed Wikitext. Multi-Content Revisions (MCR) content is also included, which as of this writing is only used by commonswiki.
If you are instead interested in all revisions, past and present, then the Mediawiki content history v1 dataset is more appropriate.
Consuming this table differs from consuming snapshot-based tables like Mediawiki wikitext current. See the FAQ below for details.
Schema
This table may expose some data that has been suppressed from the public via the RevisionDelete system. If the results of a query are to be made public, please honor the *_is_visible visibility flags below. We should not make public any data for which the visibility flag is set to false.
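For example, a public-facing query can redact hidden fields by checking each visibility flag inline. The following is a sketch of such a Spark SQL query (run it via spark.sql(query)); the redaction strategy of nulling out hidden values is one option, not a mandated approach:

```python
# Sketch: honor the *_is_visible flags by nulling out suppressed fields.
# Column and table names come from the schema below; adjust the selected
# columns and the wiki_id filter to your use case.
query = """
SELECT
  page_title,
  IF(user_is_visible, user_text, NULL) AS user_text,
  IF(revision_comment_is_visible, revision_comment, NULL) AS revision_comment,
  IF(revision_content_is_visible,
     revision_content_slots['main'].content_body, NULL) AS content_body
FROM wmf_content.mediawiki_content_current_v1
WHERE wiki_id = 'enwiki'
"""
```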
This is an Iceberg dataset.
| col_name | data_type | comment |
|---|---|---|
| page_id | bigint | The (database) page ID of the page. |
| page_namespace_id | int | The id of the namespace this page belongs to. |
| page_title | string | The normalized title of the page. If page_namespace_id = 0, then this is the non-namespaced title. If page_namespace_id != 0, then the title is prepended with the localized namespace. Examples for "enwiki": "Main_Page" and "Talk:Main_Page". |
| page_redirect_target | string | The title of the redirect target page, if any. Same rules as page_title. |
| user_id | bigint | The ID of the user that made the revision; null if anonymous, zero if an old system user, and -1 when deleted or malformed XML was imported. |
| user_central_id | bigint | Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID |
| user_text | string | The text of the user that made the revision (either username or IP address). |
| user_is_visible | boolean | Whether the user that made the revision is visible. If this is false, then the user should be redacted when shown publicly. See RevisionRecord->DELETED_USER. |
| revision_id | bigint | The (database) revision ID. |
| revision_parent_id | bigint | The (database) revision ID of the parent revision. |
| revision_dt | timestamp | The (database) time this revision was created. This is rev_timestamp in the MediaWiki database. |
| revision_is_minor_edit | boolean | True if the editor marked this revision as a minor edit. |
| revision_comment | string | The comment left by the user when this revision was made. |
| revision_comment_is_visible | boolean | Whether the comment of the revision is visible. If this is false, then the comment should be redacted when shown publicly. See RevisionRecord->DELETED_COMMENT. |
| revision_size | bigint | The sum of the content_size of all content slots. |
| revision_content_slots | map<string, struct<content_body: string, content_format: string, content_model: string, content_sha1: string, content_size: bigint, origin_rev_id: bigint>> | A MAP containing all the content slots associated with this revision. Typically just the "main" slot, but also "mediainfo" for commonswiki. |
| revision_content_is_visible | boolean | Whether revision_content_slots is visible. If this is false, then any content should be redacted when shown publicly. See RevisionRecord->DELETED_TEXT. |
| wiki_id | string | The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc. |
| row_update_dt | timestamp | Control column to efficiently update this table from source at wmf_content.mediawiki_content_history_v1. |
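Since revision_content_slots is a map of structs, individual slots are addressed by key. A sketch of a Spark SQL query (the page chosen in the WHERE clause is just an illustration) that pulls the unparsed wikitext from the "main" slot:

```python
# Sketch: read the 'main' content slot for a single page.
# The 'main' slot typically holds unparsed Wikitext; commonswiki
# additionally carries a 'mediainfo' slot.
query = """
SELECT
  page_title,
  revision_content_slots['main'].content_model AS content_model,
  revision_content_slots['main'].content_body  AS wikitext
FROM wmf_content.mediawiki_content_current_v1
WHERE wiki_id = 'enwiki'
  AND page_namespace_id = 0
  AND page_title = 'Main_Page'
"""
```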
Changes and known problems
| Date | Phab Task | Details |
|---|---|---|
| 2025-10-20 | T405641 and T406515 | Dropped revision_sha1. Added user_central_id, but it is not backfilled yet. |
| 2025-10-09 | T405944 | Added origin_rev_id to the map entries of revision_content_slots to be able to do MCR File Export. |
| 2025-05-19 | T392494 | Added data quality metrics. Officially calling the table production quality. |
| 2025-04-30 | T391279 | Table available in the data lake, but not yet production. We still need to put Data Quality pipelines in place. |
FAQ
All of the same FAQs as in Data Platform/Data Lake/Content/Mediawiki content history v1#FAQ apply.
This table doesn't seem to have Hive partitions, such as the usual 'snapshot' column. How do I consume it?
This table indeed does not use a snapshot partition like other tables such as Mediawiki wikitext current. It uses a table format called Iceberg. Instead of rewriting all the data as we have done before, this technology allows us to update the content of the table in place, which is what makes a daily update cadence possible.
If you are building a data pipeline, and you need to define an Airflow sensor to wait on this table's updates, instead of waiting on Hive partitions, you should utilize our datasets.yaml configuration and appropriate helper functions to construct a sensor.
In your instance's datasets.yaml file:
iceberg_wmf_content_mediawiki_content_current_v1:
  datastore: iceberg
  table_name: wmf_content.mediawiki_content_current_v1
  produced_by:
    airflow:
      instance: main
      dag_id: mw_content_merge_changes_to_mw_content_current_daily
In your Airflow Python DAG code:
# here we use platform_eng, but you should use the config
# that applies to your Airflow instance
from platform_eng.config.dag_config import dataset
dataset_id = "iceberg_wmf_content_mediawiki_content_current_v1"
wait_for_sensor = dataset(dataset_id).get_sensor_for(dag)
Pipeline
- Every day, we wait for wmf_content.mediawiki_content_history_v1 to finish its daily ingestion.
- We then run a PySpark job that:
  - Uses SQL to calculate all the per-page changes (be it inserts, updates, or deletes) that happened upstream.
  - Applies these changes to wmf_content.mediawiki_content_current_v1 via a MERGE INTO.
- Because we calculate the changes via a full table scan of the upstream table and we apply them with a MERGE INTO command, this pipeline is self-healing. That is, if this pipeline fails for one day, the next day will pick up the changes from both days, etc.
- Data quality checks specific to this table are done via a separate PySpark job. Data Quality checks from the upstream table also apply via transitivity.
- This pipeline is coordinated via an Airflow DAG.
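The self-healing property can be illustrated with a toy model. This is plain Python, not the actual PySpark/Iceberg job: it treats the current table as a page_id-to-revision dict and a day's change set as a dict of inserts/updates (or None for deletes). Because each run merges the full cumulative change set, a skipped day is recovered by the next run:

```python
# Toy model of the MERGE INTO step. Merging the full change set is
# idempotent, so a run after a failed day converges to the same state
# as two successful daily runs.
def merge_changes(current, changes):
    for page_id, revision in changes.items():
        if revision is None:          # page deleted upstream
            current.pop(page_id, None)
        else:                         # insert or update
            current[page_id] = revision
    return current

current = {1: "rev10", 2: "rev11"}
day1 = {2: "rev12", 3: "rev13"}       # update page 2, insert page 3
day2 = {1: None, 3: "rev14"}          # delete page 1, update page 3

# Normal operation: apply day1's changes, then day2's.
a = merge_changes(merge_changes(dict(current), day1), day2)

# Day 1 failed: day 2's full scan of the upstream table sees the
# cumulative changes from both days in one change set.
b = merge_changes(dict(current), {**day1, **day2})

assert a == b == {2: "rev12", 3: "rev14"}
```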