Data Platform/Data Lake/Content/Mediawiki content current v1
wmf_content.mediawiki_content_current_v1 is a dataset available in the Data Lake that provides the full content of the latest revision of every page across all Wikimedia wikis. It is updated daily.
The schema of this table is similar to that of Mediawiki wikitext current. However, this table's source data is the daily updated wmf_content.mediawiki_content_history_v1. The content under this table's revision_content_slots['main'] is typically unparsed wikitext. Multi-Content Revisions content is also included; as of this writing, it is only used by commonswiki.
If you are instead interested in all revisions, past and present, then the Mediawiki content history v1 dataset is more appropriate.
Consuming this table is different from consuming snapshot-based tables like Mediawiki wikitext current. See the FAQ below for details.
Schema
Note: This is an Iceberg dataset.
col_name | data_type | comment |
---|---|---|
page_id | bigint | The (database) page ID of the page. |
page_namespace_id | int | The ID of the namespace this page belongs to. |
page_title | string | The normalized title of the page. If page_namespace_id = 0, then this is the non-namespaced title. If page_namespace_id != 0, then the title is prepended with the localized namespace. Examples for "enwiki": "Main_Page" and "Talk:Main_Page". |
page_redirect_target | string | The title of the redirected-to page, if any. Same rules as page_title. |
user_id | bigint | The ID of the user that made the revision; null if anonymous, zero if an old system user, and -1 when deleted or malformed XML was imported. |
user_text | string | The text of the user that made the revision (either username or IP address). |
user_is_visible | boolean | Whether the user that made the revision is visible. If this is false, then the user should be redacted when shown publicly. See RevisionRecord->DELETED_USER. |
revision_id | bigint | The (database) revision ID. |
revision_parent_id | bigint | The (database) revision ID of the parent revision. |
revision_dt | timestamp | The (database) time this revision was created. This is rev_timestamp in the MediaWiki database. |
revision_is_minor_edit | boolean | True if the editor marked this revision as a minor edit. |
revision_comment | string | The comment left by the user when this revision was made. |
revision_comment_is_visible | boolean | Whether the comment of the revision is visible. If this is false, then the comment should be redacted when shown publicly. See RevisionRecord->DELETED_COMMENT. |
revision_sha1 | string | Nested SHA1 hash of hashes of all content slots. See https://www.mediawiki.org/wiki/Manual:Revision_table#rev_sha1 |
revision_size | bigint | The sum of the content_size of all content slots. |
revision_content_slots | MAP<STRING, STRUCT<content_body: STRING, content_format: STRING, content_model: STRING, content_sha1: STRING, content_size: BIGINT>> | A MAP containing all the content slots associated with this revision. Typically just the "main" slot, but also "mediainfo" for commonswiki. |
revision_content_is_visible | boolean | Whether revision_content_slots is visible. If this is false, then any content should be redacted when shown publicly. See RevisionRecord->DELETED_TEXT. |
wiki_id | string | The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc. |
row_update_dt | timestamp | Control column to efficiently update this table from source at wmf_content.mediawiki_content_history_v1. |
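Because revision_content_slots is a MAP of STRUCTs, the main slot's wikitext is read with map and struct accessors. The following is a minimal sketch, assuming a Spark session that already has access to the Data Lake catalog (for example, on an analytics client); the wiki and page used here are purely illustrative:

from pyspark.sql import SparkSession

# assumes an environment where the Data Lake catalog is already configured for Spark
spark = SparkSession.builder.getOrCreate()

latest_wikitext = spark.sql("""
    SELECT
        page_title,
        revision_id,
        revision_dt,
        revision_content_slots['main'].content_body AS wikitext
    FROM wmf_content.mediawiki_content_current_v1
    WHERE wiki_id = 'enwiki'
      AND page_namespace_id = 0
      AND page_title = 'Main_Page'
""")
latest_wikitext.show(truncate=80)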
Changes and known problems
Date | Phab Task | Details |
---|---|---|
2025-05-19 | task T392494 | Added data quality metrics. Officially calling the table production quality. |
2025-04-30 | task T391279 | Table available in the Data Lake, but not yet production quality. Data Quality pipelines still need to be put in place. |
FAQ
All of the same FAQs as in Data Platform/Data Lake/Content/Mediawiki content history v1#FAQ apply.
This table doesn't seem to have Hive partitions, such as the usual 'snapshot' column. How do I consume it?
This table indeed does not use a snapshot partition like other tables such as Mediawiki wikitext current. It uses a table format called Iceberg. Instead of rewriting all of the data on every run, as we have done before, Iceberg allows us to update the table's content in place, which is what makes a daily update cadence possible.
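If you just want to check how fresh the table is, one option is to query Iceberg's snapshots metadata table. This is a sketch, assuming your Spark session has Iceberg support configured; alternatively, MAX(row_update_dt) should reflect when rows were last updated.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg exposes metadata tables; 'snapshots' lists each commit made to the table
spark.sql("""
    SELECT committed_at, operation
    FROM wmf_content.mediawiki_content_current_v1.snapshots
    ORDER BY committed_at DESC
    LIMIT 5
""").show(truncate=False)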
If you are building a data pipeline and need to define an Airflow sensor to wait on this table's updates, do not wait on Hive partitions; instead, use our datasets.yaml configuration and the appropriate helper functions to construct a sensor.
In your instance's datasets.yaml file:
iceberg_wmf_content_mediawiki_content_current_v1:
datastore: iceberg
table_name: wmf_content.mediawiki_content_current_v1
produced_by:
airflow:
instance: main
dag_id: mw_content_merge_changes_to_mw_content_current_daily
In your Airflow Python DAG code:
# here we use platform_eng, but you should use the config
# that applies to your Airflow instance
from platform_eng.config.dag_config import dataset
dataset_id = "iceberg_wmf_content_mediawiki_content_current_v1"
wait_for_sensor = dataset(dataset_id).get_sensor_for(dag)
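The object returned by get_sensor_for is a regular Airflow sensor task, so you can place it upstream of whatever consumes the table, e.g. wait_for_sensor >> your_consuming_task (where your_consuming_task is a hypothetical task in your DAG).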
Pipeline
- Every day, we wait for wmf_content.mediawiki_content_history_v1 to finish its daily ingestion.
- We then run a PySpark job that:
  - Uses SQL to calculate all the per-page changes (inserts, updates, or deletes) that happened upstream.
  - Applies these changes to wmf_content.mediawiki_content_current_v1 via a MERGE INTO (see the sketch after this list).
- Because we calculate the changes via a full table scan of the upstream table and apply them with a MERGE INTO command, this pipeline is self-healing: if it fails for one day, the next day's run picks up the changes from both days.
- Data quality checks specific to this table are done via a separate PySpark job. Data Quality checks from the upstream table also apply transitively.
- This pipeline is coordinated via an Airflow DAG.
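For reference, the update step is conceptually a Spark SQL MERGE INTO along the lines of the sketch below. This is a simplified illustration rather than the production job: the changes view, the join keys, and the omission of deletes are assumptions made for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'changes' is assumed to be a temporary view with the same schema as the target
# table, holding the pages whose latest revision changed upstream; deletes are
# left out of this sketch for simplicity
spark.sql("""
    MERGE INTO wmf_content.mediawiki_content_current_v1 AS target
    USING changes AS source
    ON  target.wiki_id = source.wiki_id
    AND target.page_id = source.page_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")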