SLO/MediaWiki Content History Table
Organizational
Service
The Mediawiki content history v1 and Mediawiki content current v1 tables are a set of Iceberg tables updated by the MediaWiki Content Pipelines. These pipelines combine batch and streaming operations to provide new wiki-, page-, and revision-level information every day, instead of the more traditional once-a-month cadence. They also fix data consistency problems using a reconciliation algorithm that checks data directly against MediaWiki's databases.
For the purposes of defining the SLO and writing this document, everything is based on the Mediawiki content history v1 table. Since Mediawiki content current v1 is a subset of that table, ensuring correctness in the former automatically ensures correctness in the latter.
The specific components implemented as part of the overall product of MediaWiki Content History are described below.
Batch processing
A set of Airflow DAGs made of PySpark jobs build the batch components of the system. The MediaWiki Content History Daily DAG executes the following operations:
- Daily merge of page_content_change events.
- Daily merge of revision_visibility_change events.
- Daily merge of content_reconcile_enrichment events.
- Daily detection of missing data and generation of reconciliation events.
- Merge reconciled event data into the Mediawiki content history v1 table.
These pipelines are responsible for processing mediawiki.page_content_change events, reconciling missing events against the MediaWiki MariaDB replicas, and inserting the results into the Mediawiki content history v1 Iceberg table.
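Conceptually, each daily merge is an upsert keyed by wiki, page, and revision, keeping the most recent event per key. The following is only an illustrative model in plain Python — the real pipelines are PySpark jobs writing to Iceberg, and the field names below are assumptions, not the actual event schema.

```python
# Conceptual model of a daily merge: upsert events into a content-history
# state keyed by (wiki, page_id, revision_id), keeping the latest event.
# Illustrative only -- the real pipeline is a PySpark merge into Iceberg,
# and these field names are assumptions, not the actual schema.

def merge_daily_events(state: dict, events: list) -> dict:
    """Upsert a day's worth of events into the table state."""
    for event in events:
        key = (event["wiki"], event["page_id"], event["revision_id"])
        current = state.get(key)
        # Keep only the most recent version of each revision's row.
        if current is None or event["event_time"] > current["event_time"]:
            state[key] = event
    return state

state = {}
day1 = [{"wiki": "enwiki", "page_id": 1, "revision_id": 10, "event_time": 1}]
day2 = [{"wiki": "enwiki", "page_id": 1, "revision_id": 10, "event_time": 2},
        {"wiki": "enwiki", "page_id": 2, "revision_id": 11, "event_time": 2}]
state = merge_daily_events(state, day1)
state = merge_daily_events(state, day2)
```

Later events for the same revision overwrite earlier ones, which is how replayed or reconciled events converge to a consistent row.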
Stream processing
mw-content-history-reconcile-enrich is a streaming application that consumes reconciliation events and enriches them with page content body and redirect info.
Teams
The Data Engineering team is responsible for the MediaWiki Content History pipelines and the related tables.
Architectural
Environmental dependencies
Batch Processing
Data pipelines are implemented as PySpark jobs and are executed on the WMF Data Lake (Hadoop), a cluster of bare-metal machines deployed in eqiad. Data pipeline orchestration is managed via Airflow DAGs, deployed on the main Airflow instance, running on DSE Kubernetes in eqiad. The MediaWiki Content History DAGs can be easily identified with the tag mediawiki_content.
Stream Processing
The reconciliation event enrichment process (part of mediawiki-event-enrichment) is implemented as a PyFlink application, running on DSE Kubernetes in eqiad.
Reconciliation
MariaDB replica lag, erroneous responses from the MediaWiki Action API, or database changes that cannot be captured by EventBus hooks will impact the availability of page_content_change events.
See also Mediawiki Page Content Change Enrichment.
Service dependencies
For dependencies without an SLO yet, or dependencies that habitually miss their SLO, we assume that they maintain their historical performance, or worsen slightly but not dramatically (as recommended by the template instructions).

Hard
MediaWiki Content History has hard dependencies on:
- YARN and Kubernetes (dse-k8s)
- HDFS
Degradation of hard dependencies will impact the ability to access the table's content.
Soft
MediaWiki Content History has soft dependencies on:
- Gobblin and Refine ingestion pipeline (no SLO)
- MediaWiki Action APIs, queried for reconciliation event enrichment.
- MariaDB analytics replicas (no SLO), queried for reconciliation events emission.
Degradation of soft dependencies will impact the system’s ability to refresh the tables, and to produce data dumps on schedule.
Indirect
The API boundaries for data pipelines are HDFS and Hive. However, these datasets are downstream of Event Platform services. Therefore, MediaWiki Content Pipelines has indirect dependencies on these services:
- EventBus (SLO) produces page_change data.
- mediawiki-page-content-change-enrich (SLO), consumes page_change and page_content_change events.
- Kafka (main, jumbo)
- Airflow
Client-facing
The service clients are consumers of the Mediawiki content history v1 and Mediawiki content current v1 Iceberg tables.
Immediate clients are teams internal to the Wikimedia Foundation. The following main downstream dependencies and their expectations were taken into account when formulating this SLO:
| Dependency | Team | Summary of Expectations |
|---|---|---|
| Image Suggestions | Growth, Structured Content (no longer active) | The table should not be unavailable for queries for longer than a week. Its data can lag a few days behind MediaWiki. Completeness is not so important, within reason (i.e. >70%). |
| Content Diff | Research | Content Diff is used as the source for various Machine Learning workflows, in batch inference of models. Guarantees around completeness are good, and any reconciliation (rewriting the past) could cause noticeable data gaps downstream. |
| Knowledge Gaps and Movement Metrics | Movement Insights | Movement Metrics is a monthly report, which partially depends on MediaWiki Content History via the Knowledge Gaps pipeline. Completeness is important, freshness less so: 3-5 days into a month, all data for the previous month should be present. |
Service Level Indicators (SLIs)
We define system health by ensuring that data landing in the wmf_content.mediawiki_content_history_v1 Iceberg table meets the following objective:
- completeness: ensures that the page and revision changes recorded in MariaDB are present and accessible in the table.
We measure this objective with the following indicator:
- Completeness SLI for mediawiki_content_history_v1: For any given UTC calendar day D, this SLI is the completeness of the table. Completeness is defined as the percentage of page_id and revision_id rows of create, edit, and move kind with date < D that are available in the table, across all wikis. The table is considered complete if this percentage is above a given threshold. The revisions stored in the table are compared against the ones stored in MariaDB.
Example: On 2025-12-01 UTC, the system computes the percentage of page_id and revision_id combinations for all Wikis that are available in the wmf_content.mediawiki_content_history_v1 table compared to the corresponding records in the MariaDB analytics replicas.
The comparison only includes records where page_change_kind is create, edit, or move, as these represent the actionable page changes that should be captured by the content history pipeline. The metric looks backward from day 2025-12-01 (not inclusive of that date) up until the first record stored in the table.
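As a rough model of this comparison, the SLI is the share of (page_id, revision_id) pairs found in the MariaDB source that are also present in the Iceberg table, restricted to create/edit/move changes dated strictly before D. A minimal Python illustration follows — the real computation runs as a Spark job against the replicas, and the row structure here is hypothetical.

```python
from datetime import date

# Illustrative completeness computation: the fraction of MariaDB-sourced
# (page_id, revision_id) rows of kind create/edit/move, dated before D,
# that are also present in the content-history table. Rows and field
# names are hypothetical; the real check is a Spark job.

def completeness(source_rows, table_keys, day):
    expected = {
        (r["page_id"], r["revision_id"])
        for r in source_rows
        if r["kind"] in {"create", "edit", "move"} and r["date"] < day
    }
    if not expected:
        return 1.0
    found = expected & table_keys
    return len(found) / len(expected)

source = [
    {"page_id": 1, "revision_id": 10, "kind": "edit", "date": date(2025, 11, 29)},
    {"page_id": 1, "revision_id": 11, "kind": "edit", "date": date(2025, 11, 30)},
    {"page_id": 2, "revision_id": 12, "kind": "delete", "date": date(2025, 11, 30)},
]
table = {(1, 10), (1, 11)}
score = completeness(source, table, date(2025, 12, 1))  # delete row excluded
```

Note that the delete row does not count against completeness, matching the restriction to create, edit, and move kinds.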
Operational
Monitoring
mediawiki_content_history data pipelines are monitored by a combination of Grafana, Logstash and Airflow UI dashboards.
Batch processing
The status of Spark jobs can be monitored in the analytics instance Airflow UI:
- mw_content_merge_changes_to_mw_content_current_daily
- mw_content_merge_events_to_mw_content_history_daily
- mw_content_reconcile_mw_content_history_daily
- mw_content_reconcile_mw_content_history_monthly
Airflow will alert on failure by sending an email to data-engineering-alerts@wikimedia.org.
Stream processing
The Flink application emits time-series metrics (counters and gauges) for all SLIs, which are available in Grafana. Alerts are triggered and delivered by the Alert Manager.
Troubleshooting
Batch processing
Data pipelines have been implemented according to the Airflow Developer Guide guidelines. Generic troubleshooting and operations are described on the DPE Ops week page. Upon failure, data pipelines have to be re-run to backfill data.
Stream processing
mw-content-history-reconcile-enrich depends on Kafka and the Action API. Operational errors are expected to be correlated to the performance of either system. The application is deployed on k8s, integrates with the observability platform and follows deployment pipeline operational conventions.
The application emits errors (exceptions, invalid records, HTTP timeouts after retries) into a Kafka error topic: eqiad.mw_content_history_reconcile_enrich.error. Logs are also forwarded to Logstash in ECS format.
The application is deployed with high-availability enabled within Kubernetes HA services, and checkpoints to Ceph. Upon application failure, enrichment will restart from the latest processed Kafka offset.
Runbook
The following runbook was created to allow easy querying of the metric.
Deployment
Batch Processing
Data pipeline DAGs are stored in the airflow-dags repository and are manually deployed on the analytics Airflow instance using scap. The deployment steps are documented on Wikitech. The source code is stored in the mediawiki-content-dump repository and is deployed through a CI/CD pipeline after a merge request is merged.
Stream Processing
The mw-content-history-reconcile-enrich application is deployed on dse-k8s-eqiad following deployment pipeline practices. The enrichment implementation is available in the mediawiki-event-enrichment monorepo and is packaged in the corresponding docker image.
Metric Processing
To compute the SLO metric, an Airflow DAG has been deployed. The code it executes is stored in the mediawiki-content-dump repository and is deployed through a CI/CD pipeline after a merge request is merged.
Service Level Objectives
- Completeness SLO for mediawiki_content_history_v1: On at least 85% of the days in a 4-week rolling window (24 out of 28 days), MediaWiki Content History contains more than 99.5% of the source page_id and revision_id rows of create, edit, and move kind, for all the Wikis, compared to the MariaDB analytics replicas.
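Under this definition, the SLO over a 28-day window reduces to counting days whose completeness exceeds the 99.5% threshold. A minimal sketch, using illustrative values rather than real measurements:

```python
# Illustrative SLO check: the objective is met when at least 24 of the
# last 28 days have completeness above the 99.5% threshold.
THRESHOLD = 0.995
REQUIRED_GOOD_DAYS = 24  # 85% of a 28-day window

def slo_met(daily_completeness):
    good_days = sum(1 for c in daily_completeness if c > THRESHOLD)
    return good_days >= REQUIRED_GOOD_DAYS

# Example: 25 healthy days and 3 bad days still meet the objective.
window = [0.9998] * 25 + [0.99, 0.99, 0.99]
ok = slo_met(window)
```

This formulation tolerates up to four bad days in any rolling window before the objective is missed.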
The following dashboard was built to track the status of the metric.
Assumptions
Reasons behind the threshold
The pipeline has been tested to determine how many days of missing data would be needed to impact the final result. The completeness obtained by the pipeline in a regular environment (the MWCH table is up to date and accessible, the source tables are updated and accessible, the date used to filter is correct and available) returned an average score of 99.98%. Therefore, at the time of writing, the SLO would already trigger if 99.99% were picked as a threshold. Based on this preliminary result, a series of experiments was conducted to gradually reduce the content of the MWCH table and determine how many days would be required before the outcome was significantly affected.
The first constraint in picking a threshold was that the metric be resilient to a full weekend outage, given that the Data Engineering team is not on call for failures. Here are the results:
| completeness result | days missing | scenario day |
|---|---|---|
| 0.99981927 | 0 | Thursday |
| 0.99960704 | 1 | Friday |
| 0.99935919 | 2 | Saturday |
| 0.99912513 | 3 | Sunday |
| 0.99889050 | 4 | Monday |
In a fictitious scenario where an outage starts on a Friday, it is easy to see that by the following Monday the metric would have lost a whole point (from .999 to .998). Therefore, if 99.9% were picked as the threshold, it would have raised an alert.
Having discarded this option, the chosen threshold is 99.5%, since it gives us enough time to pinpoint the source of the malfunction and carry out the backfilling needed to restore the data, as shown by the following results:
| completeness result | days missing |
|---|---|
| 0.99981927 | 0 |
| 0.99889050 | 4 |
| 0.99793433 | 9 |
| 0.99685118 | 14 |
| 0.99582976 | 19 |
| 0.99496714 | 24 |
It would take 19 consecutive days of missing data to approach 99.5% completeness, and an additional 5 days to trigger the alarm. This choice gives us enough time to take all the actions necessary to restore the table.
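The experiment table is consistent with a roughly linear decay. Using the measured endpoints (0.99981927 at 0 missing days, 0.99496714 at 24), a back-of-the-envelope estimate of the per-day loss and of when the 99.5% threshold is crossed looks like this (an approximation, not the actual experiment code):

```python
# Back-of-the-envelope estimate of how many missing days cross the
# 99.5% threshold, assuming roughly linear decay between the measured
# endpoints from the experiment table above.
start, end, span_days = 0.99981927, 0.99496714, 24
rate_per_day = (start - end) / span_days  # ~0.0002 completeness lost per day

threshold = 0.995
days_to_cross = (start - threshold) / rate_per_day  # just under 24 days
```

The linear model lands close to the 24-day figure observed in the experiments, supporting the choice of threshold.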
Why not just measure the completeness percentage?
The first iteration of the SLI and SLO definitions took into account just the completeness percentage, requiring the value to be >99.5% every day. This was not well received by Pyrra due to the nature of the data pushed to the Prometheus gateway; an explanation can be found here. To allow a growing daily metric, it was decided to switch the values to:
- the number of days on which the metric was computed
- the number of alerts triggered.
An alert is triggered when the completeness of the table is under the 99.5% threshold.
This way the metrics are always growing and behave like a counter, both qualities that Pyrra appreciates.
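A minimal sketch of this counter-based reformulation, with hypothetical metric names: both values only ever increase, and the ratio of alerts to computed days yields the error rate that the SLO tooling can evaluate.

```python
# Counter-based reformulation: both metrics only ever increase, so they
# behave like Prometheus counters. Metric and variable names here are
# hypothetical, not the actual metric names pushed to the gateway.
days_computed = 0
alerts_triggered = 0

def record_day(completeness, threshold=0.995):
    global days_computed, alerts_triggered
    days_computed += 1
    if completeness < threshold:  # alert: table below completeness threshold
        alerts_triggered += 1

for value in [0.9998, 0.9997, 0.9930, 0.9998]:
    record_day(value)

error_ratio = alerts_triggered / days_computed  # 1 alert over 4 days
```

Because neither counter ever decreases, standard Prometheus rate and ratio queries apply cleanly, which is what makes this shape workable for Pyrra.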
Why the date is excluded in the SLI computation
The date provided in the example (2025-12-01) is excluded from the data selection due to the nature of the table update: every day, wmf_content.mediawiki_content_history_v1 is updated with the data of the previous day. Therefore, on the execution date of the check, including that date would provide no new data, only what precedes it. The table below shows the situation described:
| pipeline execution date | queryable date of the table |
|---|---|
| 2025-12-01 | 2025-11-30 |
| 2025-12-02 | 2025-12-01 |
| 2026-01-01 | 2025-12-31 |
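The relationship between the pipeline execution date and the latest queryable date in the table above is a one-day offset, which can be sketched as (illustrative helper, not pipeline code):

```python
from datetime import date, timedelta

# The table is updated with the previous day's data, so on any execution
# date the latest queryable day is execution_date minus one day; the SLI
# therefore selects rows with date strictly before the execution date.
def latest_queryable_date(execution_date: date) -> date:
    return execution_date - timedelta(days=1)

d = latest_queryable_date(date(2025, 12, 1))
```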