2025-04-11 Mediawiki History duplicate revisions and excess reverts
Status | Closed
Severity | High
Business data steward |
Technical data steward | Andreas Hoelzl
Incident coordinator | Andreas Hoelzl
Incident response team | Data Engineering, Neil Shah-Quinn
Date detected | 2025-04-11
Date resolved | 2025-04-28
Start of issue | 2025-04-02
Phabricator ticket | T391708
Summary
Key investigation findings
The 2025-03 snapshot of mediawiki_history has 256 million duplicate revision-create events, whereas the previous snapshot had none (per the uniqueness criteria).
- These duplicates are 3.3% of total revision-create events (5.1% when counting the first instance of a duplicated row).
- The affected revisions often have more than one duplicate.
- Other event types are not affected (there are duplicates of these event types, but the numbers are consistent with the previous snapshot).
- There does not seem to be any duplication in the sqooped tables in wmf_raw, so the issue seems to have happened during the generation of mediawiki_history itself.
- Some groups of duplicates have inconsistent values for event_user_revision_count, revision_is_identity_revert, and revision_is_identity_reverted.
Additionally, the share of revisions with revision_is_identity_revert is abnormally high in the 2025-03 snapshot. It is by far the highest among duplicated revisions, but even among non-duplicated revisions it is almost 3 times higher than in the previous snapshot. So, confusingly, the excess of reverts seems to be separate from, but still influenced by, the revision duplication.
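For illustration, a check along these lines can be expressed as a PySpark query against the wmf.mediawiki_history table. This is a minimal sketch: the grouping key (wiki_db, revision_id) and the column names are assumptions based on the published mediawiki_history schema, not necessarily the exact uniqueness criteria used in the investigation.

```python
# Sketch of a duplicate / revert-share check on mediawiki_history.
# Column names and the grouping key are assumptions; adjust to the actual schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mwh-duplicate-check").getOrCreate()

rev_create = (
    spark.table("wmf.mediawiki_history")
    .where(
        (F.col("snapshot") == "2025-03")
        & (F.col("event_entity") == "revision")
        & (F.col("event_type") == "create")
    )
)

# Each (wiki_db, revision_id) key should appear exactly once if the
# uniqueness criteria hold; count the keys that appear more than once.
dup_keys = (
    rev_create.groupBy("wiki_db", "revision_id")
    .agg(F.count("*").alias("n_rows"))
    .where(F.col("n_rows") > 1)
)
print("duplicated revision-create keys:", dup_keys.count())

# Share of revision-create events flagged as identity reverts, for
# comparison with the previous snapshot (re-run with snapshot == "2025-02").
rev_create.agg(
    F.avg(F.col("revision_is_identity_revert").cast("double")).alias("revert_share")
).show()
```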
Root cause
The mediawiki history pipeline was executed twice for the 2025-03 snapshot run:
- 2025-04-01 … took 9.6 hrs, failed
- 2025-04-02 … took 18.6 hrs, succeeded (the typical runtime had so far been 7.5 hours)
Investigation of the Spark history servers showed a high number of failed tasks and retry attempts.
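For reference, failed-task counts can also be pulled programmatically from a Spark history server's REST API. The sketch below is only an illustration; the host and application ID are placeholders and depend on the actual cluster setup.

```python
# Sketch: list stages with failed tasks via the Spark history server REST API.
import requests

HISTORY_SERVER = "http://spark-history.example.org:18080"  # placeholder host
APP_ID = "application_1234567890123_0001"                  # placeholder app id

stages = requests.get(
    f"{HISTORY_SERVER}/api/v1/applications/{APP_ID}/stages"
).json()

# Stages with failed tasks hint at retried (and potentially
# non-deterministic) shuffle work.
for stage in stages:
    if stage.get("numFailedTasks", 0) > 0:
        print(stage["stageId"], stage["name"], "failed tasks:", stage["numFailedTasks"])
```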
Retries of Spark tasks that depend on a shuffle stage are not deterministic, leading to unpredictable data corruption.
None of the data analysis results pointed to a conclusive source-data issue or failure pattern, and an attempted re-run without any parameter changes produced yet another, different set of failures.
All other recent data platform changes (enabling Spark data lineage, the Airflow migration to Kubernetes, temporary accounts) could successively be ruled out as causes.
We therefore conclude that the pipeline was impacted by cluster resource constraints.
Interestingly, we could not find any significantly elevated cluster load (CPU, HDFS) caused by other processes during that time. We suspect that the pipeline has crossed a critical data-size threshold beyond which the previously defined runtime parameters are no longer valid.
Re-running the pipeline with fewer executor cores yielded successful results.
Affected datasets and services
- Primarily impacted
- mediawiki_history
- druid.edits_hourly
- wmf.edit_hourly
- stats.wikimedia.org
- Indirectly impacted
- Trust and Safety's data pipelines need to be rerun. Other dashboards that rely on mediawiki_history will be affected as well.
- See also https://codesearch.wmcloud.org/search/?q=requires_wmf_mediawiki_history&files=&excludeFiles=&repos=
Resolution
The re-run on 2025-04-24 with executor_cores=2 yielded good results, with duplicate counts back to nominal values.
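For illustration only, an override of this kind could look like the following if the job is launched through Airflow's SparkSubmitOperator. The DAG id, jar path, class name, and the memory/executor values are placeholders, not the actual pipeline configuration; only executor_cores=2 comes from the resolution above.

```python
# Hypothetical Airflow task showing an executor_cores override; all names,
# paths, and resource values other than executor_cores=2 are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="mediawiki_history_rerun_example",  # hypothetical DAG id
    start_date=datetime(2025, 4, 24),
    schedule=None,
    catchup=False,
):
    rerun = SparkSubmitOperator(
        task_id="mediawiki_history_rerun",
        application="hdfs:///path/to/refinery-job.jar",  # placeholder artifact
        java_class="org.example.MediawikiHistoryJob",    # placeholder class
        executor_cores=2,        # the reduced per-executor core count
        executor_memory="16g",   # assumed value, illustration only
        num_executors=64,        # assumed value, illustration only
    )
```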