Incidents/2025-07-17 Cite + VisualEditor list-defined reference disappearances

From Wikitech

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-07-17 Cite + VisualEditor list-defined reference disappearances
Start: 2025-07-17 16:12
End: 2025-07-21 09:30
Task: T400038
People paged: 0
Responder count: 5
Coordinators: Johannes Richter (WMDE), WMDE-Fisch
Affected metrics/SLOs: No relevant SLOs exist
Impact: When an article containing list-defined references was edited in VisualEditor, it was saved without those references, regardless of what the edit changed.

WMDE Technical Wishes deployed a change to the Cite extension that altered how VisualEditor serializes the <references> tag when saving edits. This tag contains all of the "list-defined references". Parsoid reads them from the data-mw attribute, as we had anticipated, but it also performs a supplementary check for a DOM <li> element with a known ID. The deployed code omitted this ID, which caused Parsoid to assume that all list-defined references had been deleted from the document. As a result, any page with list-defined references that was edited and saved in VisualEditor lost them.
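
The interaction described above can be illustrated with a toy model. This is a minimal sketch, not Parsoid's actual code: the class, the function, and the ID format are hypothetical names chosen for illustration, and the only assumption taken from the incident is that a reference must pass both the data-mw check and the <li> ID check to be kept.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReferenceListItem:
    """Toy stand-in for one <li> in the rendered references list."""
    ref_name: str
    element_id: Optional[str]  # hypothetical: the ID the supplementary check expects


def refs_kept_on_save(data_mw_names: List[str],
                      items: List[ReferenceListItem]) -> List[str]:
    """Keep a ref only if it is declared in data-mw AND its <li> carries an ID."""
    names_with_id = {item.ref_name for item in items if item.element_id}
    return [name for name in data_mw_names if name in names_with_id]


declared = ["smith2020", "jones2021"]

# Before the change: every list item carries an ID (format is illustrative).
with_ids = [ReferenceListItem(n, f"cite_note-{n}") for n in declared]
# After the change: data-mw is unchanged, but the IDs are missing.
without_ids = [ReferenceListItem(n, None) for n in declared]

print(refs_kept_on_save(declared, with_ids))     # ['smith2020', 'jones2021']
print(refs_kept_on_save(declared, without_ids))  # []
```

In this model the data-mw input is identical in both calls. Only the missing IDs change the outcome, which is why a review that looked at data-mw alone would not have caught the problem.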

The impact was limited because list-defined references are uncommon on many wikis and in many articles, and several of the larger wikis wrap all <references> tags in a template such as {{reflist}}.

Timeline

All times in UTC.

  • 2025-07-01 12:04 Work begins on task T396017, which will later lead to the outage.
  • 2025-07-11 11:04 Problematic patch 1168120 is merged and will be deployed with the next train.
  • 2025-07-17 16:12 OUTAGE BEGINS: Version 1.45.0-wmf.10 was rolled out to group2.
  • 2025-07-19 22:05 User:Gesetzesfreak runs into the bug, which triggers a discussion about suspected vandalism.
  • 2025-07-20 21:17 Johannes Richter (WMDE) files a bug report as task T400013.
  • 2025-07-20 21:35 David Lynch leaves a note on WMF Slack pointing to T400013.
  • 2025-07-21 05:31 Johannes Richter notifies WMDE Tech Wishes of the bug.
  • 2025-07-21 06:15 Bug assessment makes it clear that the issue affects all wikis and task T400038 is created.
  • 2025-07-21 07:30 OUTAGE ENDS: The commit that caused the issue is reverted, and the revert is backported to the Wikimedia production cluster.
  • 2025-07-23 08:37 All impacted edits are identified with automation in task T400053 and manual verification. They are resolved on-wiki.
  • 2025-07-23 14:00 Notified the seven Wikipedia language versions affected by the bug with a list of affected articles per project (example).

Detection

The issue was first noticed by an editor, who raised it on a German Wikipedia talk page. Existing monitoring did not notify the responders.

Conclusions

What went well?

  • The team's community communications staff member became aware of the issue quickly.
  • It was possible to identify the problematic patch and revert it in isolation.
  • Impacted edits had a subtle signature but it was still possible to make an exhaustive list, with only a reasonable number of false positives.
  • We had the capacity to make the issue our top priority and were able to resolve it ourselves.
  • Wikipedia editors on several affected wikis shared appreciation of our communication and our handling of the issue.
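
The "subtle signature" of an impacted edit can be approximated by comparing a revision with its parent. The following is a minimal sketch and not the automation used in task T400053: the function names are hypothetical, the regular expressions only cover plain <references>…</references> blocks (not templates), and the example child revision is an illustrative guess at what a damaged save might look like.

```python
import re
from typing import Set

# Matches <references ...>...</references>, but not the self-closing form.
REFERENCES_BLOCK = re.compile(
    r"<references\b[^>]*?(?<!/)>(.*?)</references\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Captures the name attribute of a <ref> tag (double-quoted, single-quoted, or bare).
REF_NAME = re.compile(
    r"""<ref\b[^>]*?\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""",
    re.IGNORECASE,
)


def list_defined_names(wikitext: str) -> Set[str]:
    """Names of refs defined inside a <references> block."""
    names = set()
    for block in REFERENCES_BLOCK.findall(wikitext):
        for match in REF_NAME.finditer(block):
            names.add(next(g for g in match.groups() if g is not None))
    return names


def lost_list_defined_refs(parent: str, child: str) -> Set[str]:
    """List-defined refs present in the parent revision but absent in the child."""
    return list_defined_names(parent) - list_defined_names(child)


parent_text = (
    'Intro.<ref name="a" />\n'
    "<references>\n"
    '<ref name="a">Source A</ref>\n'
    "</references>"
)
child_text = 'Intro, reworded.<ref name="a" />\n<references />'

print(lost_list_defined_refs(parent_text, child_text))  # {'a'}
```

A check like this will also flag edits where an editor removed a list-defined reference on purpose, which is one plausible source of false positives that manual verification has to filter out.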

What went poorly?

  • We learned too late of a large gap in our automated testing: we have no test cases that feed actual VisualEditor output into Parsoid as input.
  • The tasks were never marked "Unbreak Now!" as they should have been once we realized there was editor-facing impact and data loss.
  • Reverting the impacted edits caused well-intended user content to be deleted.
  • Some editors were accused of vandalism because of a software glitch out of their control.
  • The responding team could have added more updates on the task, letting observers know about our progress on the issue.
  • The WMDE software department does not have clear processes for how to deal with on-wiki incidents.

Where did we get lucky?

  • The conditions for triggering the bug were very specific, and only 49 revisions (which weren't already vandalism) were affected.
  • The resulting article content included Cite errors which are visible as red text and appear in maintenance categories, which is what alerted various editor communities.
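
The visible errors arise because the article body still invokes named references whose definitions are gone. A minimal sketch of a check for that condition in a single revision follows. It assumes that a self-closing <ref name="…" /> is an invocation and a non-self-closing <ref name="…"> is a definition, it ignores references defined through templates, and the function name is hypothetical.

```python
import re
from typing import Set

# Group 1: the tag's attributes. Group 2: "/" if the tag is self-closing.
REF_TAG = re.compile(r"<ref\b([^>]*?)(/?)>", re.IGNORECASE)
NAME_ATTR = re.compile(r'\bname\s*=\s*"([^"]*)"', re.IGNORECASE)


def undefined_ref_names(wikitext: str) -> Set[str]:
    """Names that are invoked with <ref name="x" /> but never defined."""
    invoked, defined = set(), set()
    for attrs, slash in REF_TAG.findall(wikitext):
        match = NAME_ATTR.search(attrs)
        if match is None:
            continue  # unnamed refs cannot be invoked elsewhere
        (invoked if slash else defined).add(match.group(1))
    return invoked - defined


broken_text = 'Intro.<ref name="a" />\n<references />'
healthy_text = (
    'Intro.<ref name="a" />\n'
    "<references>\n"
    '<ref name="a">Source A</ref>\n'
    "</references>"
)

print(undefined_ref_names(broken_text))   # {'a'}
print(undefined_ref_names(healthy_text))  # set()
```

An article in which no body text invokes the list-defined references would pass this check even after losing them, so it complements a revision-to-revision comparison rather than replacing it.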

Actionables

  • task T401335: Spike: define operational monitoring requirements for Cite error alerting
  • task T400311: Investigation: Write visual editor debug tool to produce Converter test cases
  • task T401334: Split out reusable Parsoid+Cite analysis module from scraper
  • task T400803: Tech debt: review uses of references list item id during Parsoid html2wt

Scorecard

Incident Engagement ScoreCard
Category | Question | Answer (yes/no) | Notes
People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes |
People | Were the people who responded prepared enough to respond effectively? | yes |
People | Were fewer than five people paged? | yes |
People | Were pages routed to the correct sub-team(s)? | yes |
People | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no |
Process | Was a public wikimediastatus.net entry created? | no |
Process | Is there a phabricator task for the incident? | yes |
Process | Are the documented action items assigned? | no |
Process | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Tooling | Did existing monitoring notify the initial responders? | no |
Tooling | Were the engineering tools that were to be used during the incident, available and in service? | yes |
Tooling | Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above) | 10