Incidents/2025-05-07 Mobileapps + Cite

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-05-07 Mobileapps + Cite
Start: 2025-04-30 08:17:00
Task: T393134
End: 2025-05-06 21:48:00
People paged: ?
Responder count: 6
Coordinators: ?
Affected metrics/SLOs: ?
Impact: All mobile app readers on Android and iOS saw all footnotes rendered as [0] during the outage. Additionally, an initial fix caused footnotes to be unclickable in the mobile apps from 2025-05-06 09:30:00 until 2025-05-06 21:48:00.

A long-term migration project has been changing how Parsoid Cite footnote markers are rendered. Previously, markers were emitted in HTML as Western Arabic numerals (0..9) and localized using CSS; after the migration, the localized form is always emitted directly in HTML. The two numbering systems conflict, and the WMDE Technical Wishes team had been suppressing the problem with a transitional CSS rule that ignored old-style localization. In wmf.27 the Technical Wishes team removed this transitional rule, believing that all remaining instances of the old localization had been deleted, and intending to shake out any remaining incompatibilities. The Mobile apps HTML included a copy of the old footnote marker styling that the Technical Wishes team had not discovered, and the combination of old CSS and new HTML caused multiple user-facing impacts.
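The kind of leftover rule described above can be searched for mechanically. The following is a minimal sketch, not the actual audit that was performed: it assumes that footnote markers are matched by selectors containing `mw-ref` and that old-style localization shows up as a `content` declaration using a CSS `counter()`. The sample stylesheet is hypothetical and is not copied from the Mobileapps repository.

```python
import re

# Assumption: footnote marker rules use selectors containing "mw-ref".
MARKER_SELECTOR = re.compile(r"mw-ref", re.IGNORECASE)
# Old-style localization generates the visible number with a CSS counter.
COUNTER_CONTENT = re.compile(r"content\s*:[^;}]*\bcounter\s*\(", re.IGNORECASE)
# Naive rule splitter: "selector { declarations }" without nested blocks.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def find_legacy_marker_rules(css: str) -> list[str]:
    """Return selectors of footnote-marker rules that render text via counter()."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # drop comments
    hits = []
    for selector, body in RULE.findall(css):
        if MARKER_SELECTOR.search(selector) and COUNTER_CONTENT.search(body):
            hits.append(" ".join(selector.split()))  # normalize whitespace
    return hits


# Hypothetical stylesheet: one leftover counter-based marker rule,
# one harmless marker rule, and one unrelated counter rule.
sample_css = """
/* hypothetical leftover rule of the old, CSS-localized style */
.mw-ref > a::after { content: "[" counter(mw-Ref, decimal) "]"; }
.mw-ref > a { color: blue; }
.infobox::before { content: counter(row); }
"""

print(find_legacy_marker_rules(sample_css))
```

The splitter does not handle nested blocks such as `@media`, so a scan like this would only be a first pass over each repository that ships its own copy of the styling.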

Timeline

Write a step by step outline of what happened to cause the incident, and how it was remedied. Include the lead-up to the incident, and any epilogue.

Consider including a graph of the error rate or other surrogate.

Link to a specific offset in SAL using the SAL tool at https://sal.toolforge.org/ (example)

All times in UTC.

  • 2025-04-30 08:17 (SAL) Train deploys 1.44.0-wmf.27 including patch 1131684 to group1. OUTAGE BEGINS
  • 2025-05-02 00:03 First user reports the bug as task T393134
  • 2025-05-02 17:10 Correct identification of the root cause and pinging WMDE Technical Wishes team members (Phab)
  • 2025-05-02 21:09 First response from WMDE Technical Wishes, with suggested remediation (Phab)
  • 2025-05-06 09:57 (SAL) Initial fix attempt breaks clicking on footnote markers
  • 2025-05-06 19:43 (SAL) Mobileapps fixes fully deployed. OUTAGE ENDS

TODO: Clearly indicate when the user-visible outage began and ended.

Detection

A user reported the bug in task T393134. Other users also reported it in the duplicate tasks task T393248, task T393285, task T393310, task T393342, task T393390, task T393421, and task T393471.

Copy the relevant alerts that fired in this section.

Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?

TODO: If human only, an actionable should probably be to "add alerting".

Conclusions

OPTIONAL: General conclusions (bullet points or narrative)

What went well?

  • Root cause was correctly identified early on. Multiple development teams and some editors were generally aware of the ongoing migration and its potential risks.

What went poorly?

  • The deployment of a potentially breaking change should have been announced by WMDE Technical Wishes, in particular to make the Content Transform team aware, but it was not announced.
  • The problematic patch 1131684 could have been reverted, but this was complicated by many merge conflicts with other recent work, and by the need to regenerate tests in the repository.
  • May 1 is a holiday in Germany but not in the US, so none of the WMDE developers were available to monitor or respond to the train deployment until May 5. Similarly, the Wikimedia Hackathon and related travel may have reduced the pool of responders.
  • Tests in the Mobileapps repository are offline and use stored HTML output, which makes it impossible to detect incompatibilities with the upstream data source.
  • WMDE Technical Wishes did not search exhaustively enough for old CSS rules in unexpected places.
  • Users reported the bug rather than it being caught by automated testing or noticed by staff.
  • The Mobileapps team was running a marketing campaign in Japan, which amplified the impact of this issue.

Where did we get lucky?

  • Users discovered the issue within 24h and filed many discoverable tasks in Phabricator.
  • task T370027 is the related epic for migrating footnote markers to use "explicit numbering". task T383769 is the finalizing work which triggered the incident.

Actionables

  • TODO: add alerting
  • TODO: online, integration test for mobileapps
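One possible shape for the online integration test is a drift check between the stored HTML fixtures and freshly fetched upstream output. The sketch below is an illustration under stated assumptions, not an existing Mobileapps test: it assumes footnote markers carry the class `mw-ref`, the class names in the example strings are hypothetical, and fetching the live HTML is left to the caller so that no endpoint is invented here.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class MarkerShape(HTMLParser):
    """Record the (tag, classes) pairs found in and under footnote markers."""

    def __init__(self, marker_class: str = "mw-ref"):
        super().__init__()
        self.marker_class = marker_class  # assumption: class marking a footnote
        self.depth = 0                    # > 0 while inside a marker element
        self.shapes = set()

    def handle_starttag(self, tag, attrs):
        classes = tuple(sorted((dict(attrs).get("class") or "").split()))
        if self.depth or self.marker_class in classes:
            self.shapes.add((tag, classes))
            if tag not in VOID_TAGS:
                self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def marker_shapes(html: str) -> set:
    parser = MarkerShape()
    parser.feed(html)
    parser.close()
    return parser.shapes


def fixture_drift(stored_html: str, live_html: str) -> dict:
    """Report marker structure present on only one side."""
    stored, live = marker_shapes(stored_html), marker_shapes(live_html)
    return {
        "only_in_fixture": sorted(stored - live),
        "only_in_live": sorted(live - stored),
    }


# Hypothetical fixture and live output; real markup may differ.
stored = ('<sup class="mw-ref reference"><a href="#n1">'
          '<span class="mw-reflink-text">[1]</span></a></sup>')
live = ('<sup class="mw-ref reference"><a href="#n1">'
        '<span class="mw-reflink-text"><span class="cite-bracket">[</span>1'
        '<span class="cite-bracket">]</span></span></a></sup>')

drift = fixture_drift(stored, live)
print(drift)
```

A non-empty result would not prove a rendering bug, but it would flag that the stored fixtures no longer match what upstream emits, which is the gap noted under "What went poorly?".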

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes
People Were the people responding to this incident sufficiently different than the previous five incidents?
Were the people who responded prepared enough to respond effectively?
Were fewer than five people paged?
Were pages routed to the correct sub-team(s)?
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
Was a public wikimediastatus.net entry created?
Is there a phabricator task for the incident?
Are the documented action items assigned?
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
Were the people responding able to communicate effectively during the incident with the existing tooling?
Did existing monitoring notify the initial responders?
Were the engineering tools that were to be used during the incident, available and in service?
Were the steps taken to mitigate guided by an existing runbook?
Total score (count of all “yes” answers above)