Incidents/2025-05-07 Mobileapps + Cite

From Wikitech

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-05-07 Mobileapps + Cite
Start: 2025-04-30 08:17:00
End: 2025-05-06 21:48:00
Task: T393134
People paged: 1
Responder count: 6
Coordinators: DBrant (WMF)
Affected metrics/SLOs: ?
Impact: All mobile app readers on Android and iOS saw all footnotes rendered as [0] during the outage. Additionally, an initial fix caused footnotes to be unclickable in the mobile apps from 2025-05-06 09:30:00 until 2025-05-06 21:48:00.

A long-term migration project has been changing how Parsoid Cite footnote markers are rendered. Previously, markers were emitted in HTML as Western Arabic numerals (0..9) and localized using CSS; after the migration, the localized form is always emitted directly in the HTML. The two numbering systems conflict, and the WMDE Technical Wishes team had been suppressing the conflict with a transitional CSS rule that caused old-style localization to be ignored. In wmf.27, the Technical Wishes team removed this transitional rule, believing that all remaining instances of the old localization had been deleted and intending to shake out any remaining incompatibilities. However, the Mobile apps HTML included a copy of the old footnote marker styling that the Technical Wishes team had not discovered, and the combination of old CSS and new HTML caused multiple user-facing impacts.

Timeline

All times in UTC.

  • 2025-04-30 08:17 (SAL) Train deploys 1.44.0-wmf.27 including patch 1131684 to group1. OUTAGE BEGINS
  • 2025-05-02 00:03 First user reports the bug as task T393134
  • 2025-05-02 17:10 Correct identification of the root cause and pinging WMDE Technical Wishes team members (Phab)
  • 2025-05-02 21:09 First response from WMDE Technical Wishes, with suggested remediation (Phab)
  • 2025-05-06 09:57 (SAL) Initial fix attempt breaks clicking on footnote markers
  • 2025-05-06 19:43 (SAL) Mobileapps fixes fully deployed. OUTAGE ENDS

Detection

Users reported the bug in task T393134, with further reports filed as duplicates (task T393248, task T393285, task T393310, task T393342, task T393390, task T393421, and task T393471). No automatic alerts fired; see Actionables.

Conclusions

What went well?

  • Root cause was correctly identified early on. Multiple development teams and some editors were generally aware of the ongoing migration and its potential risks.

What went poorly?

  • WMDE Technical Wishes did not announce the deployment of a potentially breaking change. An announcement would have been especially useful for making the Content Transform team aware.
  • The problematic patch 1131684 could have been reverted, but this was complicated by many merge conflicts with other recent work, and by the need to regenerate tests in the repository.
  • May 1 is a holiday in Germany but not in the US, so none of the WMDE developers were available to monitor or respond to the train deployment until May 5. Similarly, the Wikimedia Hackathon and related travel may have reduced the pool of responders.
  • Tests in the Mobileapps repository are offline and use stored HTML output, which makes it impossible to detect incompatibilities with the upstream data source.
  • The mobileapps service (PCS) contains local copies of styles (.less files) copied manually from various extensions, including Cite. There is no automatic mechanism to keep these styles up to date, so they are bound to diverge from styles expected by the HTML output.
  • WMDE Technical Wishes did not search exhaustively enough for old CSS rules in unexpected places.
  • Users reported the bug rather than it being caught by automated testing or noticed by staff.
  • The Mobileapps team was running a marketing campaign in Japan at the time, which amplified the impact of this issue.
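
The search for leftover old-style CSS described in the list above could in principle be partly automated. The following is a minimal sketch, not the approach taken by any team: it scans stylesheet text for flat rules that generate footnote marker text from a CSS counter. The selector hints (`mw-ref`, `.reference`), the counter name `mw-Ref`, and the sample rules are assumptions made up for illustration, not copied from Cite or mobileapps. Nested Less rules are not handled.

```python
import re

# Matches innermost "selector { declarations }" blocks only (no nested Less).
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
# A "content:" declaration whose value calls counter(...).
COUNTER_CONTENT_RE = re.compile(r"content\s*:[^;]*counter\s*\(", re.IGNORECASE)
# Assumed substrings identifying footnote-marker selectors (hypothetical).
MARKER_SELECTOR_HINTS = ("mw-ref", ".reference")


def find_counter_marker_rules(stylesheet: str) -> list[str]:
    """Return selectors of rules that render marker text via a CSS counter."""
    # Drop /* ... */ comments so they cannot be mistaken for selectors.
    text = re.sub(r"/\*.*?\*/", "", stylesheet, flags=re.DOTALL)
    hits = []
    for selector, body in RULE_RE.findall(text):
        selector = " ".join(selector.split())  # normalize whitespace
        is_marker = any(hint in selector for hint in MARKER_SELECTOR_HINTS)
        if is_marker and COUNTER_CONTENT_RE.search(body):
            hits.append(selector)
    return hits


# Illustrative stylesheet; these rules are invented for the example.
SAMPLE = """
/* old-style numbering (illustrative only) */
.mw-ref > a::after {
    content: '[' counter( mw-Ref, decimal ) ']';
}
.mw-ref > a { counter-increment: mw-Ref; }
.unrelated::before { content: counter( item ); }
"""

if __name__ == "__main__":
    for selector in find_counter_marker_rules(SAMPLE):
        print("counter-based marker rule:", selector)
```

A check like this only flags candidates for human review. It cannot tell whether a flagged rule actually conflicts with the HTML being served, so it would complement rather than replace online integration tests.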

Where did we get lucky?

  • Users discovered the issue within 24h and filed many discoverable tasks in Phabricator.
  • For context: task T370027 is the related epic for migrating footnote markers to use "explicit numbering", and task T383769 is the finalizing work that triggered the incident.

Actionables

  • TODO: Update the PCS service (mobileapps) to no longer keep local copies of styles, and instead update styles dynamically through ResourceLoader. gerrit:1082532
  • TODO: Add and enhance online integration tests in the mobileapps service.
  • TODO: Add alerting if tests start failing.


Scorecard

Incident Engagement ScoreCard
Category | Question | Answer (yes/no) | Notes
People | Were the people responding to this incident sufficiently different than the previous five incidents? | N/A |
People | Were the people who responded prepared enough to respond effectively? | Y |
People | Were fewer than five people paged? | N/A |
People | Were pages routed to the correct sub-team(s)? | N/A |
People | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | N/A |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | N/A |
Process | Was a public wikimediastatus.net entry created? | N |
Process | Is there a phabricator task for the incident? | Y |
Process | Are the documented action items assigned? | N |
Process | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | Y |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | N | phab:T369435
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? | Y |
Tooling | Did existing monitoring notify the initial responders? | N |
Tooling | Were the engineering tools that were to be used during the incident, available and in service? | Y |
Tooling | Were the steps taken to mitigate guided by an existing runbook? | N |
Total score (count of all “yes” answers above) | 5