Incidents/2022-12-09 MediaWiki REST API

document status: draft

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2022-12-09 MediaWiki REST API
  • Start: 2022-12-06 12:27:18
  • End: 2022-12-09 02:39:00
  • Task: T324801
  • People paged: 14
  • Responder count: 5
  • Coordinators: Legoktm
  • Affected metrics/SLOs:
  • Impact: Visual diffs would have shown no changes (common feature, ~100,000 users of the beta feature on enwiki), and editing old revisions using the VisualEditor would not have worked (rare)

A change in the MediaWiki REST API caused requests for old revisions to serve the content of the current revision instead. It was noticed because visual diffs were all indicating that no changes had happened.
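
For illustration, here is a minimal sketch (Python, using the requests library; not a check anyone ran during the incident) of what the broken behavior looked like from a client's perspective: fetch the HTML of an old revision and of the current revision via the core v1/revision/{id}/html endpoint referenced later in this report, and flag the case where they come back identical. The wiki domain and revision IDs below are placeholders.

  # Sketch only: probe whether an old revision's HTML is being served with the
  # current revision's content. WIKI, OLD_REV and CURRENT_REV are placeholders.
  import requests

  WIKI = "https://en.wikipedia.org"   # any affected wiki
  OLD_REV = 1000000000                # placeholder: an *old* revision ID of some page
  CURRENT_REV = 1000000001            # placeholder: that page's current revision ID

  def revision_html(rev_id: int) -> str:
      """Fetch rendered HTML for a revision from the core REST API."""
      resp = requests.get(f"{WIKI}/w/rest.php/v1/revision/{rev_id}/html", timeout=30)
      resp.raise_for_status()
      return resp.text

  if __name__ == "__main__":
      old_html = revision_html(OLD_REV)
      current_html = revision_html(CURRENT_REV)
      if old_html == current_html:
          print("Suspicious: the old revision was served with the current revision's content")
      else:
          print("OK: old and current revisions render differently")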

Timeline

All times in UTC.

2022-12-06

  • 12:27 MediaWiki 1.40.0-wmf.13 deployment starts to group0 (problems begin)

2022-12-08

2022-12-09

  • 01:55 Arlolra alerts Transformers group chat that there's an active UBN
  • 02:04 Legoktm uses #page in the #mediawiki_security channel and then uses Klaxon
  • 02:08 cwhite, legoktm, TheresNoTime start discussing in #wikimedia-operations
  • 02:08 cscott recommends a train rollback on Phabricator
  • 02:17 cwhite begins train rollback from group2 to group1
  • 02:39 train rollback complete (problems resolved on group2 wikis, but continue on group0 and group1 wikis)
  • 04:15 ssastry identifies the culprit via git bisect on Phabricator
  • 04:52 ssastry uploads initial patch https://gerrit.wikimedia.org/r/866527 (PS2 uploaded at 05:15, 7 minutes after CI fails)
  • 08:14 dkinzler C+2s
  • 08:37 merged to master
  • 10:41 ladsgroup cherry-picks to 1.40.0-wmf.13
  • 10:51 ladsgroup: backport deployed to testservers (SAL)
  • 11:00 initial testing by ladsgroup on group1 itwiki (still running wmf.13) is inconclusive due to caching
  • 11:02 ladsgroup completes the deploy after resolving the test issue (fix is now live on group0 and group1) (SAL)
  • 13:13 hashar rolls group2 forward to wmf.13

Detection

The issue was detected manually by a human; no automated monitoring alerted on it.

Conclusions

What went well?

  • The train rollback resolved the issue on group2 wikis, and the rollback itself caused no immediate issues or regressions

What went poorly?

  • The bug made it all the way to group2 before being noticed by a human
    • PHPUnit tests for the ParsoidHandler::wt2html() method in MW core existed, but they were using the current revision ID.
    • Mocha tests for the page/html/{title}/{revid} endpoint in the Parsoid extension existed, but they were using the current revision ID.
    • No tests exist for the visual diff feature in VisualEditor.
    • The visual diff feature in VisualEditor isn't accessed frequently enough in group0/group1 to facilitate earlier detection.
  • The train rollback took 22 minutes; updating wikiversions for a rollback should be near-immediate
  • The bug was detected fairly late in the evening US time. The author of the root-cause patch was offline (European time), and Content Transform Team members had to scramble and cancel or rearrange dinner and evening plans.

Where did we get lucky?

  • The train rollback was safe. Had there been a risky (non-revertible) change, more time would have been needed to identify the problematic commit and selectively revert it.
  • No regressions from the Parsoid version downgrade, which was an untested scenario. After the rollback, RESTBase will continue to serve 2.7.0-version content (for titles where that version had been stored), whereas Parsoid (after rollback) doesn't know about version 2.7.0. There was a concern that the REST API would reject Parsoid HTML v2.7.0 when submitted for saves, but that didn't manifest. It would be useful to document and test this rollback behavior more thoroughly (see the sketch after this list).
  • There's a similar possible rollback issue with ParserCache changes, and Subbu has been working on changes to the representation of Table of Contents information in the ParserCache. These rollback paths have more CI support for ensuring backward compatibility, but /forward/ compatibility of parser cache contents is only enforced by using a disciplined patch sequence which is (AFAIK) undocumented. In the worst-case scenario, we would have discovered that an unrelated ParserCache migration also prevented a clean rollback.
  • The Content Transform Team has an out-of-band Signal chat for emergencies, but it is probably not useful for waking members up in the night. The incident coordinator, legoktm, was on that channel, having participated in a previous year's offsite, and so was able to use it to quickly contact the Content Transform Team and coordinate a response. If the incident had occurred a few hours later, both the North American and European responders might have been asleep.
  • Subbu was able to work late into his evening and prepare an initial fix, which was then reviewed and deployed early in the morning, European time.
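
As a starting point for the "document and test this rollback behavior" note above, here is a hedged sketch (Python, requests) that compares the Parsoid HTML spec version advertised by RESTBase's stored content with the version advertised by a fresh render through the core REST API. The endpoint paths, the page title, and the assumption that both responses expose a Specs/HTML profile in their content-type header are illustrative, not a description of an existing check.

  # Sketch only: compare the Parsoid HTML content version served by RESTBase
  # (stored content) with the version produced by a fresh render, to spot the
  # kind of version skew a rollback can introduce. Paths/title are placeholders.
  import re
  import requests

  WIKI = "https://en.wikipedia.org"
  TITLE = "Example"  # placeholder page title

  def html_spec_version(url: str) -> str:
      """Extract the Specs/HTML version from a response's content-type profile, if present."""
      resp = requests.get(url, timeout=30)
      resp.raise_for_status()
      match = re.search(r"Specs/HTML/([\d.]+)", resp.headers.get("content-type", ""))
      return match.group(1) if match else "unknown"

  if __name__ == "__main__":
      stored = html_spec_version(f"{WIKI}/api/rest_v1/page/html/{TITLE}")   # RESTBase (stored)
      fresh = html_spec_version(f"{WIKI}/w/rest.php/v1/page/{TITLE}/html")  # fresh render via core
      print(f"stored={stored} fresh={fresh}")
      if stored != fresh:
          print("Version skew: verify that content of the stored version is still accepted on save")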

Links to relevant documentation

  • It seems like there should be a guide that says (e.g.) "the visual diff component is powered by RESTBase" -> "RESTBase uses the Parsoid core API; problems with RESTBase are likely caused by either RESTBase itself (deploy log here) or the Parsoid core API" -> "problems with the Parsoid core API are likely caused by either the API code (mediawiki-core:includes/Rest) or the Parsoid library (deploy log here); in either case a rollback will fix the problem". That would allow an SRE to determine (a) that there hadn't been a recent RESTBase deploy, so RESTBase was an unlikely cause, and therefore (b) that a rollback was the best option.
  • Note that the incident was caused by a refactoring that aims to cut RESTBase out of the flow described above. In the future, visual diff should use the v1/revision/{id}/html endpoint in core; it could probably even start doing this right now (a sketch follows this list).
  • …
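
To make the second bullet concrete, here is a hedged sketch of what fetching visual-diff inputs directly from core could look like: retrieve the rendered HTML of the two revisions being compared from v1/revision/{id}/html, with no RESTBase in the path. This is only an illustration in Python; the actual VisualEditor integration would be client-side JavaScript, and the revision IDs and domain are placeholders.

  # Sketch only: gather the two HTML documents a visual diff needs directly from
  # the core REST API, bypassing RESTBase. Revision IDs/domain are placeholders.
  import requests

  WIKI = "https://en.wikipedia.org"
  REV_FROM = 1000000000  # placeholder: "from" revision of the diff
  REV_TO = 1000000001    # placeholder: "to" revision of the diff

  def fetch_revision_html(rev_id: int) -> str:
      resp = requests.get(f"{WIKI}/w/rest.php/v1/revision/{rev_id}/html", timeout=30)
      resp.raise_for_status()
      return resp.text

  if __name__ == "__main__":
      html_from = fetch_revision_html(REV_FROM)
      html_to = fetch_revision_html(REV_TO)
      # The diff itself would be computed by VisualEditor's visual diff code;
      # the point here is only that both inputs are available without RESTBase.
      print(len(html_from), len(html_to))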


Actionables

  • Automated testing to verify behavior of requesting HTML of old revisions from the REST API (https://gerrit.wikimedia.org/r/866621, https://gerrit.wikimedia.org/r/866622); a sketch of the shape of such a check follows this list
    • Should examine why/how Parsoid's "old" api-testing pathways regressed, which would/should have covered this code. Are there other tests from the old api-testing suite which have similarly vanished?
      • The test for retrieving revision HTML by revision ID existed and covered this path, but the revision ID it used was the current revision. We never had a test specifically for an old revision. This has now been added. daniel (talk) 16:54, 15 December 2022 (UTC)
  • There are alternative, 'easier' ways to do rollbacks now. Update the documentation.
  • Is a 22-minute sync-wikiversions expected? No.
  • Documentation to make it clearer that rollback was the correct solution to this problem.
  • Something around testing forward- and backward-compatibility of Parsoid/RESTBase version downgrades
  • Something around testing forward-compatibility of ParserCache changes
    • The documentation doesn't just need to exist, it needs to be discoverable in the right spot, probably in a place that developers would touch when trying to implement backwards compatibility.
  • Is there some way to enhance the visibility of breakages in the "visual diff" feature, to make it more likely regressions in the feature will be caught during group0 or group1 rollout?
    • It would be nice to have Selenium tests for this feature. Selenium tests are slow, brittle, and tricky to set up locally. It would help if we could streamline this.
  • Reduce the gap between the UBN being filed and the relevant team springing into action
    • Platform Engineering currently doesn't have a notification mechanism for UBNs. A bot posting to Slack would be helpful.
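
For the first actionable, the linked Gerrit changes contain the real tests. Purely to show the shape of such a check (and explicitly not the code in those patches), here is a hedged sketch that pins an old revision ID rather than the current one, which is the gap the earlier tests had. In MediaWiki's own suites this would live in PHPUnit or the Mocha-based api-testing framework; Python is used here only for illustration, and the wiki domain and revision IDs are placeholders.

  # Sketch only: an API-level check that requests an *old* revision's HTML and
  # asserts it differs from the current revision's HTML.
  import requests

  WIKI = "https://en.wikipedia.org"
  OLD_REV = 1000000000      # placeholder: an old revision of a page that has since changed
  CURRENT_REV = 1000000001  # placeholder: that page's current revision

  def get_html(rev_id: int) -> str:
      resp = requests.get(f"{WIKI}/w/rest.php/v1/revision/{rev_id}/html", timeout=30)
      resp.raise_for_status()
      return resp.text

  def test_old_revision_html_is_not_current() -> None:
      assert get_html(OLD_REV) != get_html(CURRENT_REV), (
          "Requesting an old revision returned the current revision's content"
      )

  if __name__ == "__main__":
      test_old_revision_html_is_not_current()
      print("ok")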

Scorecard

Incident Engagement ScoreCard (each question is answered yes/no; notes follow in parentheses)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? no (batphone, because the incident was after business hours)
  • Were pages routed to the correct sub-team(s)? no (all SRE were paged; the Content Transform Team was not paged)
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. no

Process
  • Was the incident status section actively updated during the incident? no (this wiki page was created before the incident was resolved, but the IC, Kunal, didn't have the access needed to create an incident status Google Doc)
  • Was the public status page updated? no
  • Is there a phabricator task for the incident? yes (T324801)
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? no (see above regarding usage of Signal and #page in #mediawiki_security)
  • Did existing monitoring notify the initial responders? no (the issue was manually detected by a human)
  • Were the engineering tools that were to be used during the incident available and in service? yes (Phabricator, Klaxon, and scap sync-wikiversions all worked)
  • Were the steps taken to mitigate guided by an existing runbook? yes (Heterogeneous_deployment/Train_deploys#Rollback)

Total score (count of all "yes" answers above): 7