Incidents/2023-01-17 MediaWiki


document status: final

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-01-17 MediaWiki
Start: 2023-01-17 18:31:11
End: 2023-01-17 18:44:23
Task: T327196
People paged: Sukhbir Singh
Responder count: 2 volunteer sysadmins, 1 SWE, plus some SREs arriving later
Coordinators: Taavi (de facto)
Affected metrics/SLOs: not sure
Impact: For roughly 15 minutes, all wikis were unreachable for logged-in users and non-cached pages.

An inconsistent state of deployed code during a backport deployment caused MediaWiki to crash for all logged-in page views.

The root cause was MediaWiki re-reading and applying changes to extension.json before a php-fpm restart had picked up the corresponding changes to the PHP code.
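As a rough illustration of that race, here is a minimal Python sketch (not actual MediaWiki or scap code; all names are invented for this example). It models an app server where the extension registry is re-read from disk on every request while the running PHP code is only replaced by a php-fpm restart, so a file sync opens a window in which the two disagree:

  # Illustrative sketch only; not MediaWiki or scap code.
  from dataclasses import dataclass

  @dataclass
  class AppServer:
      registry_version: int  # extension.json, re-read on every request
      code_version: int      # PHP code, frozen until php-fpm restarts

  def sync_files(server: AppServer, new_version: int) -> None:
      # The file sync lands everything on disk, but only the registry
      # is picked up immediately by running requests.
      server.registry_version = new_version

  def restart_php_fpm(server: AppServer, new_version: int) -> None:
      # Only after the restart does the running code match the disk.
      server.code_version = new_version

  server = AppServer(registry_version=1, code_version=1)
  sync_files(server, 2)
  # Inconsistency window: the new extension.json can reference hooks or
  # classes that the still-running old code does not define, causing fatals.
  assert server.registry_version != server.code_version
  restart_php_fpm(server, 2)
  assert server.registry_version == server.code_version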

Timeline

All times in UTC.

  • 18:16 A volunteer starts backporting five patches (three config changes, one core change, and one DiscussionTools change)
  • 18:31 The patches are deployed to the canaries. Some errors are returned, but the canary checks do not stop the deployment from moving forward, presumably because of the short window between the file sync and the php-fpm restarts (see the sketch after this timeline).
  • 18:33 Sync to the entire cluster starts. OUTAGE BEGINS
  • 18:34:51 First user report on IRC
  • 18:34:55 First automated alert: <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2.651e+04 gt 100
  • 18:35:42 The backport is cancelled by the deployer during the k8s sync phase, before the php-fpm restarts had started. This leaves the cluster in an inconsistent state, with the code on disk not matching the code running, until the revert is deployed.
  • 18:37:43 First paging alert: <+jinxer-wm> (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page
  • 18:39 Revert sync starts
  • 18:48 Revert sync is done OUTAGE ENDS
  • ~19:30 A revised patch is synced successfully
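The 18:31 canary step is where this class of fault could in principle be caught. The sketch below is a hypothetical canary gate in Python (the function names, threshold, and timeout are assumptions for illustration, not scap's actual canary logic): unless the error-rate sample is taken after php-fpm has restarted on the canaries, a fault introduced by the restart ordering is simply not visible yet when the check runs.

  # Hypothetical canary gate; illustrative only, not scap's implementation.
  import time
  from typing import Callable

  def canary_gate(fatals_per_minute: Callable[[], float],
                  restarts_done: Callable[[], bool],
                  threshold: float = 100.0,
                  timeout: float = 300.0) -> bool:
      """Return True if the deployment may proceed past the canaries."""
      deadline = time.monotonic() + timeout
      # Wait until the canaries are actually running the new code.
      while not restarts_done():
          if time.monotonic() > deadline:
              return False  # fail closed if the restarts never finish
          time.sleep(5)
      # Only now is the error rate meaningful for this class of fault.
      return fatals_per_minute() < threshold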

Detection

Humans and automated alerts detected the issue quickly, and the alert volume was manageable.

Conclusions

What went well?

  • The issue was detected early and the offending patch was quickly isolated
    • Technically, in this case time-to-recovery would have been faster if the backport had not been cancelled. In general, however, the author believes it is a good thing that the deployer noticed the issue quickly and that the first instinct was to deploy a revert, since that is the fastest path to recovery in most cases of faulty patches in this process.
  • The revert worked as is

What went poorly?

  • Canary checks did not catch the issue
  • Because of the scap backport workflow, deployers no longer think about sync order the way scap sync-file forced them to (see the sketch after this list).
    • Developers and deployers did not expect that PHP code and other files (such as extension.json) would not be synced at exactly the same time.
  • php-fpm restarts take a long time to complete, and the Kubernetes deploys happen between the file sync and the php-fpm restarts
  • wikimediastatus.net was not updated
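On the sync-order point, the contrast can be sketched as follows (Python pseudocode standing in for the deployment tooling; this is not how scap is implemented, and the file names would be placeholders). A per-file workflow such as scap sync-file made the deployer choose an explicit order and allowed restarting between steps, whereas a whole-patch backport lands all files at once and restarts php-fpm only afterwards, which is exactly the window described in the root cause above.

  # Sketch of the workflow contrast; illustrative only, not scap code.
  from typing import Iterable

  def sync_file(path: str) -> None:
      print(f"sync {path}")            # placeholder for the real file sync

  def restart_php_fpm() -> None:
      print("restart php-fpm fleet")   # placeholder for the real restart

  def per_file_deploy(ordered_files: Iterable[str]) -> None:
      # Deployer-chosen order: e.g. PHP code first, registry files last,
      # restarting in between so the running code never lags the registry.
      for path in ordered_files:
          sync_file(path)
          restart_php_fpm()

  def backport_deploy(files: Iterable[str]) -> None:
      # Everything lands on disk together; the restart happens last,
      # opening the inconsistency window.
      for path in files:
          sync_file(path)
      restart_php_fpm()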

Where did we get lucky?

  • Not sure; I think everything went as expected

Links to relevant documentation

Actionables

  • TODO: can canary checks detect this issue?

Scorecard

Incident Engagement ScoreCard (answers are yes/no, with notes where relevant)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes (responders were deployers, not SREs)
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes (pages were only sent out right before mitigating action)
  • Were pages routed to the correct sub-team(s)? yes (the page was not routable to sub-teams because it originated in the frontend; basically the sub-team was everyone)
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes (no incident doc was created because it was not necessary to create one)
  • Was a public wikimediastatus.net entry created? no (responders did not have access to, or training on, wikimediastatus.net)
  • Is there a Phabricator task for the incident? yes (created by the community)
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. no (Serve production traffic via Kubernetes: https://phabricator.wikimedia.org/T290536)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? yes (Backport_windows/Deployers#Reverting, although frequent deployers are usually familiar with the process and don't use the documentation)

Total score (count of all “yes” answers above): 13