Incidents/2023-01-17 MediaWiki
document status: final
Summary
Incident ID | 2023-01-17 MediaWiki | Start | 2023-01-17 18:31:11 |
---|---|---|---|
Task | T327196 | End | 2023-01-17 18:44:23 |
People paged | Sukhbir Singh | Responder count | 2 volunteer sysadmins, 1 SWE, plus some SREs arriving later |
Coordinators | Taavi (de facto) | Affected metrics/SLOs | not sure |
Impact | For roughly 15 minutes, all wikis were unreachable for logged-in users and non-cached pages. |
An issue with inconsistent state of deployed code during a backport deployment caused MediaWiki to crash for all logged-in page views.
The root cause of this issue was MediaWiki re-reading and applying changes to extension.json before a php-fpm restart would have picked up changes to the PHP code.
Timeline
All times in UTC.
- 18:16 A volunteer starts to backport five patches (3 config changes, one core change and one DiscussionTools change)
- 18:31 The patches get deployed to the canaries. Some errors are returned, although the canary checks don't prevent the patch from moving forwards presumably due to the short time between the file sync and php-fpm restarts.
- 18:33 Sync to the entire cluster starts. OUTAGE BEGINS
- 18:34:51 First user report on IRC
- 18:34:55 First automated alert: <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2.651e+04 gt 100
- 18:35:42 Backport is cancelled by the deployer. This was during the k8s sync phase, before php-fpm restarts started. This leaves the cluster in an inconsistent state between code on disk and code running until the revert is deployed.
- 18:37:43 First paging alert: <+jinxer-wm> (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page
- 18:39 Revert sync starts
- 18:48 Revert sync is done OUTAGE ENDS
- ~19:30 A revised patch is synced successfully
Detection
Humans and automated alerts detected the issue quickly, and the alert volume was manageable.
Conclusions
What went well?
- The issue was detected early and the offending patch could be isolated early
- Technically in this case time-to-recovery would have been faster if the backport would not have been cancelled. However, in general, the author believes that it is a good thing that the deployer noticed the issue quickly and the first instinct was to deploy a revert, since that is the fastest way to recovery in most cases of faulty patches in this process.
- The revert worked as is
What went poorly?
- Canary checks did not catch the issue
- Due to the scap backport workflow, deployers don't think about sync order like scap sync-file forced them to do so.
- Developers and deployers did not expect that PHP code and other files (extension.json) would not be synced exactly at the same time.
- php-fpm restarts take ages to complete, and the Kubernetes deploys happen in between the file sync and php-fpm restarts
- wikimediastatus.net was not updated
Where did we get lucky?
- not sure, I think everything went as expected
Links to relevant documentation
Actionables
- TODO: can canary checks detect this issue?
Scorecard
Question | Answer
(yes/no) |
Notes | |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | responders were deployers, not SREs |
Were the people who responded prepared enough to respond effectively | yes | ||
Were fewer than five people paged? | yes | pages were only sent out right before mitigating action | |
Were pages routed to the correct sub-team(s)? | yes | The page was not routable to sub-teams because it originated in the frontend
(Basically the sub-team was everyone) | |
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes | ||
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | no incident doc was created because it was not necessary to create one |
Was a public wikimediastatus.net entry created? | no | responders did not have access or training on wikimediastatus.net usage | |
Is there a phabricator task for the incident? | yes | created by the community | |
Are the documented action items assigned? | yes | ||
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | ||
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are
open tasks that would prevent this incident or make mitigation easier if implemented. |
no | Serve production traffic via Kubernetes: https://phabricator.wikimedia.org/T290536 |
Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | ||
Did existing monitoring notify the initial responders? | yes | ||
Were the engineering tools that were to be used during the incident, available and in service? | yes | ||
Were the steps taken to mitigate guided by an existing runbook? | yes | Backport_windows/Deployers#Reverting, although frequent deployers usually are familiar with the process and don't use the documentation | |
Total score (count of all “yes” answers above) | 13 |