Incidents/2023-01-09 API errors - db1143

From Wikitech

document status: final

Summary

Incident metadata (see Incident Scorecard)
Incident ID 2023-01-09 API errors - db1143 Start 2023-01-09 21:04:00
Task T326590 End 2023-01-09 21:14:00
People paged 2 Responder count 4
Coordinators dzahn Affected metrics/SLOs API appservers
Impact elevated 5xx, API errors, wbsgetsuggestions failed to return suggestions on Wikidata


Effect for users

  • We served elevated 5xx to end users for about 10 minutes, between 21:04 and 21:14
  • Wikidata users noticed they could not get suggestions anymore and created T326590

Probable cause / chain of events

  1. Someone edited a page or refreshlinks on any of the wikis.
  2. This triggers public static function onLinksUpdateComplete which tries to see if a commons file usage has changed.
  3. A database server in cluster S4 which hosts commons, gets overloaded with 300 queries that are not slow individually but overwhelm the system.
  4. Hundreds of queries of the type SELECT * MediaWiki\Extension\GlobalUsage\GlobalUsage::getLinksFromPage ... are created.
  5. The number of queries on s1 (en.wp) drops.
  6. The affected server db1143 gets depooled, things recover almost instantly.
  7. A massive spike of reads happens on the s1 database cluster, which hosts en.wikipedia, due to the recovery.
  8. A spike of POST traffic to Parsoid happened after after api-appserver recovered, which then saturated parsoid.

Graphs

5xx spike on appservers


Timeline

21:02 - first alerts show up on IRC, bot notifies about 'High average GET latency on API appservers'

21:02 - an alerting page was sent for "ProbeDown (ip4_api-https) (3215)". It was ACKed within the same minute (46241)

21:03 - an alerting page was sent for: "PHPFPMTOOBusy api_appserver (3216)"

21:04 - An incident was opened.  dzahn becomes IC.

21:07 - dzahn sends manual page to SRE via Klaxon to ask for help.

21:09 - DBA marostegui joins and immediately identifies db1143 as non-response, depools it from service

21:10 - A shower of recoveries starts almost instantly, traffic recovers.

21:10 - https://phabricator.wikimedia.org/T326590 is created by users noticing broken suggestions service on Wikidata.

21:12 - recovery page is sent for event 3215

21:13 - recovery page is sent for event 3216

21:17 - recovery page is sent for event 3218 *ipv4 probes prometheus eqiad*, api-https has failed probes

21:18 - recovery page is sent for event 3219, *PHPFPMTooBusy parsoid*

21:24 - recovery page is sent for event 3220 *MariaDB Replica Lag: s4*

Detection

alerting via icinga/alertmanager/victorops - SMS was sent to SREs on on-call duty shift. manual escalation via Klaxon to more SREs.

Conclusions

Maybe onLinksUpdateComplete should not always lok for commons file usage changes or have some throttle.

What went well?

  • alerting worked, reaction time was fast
  • DBA was available and able to identify and depool the relevant server quickly which almost instantly solved the issue

What went poorly?

  • We got more recovery pages than original alert pages or at least some of us and we are not sure why.

Where did we get lucky?

  • That this doesn't happen more often.

Links to relevant documentation

Actionables

Scorecard

Incident Engagement ScoreCard
Question Answer

(yes/no)

Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? yes
Were the people who responded prepared enough to respond effectively yes
Were fewer than five people paged? yes
Were pages routed to the correct sub-team(s)? no
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. yes 2 on-callers but they escalated to SRE via klaxon
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no but the incident just lasted ~10 minutes, so that was ok
Was a public wikimediastatus.net entry created? yes but only retroactively. we did not know if it was an attack at first
Is there a phabricator task for the incident? yes kind of, https://phabricator.wikimedia.org/T326590 was created and was caused by it but is about one specific effect of it and user created
Are the documented action items assigned? n/a
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are

open tasks that would prevent this incident or make mitigation easier if implemented.

yes There might be a ticket about onLinksUpdateComplete but this needs input from mw developers.
Were the people responding able to communicate effectively during the incident with the existing tooling? yes
Did existing monitoring notify the initial responders? yes
Were the engineering tools that were to be used during the incident, available and in service? yes
Were the steps taken to mitigate guided by an existing runbook? no it was using dbctl, not sure if "db server is overloaded -> depool" counts as runbook
Total score (count of all “yes” answers above)