Incidents/2023-02-22 wiki outage

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-02-22 wiki outage
Start: 2023-02-22 09:18
End: 2023-02-22 09:45
Task: none filed
People paged: 2
Responder count: 2
Coordinators: Filippo Giunchedi
Affected metrics/SLOs: none listed
Impact: For approximately 18 minutes, around 17% of incoming, non-multimedia Wikimedia traffic received a 503 or 500 error or no response at all (most affected requests came from clients geolocated to eqiad and esams and went through our cache layer: wikis, Phabricator, Grafana, ...)

During routine maintenance to upgrade HAProxy on cache hosts, all of the backends (ATS) in the text cache cluster in esams and eqiad were accidentally depooled, due to a mismatch in the maintenance command between the depool step (which depooled every service on each host, including ats-be) and the repool step (which repooled only the cdn service). This caused both cached and uncached requests for wikis and other ATS-backed services to fail and return errors to clients, mostly in parts of Europe, Africa and Asia. Approximately 17 million HTTP requests (according to Varnish), or about 5 million user requests (according to NEL estimates), errored out in total. The editing rate dropped to less than half. The upload cluster, clients geolocated to drmrs, codfw, ulsfo or eqsin, and GET requests served from the in-memory cache were not affected.
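
To illustrate the mismatch, here is a hypothetical reconstruction based on the maintenance command in the timeline below and the repool logged at 09:34, not a verbatim transcript. With the conftool-backed pool/depool wrappers, a bare depool acts on every service the host advertises, while pool cdn repools only the cdn (HAProxy) service, so ats-be stays depooled:

  # illustrative sketch, not the exact commands run
  depool        # no service argument: depools all services on the host, including ats-be
  # ... haproxy upgrade, puppet run, haproxy restart ...
  pool cdn      # repools only the cdn (haproxy) service; ats-be is left depooled

  # Recovery (as logged by logmsgbot at 09:34): repool ats-be cluster-wide via confctl
  sudo confctl select 'dc=esams,service=ats-be,cluster=cache_text' set/pooled=yes
  sudo confctl select 'dc=eqiad,service=ats-be,cluster=cache_text' set/pooled=yes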

Timeline

08:57 vgutierrez@cumin1001:~$ sudo -i cumin -b1 -s60 'A:cp-text_esams' 'depool && sleep 30 && DEBIAN_FRONTEND=noninteractive apt-get -q -y --assume-no -o DPkg::Options::="--force-confdef" install haproxy && run-puppet-agent -q && systemctl restart haproxy && sleep 5 && pool cdn' # note the mismatch between depool and pool cdn

09:18 esams fully depooled. Outage starts here

09:20 pages start rolling in

09:24 <hashar> I am going to rollback to rule out the train

09:25 Updated https://www.wikimediastatus.net/

09:30  <vgutierrez> esams isn't able to reach appservers-ro or api-ro for some reason

09:31 <hashar> (train rolled back)

09:34 <logmsgbot> !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=esams,service=ats-be,cluster=cache_text

<logmsgbot> !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=ats-be,cluster=cache_text

09:35 <vgutierrez> I basically depooled ats-be in eqiad and esams by accident

09:36 Outage stops here

09:45 Incident declared resolved

09:48 Updated status page

Detection

Automated alerts / pages fired (FrontendUnavailable)

  • FrontendUnavailable (cache_text)
  • FrontendUnavailable (varnish-text)
  • [5x] ProbeDown (probes/service eqiad)

Conclusions

What went well?

  • Automated alerts fired as expected
  • Oncall was engaged quickly, other folks joined the investigation too
  • A train deployment was suspected as the cause and quickly rolled back

What went poorly?

  • Lots of unrelated activity was happening at the time of the outage (e.g. the ongoing train deployment), which made it harder to pinpoint the cause
  • Alerting pages were not very descriptive about what was actually failing
  • logmsgbot seemingly didn't log the manual repools at 09:34 to SAL

Where did we get lucky?

  • Lots of folks online to help debug

Links to relevant documentation

Actionables

  • Update tunnelencabulator; some SREs had trouble accessing graphs during the outage
    • https://github.com/cdanis/tunnelencabulator/pull/6
  • T330272 Provide a cookbook to perform HAProxy upgrades on CDN nodes (a rough sketch of the sequence such a cookbook could enforce follows this list)
  • T330405 Improve FrontendUnavailable alerts with more information/context about what's failing
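
As a starting point for T330272, a rough sketch of the per-host sequence such a cookbook could enforce. This is hypothetical; the key change from the command in the timeline is that the depool and pool steps are kept symmetric (either both bare, as here, or both scoped to the same service):

  # hypothetical per-host sequence, to be wrapped in a cookbook
  depool && sleep 30 \
    && DEBIAN_FRONTEND=noninteractive apt-get -q -y -o DPkg::Options::="--force-confdef" install haproxy \
    && run-puppet-agent -q \
    && systemctl restart haproxy \
    && sleep 5 \
    && pool       # bare pool matches the bare depool above, so ats-be is repooled too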

Scorecard

Incident Engagement ScoreCard
(Answers are yes/no; no notes were recorded.)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? no
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 13