Incidents/2023-02-22 wiki outage

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-02-22 wiki outage
Start: 2023-02-22 09:18
End: 2023-02-22 09:45
Task: none filed
People paged: 2
Responder count: 2
Coordinators: Filippo Giunchedi
Affected metrics/SLOs: none listed
Impact: For approximately 18 minutes, around 17% of incoming, non-multimedia Wikimedia traffic received a 503 or 500 error or no response at all (most affected requests came from clients geolocated to eqiad and esams and went through our cache layer: wikis, Phabricator, Grafana, ...)

During routine maintenance to upgrade HAProxy on cache hosts, all of the backends (ATS) in the text cache cluster in esams and eqiad were accidentally depooled, due to a mismatch in the maintenance command between the depool step (which depooled every service on each host, including ats-be) and the repool step (which repooled only the cdn service). This caused both cached and uncached requests for wikis and other ATS-backed services to fail and return errors to clients, mostly in parts of Europe, Africa and Asia. Approximately 17 million HTTP requests (according to Varnish), or about 5 million user requests (according to NEL estimates), errored out in total. The editing rate dropped to less than half. The upload cluster, clients geolocated to drmrs, codfw, ulsfo or eqsin, and GET requests served from the in-memory cache were not affected.
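
To illustrate the mismatch, here is a hypothetical reconstruction based on the maintenance command in the timeline below and the repool logged at 09:34, not a verbatim transcript. With the conftool-backed pool/depool wrappers, a bare depool acts on every service the host advertises, while pool cdn repools only the cdn (HAProxy) service, so ats-be stays depooled:

  # illustrative sketch, not the exact commands run
  depool        # no service argument: depools all services on the host, including ats-be
  # ... haproxy upgrade, puppet run, haproxy restart ...
  pool cdn      # repools only the cdn (haproxy) service; ats-be is left depooled

  # Recovery (as logged by logmsgbot at 09:34): repool ats-be cluster-wide via confctl
  sudo confctl select 'dc=esams,service=ats-be,cluster=cache_text' set/pooled=yes
  sudo confctl select 'dc=eqiad,service=ats-be,cluster=cache_text' set/pooled=yes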

Timeline

08:57 vgutierrez@cumin1001:~$ sudo -i cumin -b1 -s60 'A:cp-text_esams' 'depool && sleep 30 && DEBIAN_FRONTEND=noninteractive apt-get -q -y --assume-no -o DPkg::Options::="--force-confdef" install haproxy && run-puppet-agent -q && systemctl restart haproxy && sleep 5 && pool cdn' # note the mismatch between depool and pool cdn

09:18 esams fully depooled. Outage starts here

09:20 pages start rolling in

09:24 <hashar> I am going to rollback to rule out the train

09:25 Updated https://www.wikimediastatus.net/

09:30  <vgutierrez> esams isn't able to reach appservers-ro or api-ro for some reason

09:31 <hashar> (train rolled back)

09:34 <logmsgbot> !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=esams,service=ats-be,cluster=cache_text

<logmsgbot> !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=ats-be,cluster=cache_text

09:35 <vgutierrez> I basically depooled ats-be in eqiad and esams by accident

09:36 Outage stops here

09:45 Incident declared resolved

09:48 Updated status page

Detection

Automated alerts / pages fired (FrontendUnavailable)

  • FrontendUnavailable (cache_text)
  • FrontendUnavailable (varnish-text)
  • [5x] ProbeDown (probes/service eqiad)

Conclusions

What went well?

  • Automated alerts fired as expected
  • Oncall was engaged quickly, other folks joined the investigation too
  • A train deployment was suspected as the cause and quickly rolled back

What went poorly?

  • Lots of unrelated activity was happening at the time of the outage (e.g. the ongoing train deployment), which made it harder to pinpoint the cause
  • Alerting pages were not very descriptive about what was actually failing
  • logmsgbot seemingly didn't log the manual repools at 09:34 to SAL

Where did we get lucky?

  • Lots of folks online to help debug

Links to relevant documentation

Actionables

  • Update tunnelencabulator; some SREs had trouble accessing graphs during the outage
    • https://github.com/cdanis/tunnelencabulator/pull/6
  • T330272 Provide a cookbook to perform HAProxy upgrades on CDN nodes (a rough sketch of the sequence such a cookbook could enforce follows this list)
  • T330405 Improve FrontendUnavailable alerts with more information/context about what's failing
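
As a starting point for T330272, a rough sketch of the per-host sequence such a cookbook could enforce. This is hypothetical; the key change from the command in the timeline is that the depool and pool steps are kept symmetric (either both bare, as here, or both scoped to the same service):

  # hypothetical per-host sequence, to be wrapped in a cookbook
  depool && sleep 30 \
    && DEBIAN_FRONTEND=noninteractive apt-get -q -y -o DPkg::Options::="--force-confdef" install haproxy \
    && run-puppet-agent -q \
    && systemctl restart haproxy \
    && sleep 5 \
    && pool       # bare pool matches the bare depool above, so ats-be is repooled too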

Scorecard

Incident Engagement ScoreCard
(Answers are yes/no; no notes were recorded.)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? no
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 13