Incidents/2023-04-17 eqiad/LVS

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: LVS
Start: 2023-04-17 14:25:00
End: 2023-04-17 14:36
Task: T334703
People paged: 2
Responder count: 9
Coordinators: Matthew
Affected metrics/SLOs:
Impact: For approximately 11 minutes, users accessing Wikimedia sites and services through our eqiad LVS received an error.

During a scap deploy of MediaWiki, one of the LVS nodes in eqiad was taken down to be reimaged. This resulted in many servers in eqiad being depooled and eqiad being unable to serve traffic. To reduce impact, eqiad was depooled for user-facing edge traffic until the affected servers could be repooled and normal operations restored. All scap deployments were blocked while the incident was handled.
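
For readers unfamiliar with the repool step: pooled/depooled state for the application-server clusters is managed with conftool. The sketch below is illustrative only; the confctl selectors and the Python wrapper are assumptions made for this write-up, not a record of the commands actually run during the incident.

  # Illustrative sketch only: repool the eqiad application-server clusters
  # named in the timeline below. The selectors are assumptions, not the
  # exact commands used during the incident.
  import subprocess

  CLUSTERS = ["appserver", "appserver_api", "parsoid"]

  def repool(cluster: str, dc: str = "eqiad") -> None:
      """Set pooled=yes for every host confctl selects in the cluster."""
      selector = f"dc={dc},cluster={cluster}"
      subprocess.run(
          ["sudo", "confctl", "select", selector, "set/pooled=yes"],
          check=True,
      )

  for cluster in CLUSTERS:
      repool(cluster)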

Timeline

All times in UTC.

  • 14:17 Scap backport begins SAL
  • 14:21 reimage of lvs1020 begins SAL
  • 14:25 Scap backport completes OK SAL
  • 14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% OUTAGE BEGINS
  • 14:30 Incident opened
  • 14:30 DNS updated to depool eqiad
  • 14:31 appserver, appserver_api, parsoid repooled in eqiad
  • 14:36 Most user traffic now diverted away from eqiad OUTAGE ENDS
  • 14:40 Global scap lock taken to prevent any further scap deploys until incident resolved and LVS maintenance complete
  • 14:52 Incident closed, cleanup ongoing
  • 15:18 confirmed that work on lvs1020 successfully completed
  • 15:18 eqiad repooled
  • 15:20 scap lock released

Graphs

HAProxy availability

HAProxy graph of the incident, showing impact on availability

Traffic throughput

Traffic throughput (from Varnish) from the incident

Error codes (ATS)

ATS errors from backend during the incident

Detection

The first sign of trouble was scap reporting errors to the person doing the scap deploy. Within a minute the first automatic error appeared:

14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100%

Two VO (VictorOps) paging alerts fired, at 14:27 and 14:28:

<jinxer-wm> (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
<jinxer-wm> (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable

Simultaneously, icinga reported many service records in *.svc.eqiad.wmnet as unavailable.
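
For context, the ProbeDown alert is driven by blackbox HTTP probes against the load-balanced service addresses; it fires when the probes stop getting healthy responses through LVS. A minimal stand-in for such a probe is sketched below; the target URL and timeout are placeholders, not the production probe configuration.

  # Minimal stand-in for the kind of HTTP probe behind ProbeDown.
  # The target URL is a placeholder; the real probes are defined in the
  # Prometheus / blackbox exporter configuration.
  import sys
  import requests

  TARGET = "https://api.svc.example.wmnet/healthz"  # placeholder URL
  TIMEOUT = 3  # seconds; real probe timeouts may differ

  def probe(url: str) -> bool:
      """Return True if the service answers with a non-5xx response."""
      try:
          return requests.get(url, timeout=TIMEOUT).status_code < 500
      except requests.RequestException:
          return False

  ok = probe(TARGET)
  print(f"{TARGET}: {'UP' if ok else 'DOWN'}")
  sys.exit(0 if ok else 1)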

Conclusions

What went well?

  • Automated alerting fired quickly
  • Diagnosis, and thus remediation, of the outage was rapid

What went poorly?

  • The scap lock was not widely known about (as it happens, nothing went wrong as a result of the delay in taking the lock out, but the risk was there)
  • We didn't learn enough from the recent near-miss incident to avoid a production outage

Where did we get lucky?

  • The recent near-miss (T334703) meant we knew almost immediately what the problem was, making it straightforward to restore service rapidly.
  • Timing meant both US and EMEA SREs were available to respond

Links to relevant documentation

  • n/a?

Actionables

  • Scap should document scap lock (DONE; see the conceptual lock sketch after this list)
  • Scap deploys and LVS maintenance cannot happen at the same time (I think T334703 covers this OK)
  • The "api_appserver average latency exceeded" alert fired late, when latency was in fact already going down again. Why? (T334949)
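
For readers unfamiliar with the global scap lock taken at 14:40: conceptually it is an exclusive advisory lock that other deploys must wait on (or fail against) until it is released. The sketch below shows only that general pattern; it is not scap's implementation, and the lock path and message handling are assumptions.

  # Conceptual sketch of a global deploy lock (NOT scap's implementation).
  # The lock file path and the reason handling are assumptions made for
  # illustration only.
  import fcntl
  from contextlib import contextmanager

  LOCK_PATH = "/tmp/deploy-global.lock"  # placeholder path

  @contextmanager
  def global_deploy_lock(reason: str):
      """Hold an exclusive advisory lock; other holders block until release."""
      with open(LOCK_PATH, "w") as fh:
          fcntl.flock(fh, fcntl.LOCK_EX)  # blocks while another deploy holds it
          fh.write(reason + "\n")
          fh.flush()
          try:
              yield
          finally:
              fcntl.flock(fh, fcntl.LOCK_UN)

  with global_deploy_lock("LVS maintenance in progress; no deploys"):
      print("lock held; other deploys block until this exits")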

Scorecard

Incident Engagement ScoreCard (all answers are yes/no; notes follow in parentheses)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Yes
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? Yes
  • Were pages routed to the correct sub-team(s)? No (no sub-team paging)
  • Were pages routed to online (business hours) engineers? (Answer "no" if engineers were paged after business hours.) Yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? Yes
  • Was a public wikimediastatus.net entry created? Yes
  • Is there a phabricator task for the incident? Yes
  • Are the documented action items assigned? No
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? No (the previous occurrence was a near-miss, not an incident)

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.) No
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? Yes
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? No

Total score (count of all "yes" answers above): 10