Incidents/2023-04-17 eqiad/LVS

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: LVS
Start: 2023-04-17 14:25:00
End: 2023-04-17 14:36
Task: T334703
People paged: 2
Responder count: 9
Coordinators: Matthew
Affected metrics/SLOs:
Impact: For approximately 11 minutes, users accessing Wikimedia sites and services through our eqiad LVS received an error.

During a scap deploy of MediaWiki, one of the LVS nodes in eqiad was taken down to be reimaged. This resulted in many servers in eqiad being depooled and eqiad being unable to serve traffic. To reduce impact, eqiad was depooled for user-facing edge traffic until the affected servers could be repooled and normal operations restored. All scap deployments were blocked while the incident was handled.
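
For readers unfamiliar with the repool step: pooled/depooled state for the application-server clusters is managed with conftool. The sketch below is illustrative only; the confctl selectors and the Python wrapper are assumptions made for this write-up, not a record of the commands actually run during the incident.

  # Illustrative sketch only: repool the eqiad application-server clusters
  # named in the timeline below. The selectors are assumptions, not the
  # exact commands used during the incident.
  import subprocess

  CLUSTERS = ["appserver", "appserver_api", "parsoid"]

  def repool(cluster: str, dc: str = "eqiad") -> None:
      """Set pooled=yes for every host confctl selects in the cluster."""
      selector = f"dc={dc},cluster={cluster}"
      subprocess.run(
          ["sudo", "confctl", "select", selector, "set/pooled=yes"],
          check=True,
      )

  for cluster in CLUSTERS:
      repool(cluster)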

Timeline

All times in UTC.

  • 14:17 Scap backport begins SAL
  • 14:21 reimage of lvs1020 begins SAL
  • 14:25 Scap backport completes OK SAL
  • 14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% OUTAGE BEGINS
  • 14:30 Incident opened
  • 14:30 DNS updated to depool eqiad
  • 14:31 appserver, appserver_api, parsoid repooled in eqiad
  • 14:36 Most user traffic now diverted away from eqiad OUTAGE ENDS
  • 14:40 Global scap lock taken to prevent any further scap deploys until incident resolved and LVS maintenance complete
  • 14:52 Incident closed, cleanup ongoing
  • 15:18 confirmed that work on lvs1020 successfully completed
  • 15:18 eqiad repooled
  • 15:20 scap lock released

Graphs

HAProxy availability

HAProxy graph of the incident, showing impact on availability

Traffic throughput

Traffic throughput (from Varnish) from the incident

Error codes (ATS)

ATS errors from backend during the incident

Detection

The first sign of trouble was scap reporting errors to the person doing the scap deploy. Within a minute the first automatic error appeared:

14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100%

Two VO (VictorOps) paging alerts fired, at 14:27 and 14:28:

<jinxer-wm> (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
<jinxer-wm> (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable

Simultaneously, icinga reported many service records in *.svc.eqiad.wmnet as unavailable.
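
For context, the ProbeDown alert is driven by blackbox HTTP probes against the load-balanced service addresses; it fires when the probes stop getting healthy responses through LVS. A minimal stand-in for such a probe is sketched below; the target URL and timeout are placeholders, not the production probe configuration.

  # Minimal stand-in for the kind of HTTP probe behind ProbeDown.
  # The target URL is a placeholder; the real probes are defined in the
  # Prometheus / blackbox exporter configuration.
  import sys
  import requests

  TARGET = "https://api.svc.example.wmnet/healthz"  # placeholder URL
  TIMEOUT = 3  # seconds; real probe timeouts may differ

  def probe(url: str) -> bool:
      """Return True if the service answers with a non-5xx response."""
      try:
          return requests.get(url, timeout=TIMEOUT).status_code < 500
      except requests.RequestException:
          return False

  ok = probe(TARGET)
  print(f"{TARGET}: {'UP' if ok else 'DOWN'}")
  sys.exit(0 if ok else 1)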

Conclusions

What went well?

  • Automated alerting fired quickly
  • Diagnosis, and thus remediation, of the outage was rapid

What went poorly?

  • The scap lock was not widely known about (as it happens, nothing went wrong as a result of the delay in taking the lock out, but the risk was there)
  • We didn't learn enough from the recent near-miss incident to avoid a production outage

Where did we get lucky?

  • The recent near-miss (T334703) meant we knew almost immediately what the problem was, making it straightforward to restore service rapidly.
  • Timing meant both US and EMEA SREs were available to respond

Links to relevant documentation

  • n/a?

Actionables

  • Scap should document scap lock (DONE; see the conceptual lock sketch after this list)
  • Scap deploys and LVS maintenance cannot happen at the same time (I think T334703 covers this OK)
  • The "api_appserver average latency exceeded" alert fired late, when latency was in fact already going down again. Why? (T334949)
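
For readers unfamiliar with the global scap lock taken at 14:40: conceptually it is an exclusive advisory lock that other deploys must wait on (or fail against) until it is released. The sketch below shows only that general pattern; it is not scap's implementation, and the lock path and message handling are assumptions.

  # Conceptual sketch of a global deploy lock (NOT scap's implementation).
  # The lock file path and the reason handling are assumptions made for
  # illustration only.
  import fcntl
  from contextlib import contextmanager

  LOCK_PATH = "/tmp/deploy-global.lock"  # placeholder path

  @contextmanager
  def global_deploy_lock(reason: str):
      """Hold an exclusive advisory lock; other holders block until release."""
      with open(LOCK_PATH, "w") as fh:
          fcntl.flock(fh, fcntl.LOCK_EX)  # blocks while another deploy holds it
          fh.write(reason + "\n")
          fh.flush()
          try:
              yield
          finally:
              fcntl.flock(fh, fcntl.LOCK_UN)

  with global_deploy_lock("LVS maintenance in progress; no deploys"):
      print("lock held; other deploys block until this exits")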

Scorecard

Incident Engagement ScoreCard (all answers are yes/no; notes follow in parentheses)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Yes
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? Yes
  • Were pages routed to the correct sub-team(s)? No (no sub-team paging)
  • Were pages routed to online (business hours) engineers? (Answer "no" if engineers were paged after business hours.) Yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? Yes
  • Was a public wikimediastatus.net entry created? Yes
  • Is there a phabricator task for the incident? Yes
  • Are the documented action items assigned? No
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? No (the previous occurrence was a near-miss, not an incident)

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.) No
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? Yes
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? No

Total score (count of all "yes" answers above): 10