document status: draft
|Incident ID||LVS||Start||2023-04-17 14:25:00|
|People paged||2||Responder count||9|
|Impact||For approximately 11 minutes, users accessing Wikimedia sites and services through our eqiad LVS received errors.|
During a scap deploy of MediaWiki, one of the LVS nodes in eqiad was taken down for reimaging. This caused many servers in eqiad to be depooled, leaving eqiad unable to serve traffic. To reduce impact, eqiad was depooled for user-facing edge traffic until servers could be repooled in eqiad and normal operation restored. All scap deployments were blocked while the incident was handled.
All times in UTC.
- 14:17 Scap backport begins (SAL)
- 14:21 reimage of lvs1020 begins (SAL)
- 14:25 Scap backport completes OK (SAL)
- 14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% OUTAGE BEGINS
- 14:30 Incident opened
- 14:30 DNS updated to depool eqiad
- 14:31 appserver, appserver_api, parsoid repooled in eqiad
- 14:36 Most user traffic now diverted away from eqiad OUTAGE ENDS
- 14:40 Global scap lock taken to prevent any further scap deploys until incident resolved and LVS maintenance complete
- 14:52 Incident closed, cleanup ongoing
- 15:18 confirmed that work on lvs1020 successfully completed
- 15:18 eqiad repooled
- 15:20 scap lock released
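The 14:30 depool was a GeoDNS-level change: eqiad was forced out of the geographic map so user-facing traffic resolved to other edge sites. A minimal sketch of what such a change looks like in gdnsd's admin_state forced-state file (the exact file path and map/resource names here are assumptions, not taken from the incident record):

```
# admin_state (assumed path: /var/lib/gdnsd/admin_state)
# Force the eqiad datacenter DOWN in the geographic map so that
# resolvers direct user traffic to the remaining edge sites.
geoip/generic-map/eqiad => DOWN
```

Reverting the depool at 15:18 corresponds to removing this forced state (or setting it back to UP) once the LVS maintenance was confirmed complete.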
The first sign of trouble was scap reporting errors to the person doing the scap deploy. Within a minute the first automatic error appeared:
14:25 <icinga-wm_> PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100%
Two VictorOps (VO) alerts fired, at 14:27 and 14:28:
<jinxer-wm> (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
<jinxer-wm> (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
Simultaneously, icinga reported many services in a critical state.
What went well?
- Automated alerting fired quickly
- Diagnosis and thus remediation of the outage was rapid
What went poorly?
- The scap lock was not widely known about (as it happens, nothing went wrong as a result of the delay in taking this lock out, but the risk was there)
- We didn't learn enough from the recent near-miss incident to avoid a production outage
Where did we get lucky?
- The recent near-miss (T334703) meant we knew almost immediately what the problem was, making it straightforward to restore service rapidly.
- Timing meant both US and EMEA SREs were available to respond
Links to relevant documentation
- Scap should document
- Scap deploys and LVS maintenance cannot happen at the same time (I think T334703 covers this OK)
- The api_appserver average latency exceeded alert fired late, when latency was in fact already going down again. Why? (T334949)
|People||Were the people responding to this incident sufficiently different than the previous five incidents?||yes|
|Were the people who responded prepared enough to respond effectively?||yes|
|Were fewer than five people paged?||yes|
|Were pages routed to the correct sub-team(s)?||no||no sub-team paging|
|Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.||yes|
|Process||Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?||yes|
|Was a public wikimediastatus.net entry created?||yes|
|Is there a phabricator task for the incident?||yes|
|Are the documented action items assigned?||no|
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?||no||previous occurrence was a near-miss, not an incident|
|Tooling||To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.||no|
|Were the people responding able to communicate effectively during the incident with the existing tooling?||yes|
|Did existing monitoring notify the initial responders?||yes|
|Were the engineering tools that were to be used during the incident available and in service?||yes|
|Were the steps taken to mitigate guided by an existing runbook?||no|
|Total score (count of all “yes” answers above)||10|