Incidents/2024-12-03 Port saturation from cached Varnish HEAD-GET upgrades
document status: final
Summary
| Incident ID | 2024-12-03 Port saturation from cached Varnish HEAD-GET upgrades | Start | 2024-12-03 19:59:00 |
|---|---|---|---|
| Task | T381771 | End | 2024-12-09 10:28:00 |
| People paged | 10+ | Responder count | 5 |
| Coordinators | ? | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | Not particularly public | | |
Incoming HEAD requests from Wikimedia Enterprise (WME) to Apache Traffic Server (ATS) resulted in ATS sending a GET to the backend and caching the response. This "poisoned" the cache with irrelevant objects, and the resulting backend traffic saturated internal network ports. Changing ATS's caching behavior to skip such caching for requests coming from WME IPs restored desired traffic levels.
Timeline
All times in UTC.
2024-12-03 occurrence
- 19:59 SATURATION BEGINS (Paging incident)
- 19:59 Incident acknowledged
- 20:04 SATURATION ENDS
- 21:33 SATURATION BEGINS (Paging incident)
- 21:33 Incident acknowledged
- 21:38 SATURATION ENDS
- 22:08 SATURATION BEGINS (Paging incident)
- 22:08 Incident acknowledged
- 23:03 SATURATION ENDS
2024-12-07 occurrence
- 13:59 SATURATION BEGINS (Paging incident)
- 14:19 SATURATION ENDS
- 18:18 SATURATION BEGINS (Paging incident)
- 18:18 Incident acknowledged
- 18:28 SATURATION ENDS
2024-12-09 occurrence
- 01:38 SATURATION BEGINS (Paging incident)
- 01:43 SATURATION ENDS
- 10:23 SATURATION BEGINS (Paging incident)
- 10:27 Incident acknowledged
- 10:28 SATURATION ENDS
- 16:41 Gerrit change 1101547 is merged and deployed
- 17:15 Followup Gerrit change 1101561 is merged and deployed. Traffic is reduced.
- 17:56 bblack details discoveries on HEAD→GET behaviors in Varnish (a VCL sketch illustrating this follows the list below):
- A client HEAD request on a new, unique URI that no cache has seen before behaves as follows:
- Varnish converts the miss to a GET towards ATS.
- ATS in turn also GETs the file from Swift.
- Varnish sees the GET response with Content-Length >= 8MB, marks this URI in cache as hit-for-pass for future requests, and synthesizes a HEAD-style response to the client (forwards the headers, but not the body).
- When a second client HEAD request comes through the same caches for the same file:
- Varnish sees the hit-for-pass object, so it passes the HEAD request through to ATS directly (now ATS sees a HEAD instead of a GET).
- Varnish converts HEAD to GET by default at its layer, unless it has already done that once, noticed the excessive size, and marked the URI as hit-for-pass.
- ATS now sees it as HEAD as well.
- As WME traffic is scanning, their requests are likely to miss; that is probably why most of them land at ATS already as GETs rather than HEADs.
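The behavior above can be pictured with a minimal VCL sketch. This is illustrative only, assuming Varnish 5+ with vmod_std; the 8 MB threshold mirrors the cutoff described above, while the backend definition and hit-for-pass TTL are placeholders, not the production configuration:

```vcl
vcl 4.1;

import std;

backend ats {
    .host = "127.0.0.1";   # placeholder for the local ATS backend
    .port = "3128";
}

sub vcl_backend_response {
    # The client sent a HEAD, but the cache miss was fetched as a GET.
    # If the body turns out to be large, do not cache it: create a
    # hit-for-pass object so that, for its lifetime, later requests for
    # this URI bypass the cache lookup entirely.
    if (std.integer(beresp.http.Content-Length, 0) >= 8388608) {
        return (pass(10m));  # hit-for-pass TTL is a placeholder
    }
}
```

The key distinction is that a cache miss is fetched with a GET regardless of the client's method, whereas a pass forwards the request with its original method; once the hit-for-pass object exists, the second HEAD is passed and therefore reaches ATS as a HEAD.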
Detection
jinxer-wm alerted SRE in #wikimedia-operations with appropriate messaging:
FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
While the symptoms were clearly indicated by the alert, the underlying cause was tricky to locate due to its technical nature, not because of any gap in the alerting in place. The implemented change can be considered a performance improvement rather than a defect fix.
- Graph: asw2-b-eqiad:xe-2/2/0/45 ⇄ cr1-eqiad:xe-3/2/3 traffic over several days.
- Graph: the same link's traffic, zoomed in to before and after the patch was merged.
- Graph: cp1107 traffic declined after the patch was merged.
- Graph: telling ATS not to cache these requests converts a client hangup into ATS hanging up on the backend as well.
- Graph: ATS resources, such as Lua threads, were being heavily consumed.
Conclusions
What went well?
- WME has been quite collaborative on reducing the load, even though they are on a short schedule with this scraping.
What went poorly?
- Alerts were not specific enough and needed digging to figure out where the issues were occurring.
- Incident responders had difficulty understanding the issue and using their time effectively to investigate it.
- The on-call responder was unable to get back up to speed because there was no incident doc or Phabricator task; the issue was only mentioned in #wikimedia-sre-private a few times during informal handoffs.
- The issue occurred over the weekend.
Where did we get lucky?
- Our "attacker" was an internal team and we could coordinate with them.
- The saturation was internal; no real-user-visible issues were caused, as far as we know.
Links to relevant documentation
The Alertmanager alert primarily advertises bitly's /wmf-librenms slug, which links to the (D)DoS Playbook Google Doc. The linked section of that doc suggests Network monitoring#LibreNMS_alerts.
Actionables
- Determine if the caching logic change implemented against WME should be implemented globally. (task T382276)
A few actionables were posted at task T381771#10391988:
- "at least if we limited it to WME UA and/or IP, it might be nice to just pass HEAD through as HEAD at all layers, uncacheable." bblack proposes something like P71676, (Gerrit patch) which, while untested, hooks
vcl_miss()
so that we only trigger this behavior when we don't already have the object in cache from other users (task T382274) - Add WME IPs as an ipblock in requestctl (task T382275)
- At the moment WME's ip address is excluded from requestctl processing
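Below is a minimal, untested VCL sketch of that vcl_miss() idea; it is not the contents of P71676 itself, and the ACL name, addresses, and backend are placeholders standing in for the requestctl ipblock proposed in T382275:

```vcl
vcl 4.1;

backend ats {
    .host = "127.0.0.1";   # placeholder for the local ATS backend
    .port = "3128";
}

# Placeholder ACL; real WME addresses would come from the requestctl ipblock.
acl wme_scrapers {
    "192.0.2.0"/24;
    "198.51.100.0"/24;
}

sub vcl_miss {
    # vcl_miss only runs when the object is not already in cache for other
    # users. Turning the miss into a pass forwards the request with its
    # original method, so a WME HEAD stays a HEAD toward ATS and Swift and
    # nothing is cached on its behalf.
    if (req.method == "HEAD" && client.ip ~ wme_scrapers) {
        return (pass);
    }
}
```

This keeps cache behavior unchanged for everyone else: non-WME clients, and any URI already cached by other users, still follow the normal lookup path.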
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | No Google doc |
| | Was a public wikimediastatus.net entry created? | no | No visible user impact |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 7 | |