Incidents/2024-12-03 Port saturation from cached Varnish HEAD-GET upgrades
document status: final
Summary
| Incident ID | 2024-12-03 Port saturation from cached Varnish HEAD-GET upgrades | Start | 2024-12-03 19:59:00 |
|---|---|---|---|
| Task | T381771 | End | 2024-12-09 10:28:00 |
| People paged | 10+ | Responder count | 5 |
| Coordinators | ? | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | Not particularly public | | |
Incoming HEAD requests from Wikimedia Enterprise (WME) to Apache Traffic Server (ATS) resulted in ATS sending a GET to the backend and caching the response. This "poisoned" the cache with irrelevant objects, and the resulting backend traffic saturated internal network ports. Changing ATS's caching behavior to skip such caching for requests coming from WME IPs restored desired traffic levels.
Timeline
All times in UTC.
2024-12-03 occurrence
- 19:59 SATURATION BEGINS (Paging incident)
- 19:59 Incident acknowledged
- 20:04 SATURATION ENDS
- 21:33 SATURATION BEGINS (Paging incident)
- 21:33 Incident acknowledged
- 21:38 SATURATION ENDS
- 22:08 SATURATION BEGINS (Paging incident)
- 22:08 Incident acknowledged
- 23:03 SATURATION ENDS
2024-12-07 occurrence
- 13:59 SATURATION BEGINS (Paging incident)
- 14:19 SATURATION ENDS
- 18:18 SATURATION BEGINS (Paging incident)
- 18:18 Incident acknowledged
- 18:28 SATURATION ENDS
2024-12-09 occurrence
- 01:38 SATURATION BEGINS (Paging incident)
- 01:43 SATURATION ENDS
- 10:23 SATURATION BEGINS (Paging incident)
- 10:27 Incident acknowledged
- 10:28 SATURATION ENDS
- 16:41 Gerrit change 1101547 is merged and deployed
- 17:15 Followup Gerrit change 1101561 is merged and deployed. Traffic is reduced.
- 17:56 bblack details discoveries on HEAD→GET behaviors in Varnish (a VCL sketch illustrating this follows the list below):
- A client HEAD request on a new, unique URI that no cache has seen before behaves as follows:
- Varnish converts the miss to a GET towards ATS.
- ATS in turn also GETs the file from Swift.
- Varnish sees the GET response with Content-Length >= 8MB, marks this URI in cache as hit-for-pass for future requests, and synthesizes a HEAD-style response to the client (forwards the headers, but not the body).
- When a second client HEAD request comes through the same caches for the same file:
- Varnish sees the hit-for-pass object, so it passes the HEAD request through to ATS directly (now ATS sees a HEAD instead of a GET).
- Varnish converts HEAD to GET by default at its layer, unless it has already done that once, noticed the excessive size, and marked the URI as hit-for-pass.
- ATS now sees it as HEAD as well.
- As WME traffic is scanning, their requests are likely to miss; that is probably why most of them land at ATS already as GETs rather than HEADs.
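The behavior above can be pictured with a minimal VCL sketch. This is illustrative only, assuming Varnish 5+ with vmod_std; the 8 MB threshold mirrors the cutoff described above, while the backend definition and hit-for-pass TTL are placeholders, not the production configuration:

```vcl
vcl 4.1;

import std;

backend ats {
    .host = "127.0.0.1";   # placeholder for the local ATS backend
    .port = "3128";
}

sub vcl_backend_response {
    # The client sent a HEAD, but the cache miss was fetched as a GET.
    # If the body turns out to be large, do not cache it: create a
    # hit-for-pass object so that, for its lifetime, later requests for
    # this URI bypass the cache lookup entirely.
    if (std.integer(beresp.http.Content-Length, 0) >= 8388608) {
        return (pass(10m));  # hit-for-pass TTL is a placeholder
    }
}
```

The key distinction is that a cache miss is fetched with a GET regardless of the client's method, whereas a pass forwards the request with its original method; once the hit-for-pass object exists, the second HEAD is passed and therefore reaches ATS as a HEAD.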
Detection
jinxer-wm alerted SRE in #wikimedia-operations with appropriate messaging:
FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
While the symptoms were clearly indicated by the alert, the underlying cause was tricky to locate due to its technical nature, not because of any gap in the alerting in place. The implemented change can be considered a performance improvement rather than a defect fix.
- Graph: asw2-b-eqiad:xe-2/2/0/45 ⇄ cr1-eqiad:xe-3/2/3 traffic over several days.
- Graph: the same link's traffic, zoomed in to before and after the patch was merged.
- Graph: cp1107 traffic declined after the patch was merged.
- Graph: telling ATS not to cache these requests converts a client hangup into ATS hanging up on the backend as well.
- Graph: ATS resources, such as Lua threads, were being heavily consumed.
Conclusions
What went well?
- WME has been quite collaborative on reducing the load, even though they are on a short schedule with this scraping.
What went poorly?
- Alerts were not specific enough and needed digging to figure out where the issues were occurring.
- Incident responders had difficulty understanding the issue and using their time effectively to investigate it.
- The on-call responder was unable to get back up to speed because there was no incident doc or Phabricator task; the issue was only mentioned in #wikimedia-sre-private a few times during informal handoffs.
- The issue occurred over the weekend.
Where did we get lucky?
- Our "attacker" was an internal team and we could coordinate with them.
- The saturation was internal; no real-user-visible issues were caused, as far as we know.
Links to relevant documentation
The Alertmanager alert primarily advertises bitly's /wmf-librenms slug, which links to the (D)DoS Playbook Google Doc. The linked section of that doc suggests Network monitoring#LibreNMS_alerts.
Actionables
- Determine if the caching logic change implemented against WME should be implemented globally. (task T382276)
A few actionables were posted at task T381771#10391988:
- "at least if we limited it to WME UA and/or IP, it might be nice to just pass HEAD through as HEAD at all layers, uncacheable." bblack proposes something like P71676, (Gerrit patch) which, while untested, hooks
vcl_miss()
so that we only trigger this behavior when we don't already have the object in cache from other users (task T382274) - Add WME IPs as an ipblock in requestctl (task T382275)
- At the moment WME's ip address is excluded from requestctl processing
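Below is a minimal, untested VCL sketch of that vcl_miss() idea; it is not the contents of P71676 itself, and the ACL name, addresses, and backend are placeholders standing in for the requestctl ipblock proposed in T382275:

```vcl
vcl 4.1;

backend ats {
    .host = "127.0.0.1";   # placeholder for the local ATS backend
    .port = "3128";
}

# Placeholder ACL; real WME addresses would come from the requestctl ipblock.
acl wme_scrapers {
    "192.0.2.0"/24;
    "198.51.100.0"/24;
}

sub vcl_miss {
    # vcl_miss only runs when the object is not already in cache for other
    # users. Turning the miss into a pass forwards the request with its
    # original method, so a WME HEAD stays a HEAD toward ATS and Swift and
    # nothing is cached on its behalf.
    if (req.method == "HEAD" && client.ip ~ wme_scrapers) {
        return (pass);
    }
}
```

This keeps cache behavior unchanged for everyone else: non-WME clients, and any URI already cached by other users, still follow the normal lookup path.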
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | No Google doc |
| | Was a public wikimediastatus.net entry created? | no | No visible user impact |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 7 | |