Incidents/2025-11-06 WAF

From Wikitech

document status: final

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-11-06 WAF
Start: 2025-11-06 11:20:00
Task:
End: 2025-11-06 11:24:00
People paged:
Responder count: Joe
Coordinators:
Affected metrics/SLOs:
Impact: Users across all sites might have received "too many requests" error responses. About 15,000 requests per second were affected, roughly 10% of the total request rate at the time.

A routine change to our Web Application Firewall erroneously set a rate-limit rule to throttle globally instead of per IP.

Timeline

All times in UTC.

  • 11:20 Giuseppe sets the "global rate limits" rule in our Web Application Firewall to global throttling instead of "throttling per ip". He commits the change, which is syntactically correct, without a +1 because it was an attempt to fix an ongoing issue with the CDN configuration. OUTAGE BEGINS
  • 11:22 We notice some people getting rate-limited.
  • 11:23 Giuseppe fixes the mistake and commits the corrected rule. OUTAGE ENDS
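
The difference between the two throttling modes in the 11:20 entry can be illustrated with a toy fixed-window counter. This is only a sketch and not the actual WAF implementation: the class name, the limit of 10 requests per window, and the client addresses (taken from a documentation-only IP range) are all assumptions made for illustration.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy fixed-window rate limiter (hypothetical, for illustration only)."""

    def __init__(self, limit, per_ip):
        self.limit = limit    # max requests allowed per window, per counter
        self.per_ip = per_ip  # True: one counter per client IP; False: one shared counter
        self.counts = defaultdict(int)

    def allow(self, client_ip):
        # With global throttling every client shares the same counter key.
        key = client_ip if self.per_ip else "*"
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def count_rejected(limiter, requests):
    """Return how many requests would receive a 429 in a single window."""
    return sum(1 for ip in requests if not limiter.allow(ip))


# 100 distinct hypothetical clients, each sending 5 requests in one window.
requests = [f"198.51.100.{i}" for i in range(100) for _ in range(5)]

per_ip_rejected = count_rejected(FixedWindowLimiter(limit=10, per_ip=True), requests)
global_rejected = count_rejected(FixedWindowLimiter(limit=10, per_ip=False), requests)

print(per_ip_rejected, global_rejected)  # prints "0 490"
```

With a per-IP key no well-behaved client is throttled, while the same limit applied to a single shared counter rejects almost all traffic, which is the kind of effect users would see as "too many requests" errors.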

Detection

We detected the problem well before any alerting could fire, because Effie got an HTTP 429 (Too Many Requests) response while looking at dashboards.

Conclusions

Making hasty changes to critical systems, where a mistake can cause fallout like this, is always a bad idea, even when trying to repair a known-bad situation that is not user-visible.

What went well?

  • The WAF worked as expected, and the fix propagated quickly.

What went poorly?

  • I (joe) didn't ask for a +1 on my modification because I thought I was just rolling back to the previous state. A plain operational mistake on my part.

Where did we get lucky?

  • We were lucky that one of us noticed the issue immediately, so the impact was minimal.