Incidents/2021-09-04 appserver latency
Appearance
document status: final
Summary
An increase in load on a database server resulted in many queries being much slower to respond. This in turn meant backend traffic occupies appserver php-fpm workers for much longer, and a proportion of those requests will fail entirely due to unavailable workers. The failed requests got an error page with the message "upstream connect error or disconnect/reset before headers. reset reason: overflow".
Impact: For 37 minutes, backends were slow (taking several seconds to respond) and 2% of requests failed entirely. This affected logged-in users, most bots/API queries, and some page views from unregistered users for pages that were recently edited or otherwise expired from the CDN cache.
Documentation:
- Public task about incident. T290373
- Grafana: Application RED dashboard
-
HTTP error rates.
-
Latency buckets.
-
Latency quantile estimates.
Actionables
- T277416 (restricted)