Incidents/2021-09-04 appserver latency

From Wikitech

document status: final

Summary

An increase in load on a database server resulted in many queries being much slower to respond. This in turn meant backend traffic occupies appserver php-fpm workers for much longer, and a proportion of those requests will fail entirely due to unavailable workers. The failed requests got an error page with the message "upstream connect error or disconnect/reset before headers. reset reason: overflow".

Impact: For 37 minutes, backends were slow (taking several seconds to respond) and 2% of requests failed entirely. This affected logged-in users, most bots/API queries, and some page views from unregistered users for pages that were recently edited or otherwise expired from the CDN cache.

Documentation:

Actionables