HTTP timeouts

This documents HTTP timeouts involved in a web request from end-users to a service behind WMF traffic layers.

Frontend TLS

The entry point for external clients is HAProxy. Which of the "cp" hosts is routed through, depends on the service and end-user IP address:

TLS termination layer	TLS handshake timeout	connect timeout (origin server)	TTFB (origin server)	successive reads (origin server)	Keepalive timeout (client)
haproxy	60 seconds	3 seconds	180 seconds	180 seconds	120 seconds

Caching

Our caching system is split in two layers (frontend, and backend). There is one implementation of the frontend layer (Varnish) and one implementation of the backend layer (Apache Traffic Server).

caching layer	connect timeout	TTFB	successive reads
varnish-frontend	3 seconds ^(text) / 5 seconds ^(upload)	65 seconds ^(text) / 35 seconds ^(upload)	33 seconds ^(text) / 60 seconds ^(upload)
ats-backend	10 seconds	180 seconds	180 seconds

App server

See also: MediaWiki at WMF#Timeouts

After leaving the backend caching layer, the request reaches the appserver. Here are described the timeouts that apply to appservers and api:

As of March 2020
layer	request timeout
Envoy (TLS)	203 seconds (appserver) / 65 seconds (api) / 86402.5 seconds (jobrunner, videoscaler) Configured by envoy::upstream_response_timeout
Apache	202 seconds (appserver, api, parsoid) / 1202 seconds (jobrunner) / 86402 seconds (videoscaler). Configured by `Apache Timeout`. Entire request-response, including connection time. Wall clock time.
php-fpm	201 seconds (appservers) / 201 seconds (api) / 201 seconds (parsoid) / 86400 seconds (jobrunner, videoscaler). Configured by `profile::mediawiki::php::request_timeout`. Wall clock time.
PHP	210 seconds (appserver, api, parsoid) / 1200 seconds (jobrunner, videoscaler). Configured by `max_execution_time`. CPU time (not including syscalls and C functions from extensions).
MediaWiki	60 seconds (GET) / 200 seconds (POST) / 1200 seconds (jobrunner) / 86400 seconds (videoscaler). This is configured using php-excimer

Notes

The app server timeouts might be larger than the ones on the caching layer, this is mainly to properly service internal clients.

php-fpm: The request_timeout setting the maximum time php-fpm will spend processing a request before terminating the worker process. This exists as a last-resort to kill PHP processes even if a long-running C function is not yielding to Excimer and/or if PHP raised max_execution_time at run-time.
PHP: The max_execution_time setting in php.ini measures CPU time (not wall clock time), and does not include syscalls.; Note that this is intentionally several seconds higher than the layers above and below because we generally want to avoid requests being stopped by this layer and prefer it to happen either earlier in MW or higher up in php-fpm.; This layer is not able to differentiate between HTTP methods (GET/POST) or virtual hostnames (jobrunner vs videoscaler). As such, it has to accomodate both.; For videoscalers this setting is actually lower than the surrounding layers (1200s/20min vs 86400s/24h). This is a compromise to prevent non-videoscaler jobs from being able to spend 24h on the CPU, which would be very unexpected. Regular jobrunners and videoscalers are forced to share the same php-fpm configuration. This is fine because while videoscaling jobs may use 24h to complete, they are expected to spend most of their time transcoding videos, which happens through syscalls that are not captured by PHP's cpu time.
MediaWiki: This is controlled by $wgRequestTimeLimit in wmf-config/CommonSettings.php. Upon reaching the timeout, the Wikimedia/RequestTimeout library will use php-excimer to throw a Wikimedia\RequestTimeout\TimeoutException exception once the current syscall returns.