SLO/Trafficserver
Status: draft
Organizational
Service
This service comprises the Apache Trafficserver daemons used as backend caches as part of WMF’s production CDN edge infrastructure. This only covers the main trafficserver daemon (/usr/bin/traffic_server) in its capacity to accurately and functionally serve production HTTP requests in real time; it does not cover the various ancillary tools, binaries, statistics/logging mechanisms, etc. All production instances in all datacenters are covered, meaning all of the hardware machines currently named cp[0-9]*.{site}.wmnet.
Teams
SRE/Traffic owns and operates this service layer, and the same team is also responsible for both direct dependencies at the edges of this service: the L4LB in the outward-facing direction, and the Varnish frontend caches in the inward-facing direction. Therefore, while this service impacts many other teams and services, responsibility for it rests clearly with a single team. Other SRE subteams additionally share the burden of on-call incident response for this service.
Architectural
Environmental dependencies
This service runs independently in all 6 datacenters, and comprises two different clusters named text and upload. In any given DC, text and upload have identical hardware configurations and layouts. We can (should?) define SLOs both for the global aggregate of a cluster (which may also need to discount manually-depooled time windows for specific datacenters, as with the discussion above about L4LB depools within a DC?), and per-DC per-cluster. The physical characteristics differ per-DC as follows (these numbers are for a single cluster, text or upload):
Datacenter | DC Layout | Cluster Machines | Cluster Layout |
---|---|---|---|
eqiad | 4 rows, multiple racks each | 8 | 2 machines per row, each in a distinct rack |
codfw | 4 rows, multiple racks each | 8 | 2 machines per row, each in a distinct rack |
esams | 1 row, 3 racks | 8 | text: 3:3:2, upload: 2:3:3 (machines in each of the 3 racks) |
ulsfo | 1 row, 2 racks | 8 | 4:4 |
eqsin | 1 row, 2 racks | 8 | 4:4 |
drmrs | 1 row, 2 racks | 8 | 4:4 |
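The aggregation question above (global cluster SLI, discounting manually-depooled windows) can be sketched in code. The sketch below is illustrative only: the data shapes, datacenter names, and per-window counts are assumptions, not the actual reporting pipeline.

```python
# Illustrative sketch: aggregate a cluster SLI across DCs while
# discounting manually-depooled time windows per DC. Data shapes
# are assumed for the example, not taken from production tooling.

def aggregate_sli(per_dc_counts, depooled):
    """per_dc_counts: {dc: [(window_id, good, total), ...]}
    depooled: {dc: set of window_ids during which the DC was depooled}
    Returns the aggregate fraction of good requests, ignoring
    windows in which a DC was manually depooled."""
    good = total = 0
    for dc, windows in per_dc_counts.items():
        skip = depooled.get(dc, set())
        for window_id, g, t in windows:
            if window_id in skip:
                continue  # discount the depooled window entirely
            good += g
            total += t
    return good / total if total else 1.0

counts = {
    "eqiad": [(0, 999, 1000), (1, 998, 1000)],
    "esams": [(0, 500, 1000), (1, 1000, 1000)],  # window 0: esams depooled
}
print(aggregate_sli(counts, {"esams": {0}}))  # -> 0.999
```

The design choice worth noting: depooled windows are dropped from both numerator and denominator, so a DC taken out of rotation for maintenance neither helps nor hurts the aggregate.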
Service dependencies
Hard dependencies: the Varnish frontend caches. Trafficserver cannot serve any traffic if Varnish does not send requests its way.
Client-facing
Clients
All user-facing traffic comes from the Varnish frontend caches.
Service Level Indicators (SLIs)
Fraction of requests spending less than 50 milliseconds processing time inside of Trafficserver itself and without a Trafficserver internal error. A Trafficserver internal error is a 500 generated by Trafficserver itself (as opposed to one returned by an underlying backend service), excluding 503 Fetch Errors (failure to get a reply status from the backend service).
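As a rough illustration (not the production query), the SLI can be computed from per-request records of processing time and error classification; the field layout here is an assumption for the example:

```python
# Sketch of the SLI definition above: a request is "good" iff it spent
# under 50 ms inside Trafficserver AND did not hit a Trafficserver
# internal error. The tuple layout is assumed for illustration.

def sli(requests, threshold_ms=50):
    """requests: list of (processing_ms, internal_error) tuples,
    where internal_error is True for a 500 generated by Trafficserver
    itself (excluding 503 Fetch Errors, per the SLI definition)."""
    if not requests:
        return 1.0
    good = sum(1 for ms, internal_error in requests
               if ms < threshold_ms and not internal_error)
    return good / len(requests)

reqs = [(12, False), (48, False), (75, False), (30, True)]
print(sli(reqs))  # 2 of 4 requests are good -> 0.5
```

Note that a fast request that produced an internal error still counts against the SLI, as does a slow but successful one: both conditions must hold.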
Operational
Monitoring
The service is made of a supervisor process, traffic_manager, responsible for starting a child process. The latter handles actual traffic; the former handles administrative commands and restarts the child if it stops responding. There is an Icinga check ensuring that the child is responding to HTTP requests, as well as an additional check, 'trafficserver-backend-restart-count', which raises a critical alert if the supervisor process had to restart its child since it began operating.
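The restart-count check's logic can be sketched as follows. This is a hedged illustration of the idea only: the real check is an Icinga plugin, and the function and counter names here are assumptions.

```python
# Sketch of the 'trafficserver-backend-restart-count' check's logic:
# go CRITICAL if traffic_manager has restarted its child since the
# supervisor began operating. Names and counter source are assumed;
# the actual check is an Icinga plugin.

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga exit codes

def restart_count_check(restarts_at_start, restarts_now):
    """Compare the child-restart counter against its value when the
    supervisor started; any increase means the child was restarted."""
    delta = restarts_now - restarts_at_start
    if delta > 0:
        return CRITICAL, f"child restarted {delta} time(s)"
    return OK, "no child restarts"

print(restart_count_check(0, 1))  # -> (2, 'child restarted 1 time(s)')
```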
Troubleshooting
In most cases of anomalous operation it is sufficient to restart the service and open a task for the Traffic team with a description of the symptoms.
Deployment
The service is deployed by Puppet using the trafficserver puppet module.
Service Level Objectives
The agreed-upon SLO is 99.9% of requests spending less than 50 milliseconds processing time inside of Trafficserver itself and without a Trafficserver internal error.
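For operational intuition, a 99.9% objective leaves a 0.1% error budget. The arithmetic below illustrates what that budget means over a 30-day window; the 10,000 req/s figure is a made-up assumption, not a measured production rate.

```python
# Error-budget arithmetic for a 99.9% objective over a 30-day window.
# The request rate below is a hypothetical example, not production data.

slo = 0.999
window_days = 30
budget_fraction = 1 - slo  # 0.1% of requests may be slow or erroneous

# Time-based equivalent: minutes of fully-bad service per window.
bad_minutes = window_days * 24 * 60 * budget_fraction
print(f"{bad_minutes:.1f} bad minutes per {window_days}-day window")

# Request-based budget for a hypothetical 10,000 req/s cluster.
reqs_per_window = 10_000 * window_days * 24 * 3600
bad_requests = reqs_per_window * budget_fraction
print(f"{bad_requests:,.0f} bad requests per window")
```

Expressed either way, the budget is small: roughly three quarters of an hour of fully-degraded service, or one bad request in every thousand, per 30-day window.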