SLO/Docker-registry
Status: approved
Organizational
Service
This page covers the highly available Docker registry hosted at docker-registry.wikimedia.org.
Teams
The Service Ops SRE team is the service owner of the Docker registry.
Architectural
Environmental dependencies
The Docker registry runs on Ganeti VMs in the eqiad and codfw data centers. The two data centers operate active/passive, selected via discovery DNS, and within each data center PyBal load-balances traffic across two VMs.
Service dependencies
Beyond the environmental dependencies above, the Docker registry's only hard dependency is Swift, its storage backend. Redis, used as a blob cache, is a soft dependency: during a Redis outage, pulling and pushing images would be slower but would still complete successfully.
Client-facing
Clients
The Docker registry is used by Kubernetes (partly via Dragonfly) and by CI.
Request Classes
Requests are classified by HTTP method and by the API endpoint:
- Manifest reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/manifests/<reference>.
- Tag reads are HTTP GET requests to URL paths of the form /v2/<name>/tags/list.
- Blob reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/blobs/<digest>.
- Manifest writes are HTTP PUT requests to URL paths of the form /v2/<name>/manifests/<reference>.
All other requests are ineligible for the SLO. Only requests sent to the active data center are eligible for the SLO.
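The classification rules above can be sketched as a small function. This is an illustration only, not the registry's or the monitoring stack's actual implementation: the labels (manifest_read and so on), the classify function, and the in_active_dc flag are hypothetical names, and the patterns assume that <name> may contain slashes while <reference> and <digest> may not.

```python
import re
from typing import Optional

# <name> may contain slashes (assumption); the last path segment may not.
_NAME = r"(?P<name>.+)"

# (hypothetical label, eligible HTTP methods, URL path pattern)
_RULES = [
    ("manifest_read", {"GET", "HEAD"}, re.compile(rf"^/v2/{_NAME}/manifests/[^/]+$")),
    ("tag_read", {"GET"}, re.compile(rf"^/v2/{_NAME}/tags/list$")),
    ("blob_read", {"GET", "HEAD"}, re.compile(rf"^/v2/{_NAME}/blobs/[^/]+$")),
    ("manifest_write", {"PUT"}, re.compile(rf"^/v2/{_NAME}/manifests/[^/]+$")),
]


def classify(method: str, path: str, in_active_dc: bool = True) -> Optional[str]:
    """Return the request class, or None if the request is ineligible for the SLO."""
    # Only requests sent to the active data center are eligible.
    if not in_active_dc:
        return None
    for label, methods, pattern in _RULES:
        if method.upper() in methods and pattern.match(path):
            return label
    # Everything else (blob uploads, catalog listings, ...) is ineligible.
    return None
```

For example, classify("HEAD", "/v2/example/blobs/sha256:abc") yields "blob_read", while a POST to /v2/example/blobs/uploads/ yields None.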
Service Level Indicators (SLIs)
- Latency SLI: The 95th-percentile request latency, as measured at the server side.
- Availability SLI: The percentage of all requests receiving a non-server-error response, defined as an HTTP status code below 500. (200 is the successful response status for reads; 201 or 202, as appropriate, for writes. 4xx response codes indicate request errors, such as an attempt to fetch a manifest that doesn't exist; for the purpose of calculating availability, these requests are also counted as successful.)
Both SLIs are computed over the Foundation-standard reporting periods: three calendar months, phased one month earlier than the fiscal quarter.
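As a rough illustration of the two SLI definitions, the sketch below computes them from raw per-request samples. It assumes the samples are already restricted to eligible requests within one reporting period. The function names are hypothetical, and the nearest-rank percentile method is an assumption: production monitoring typically estimates percentiles from histogram buckets instead.

```python
from typing import Sequence


def p95_latency(latencies_s: Sequence[float]) -> float:
    """95th-percentile latency in seconds, using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("no samples")
    ordered = sorted(latencies_s)
    # ceil(0.95 * n) in integer arithmetic, to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def availability_percent(status_codes: Sequence[int]) -> float:
    """Percentage of requests with a status code below 500 (4xx counts as success)."""
    if not status_codes:
        raise ValueError("no samples")
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)
```

With 98 responses of 200, one 404, and one 503, availability_percent returns 99.0: the 404 counts as successful and only the 503 counts against availability.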
Service Level Objectives
TODO: Grafana links
- The 95th-percentile latency for manifest reads will be less than 2 seconds.
- The 95th-percentile latency for tag reads will be less than 2 seconds.
- The 95th-percentile latency for manifest writes will be less than 3 seconds.
- The availability for manifest, tag, and blob reads, measured together, will be at least 99%.
Note that not all request classes have an objective for each SLI.
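The objectives above can be checked mechanically once the SLIs for a reporting period are known. The following is a minimal sketch under stated assumptions: the thresholds are taken from the list above, while the dictionary keys and the evaluate function are hypothetical names. Latency objectives are strict ("less than"); the availability objective is inclusive ("at least").

```python
from typing import Dict, Mapping

# Thresholds from the objectives above; the keys are hypothetical labels.
LATENCY_OBJECTIVES_S = {
    "manifest_read": 2.0,
    "tag_read": 2.0,
    "manifest_write": 3.0,
}
READ_AVAILABILITY_OBJECTIVE_PERCENT = 99.0


def evaluate(p95_by_class: Mapping[str, float],
             read_availability_percent: float) -> Dict[str, bool]:
    """Map each objective to True if it was met in the reporting period."""
    results = {
        f"{cls}_latency": p95_by_class[cls] < limit
        for cls, limit in LATENCY_OBJECTIVES_S.items()
    }
    results["read_availability"] = (
        read_availability_percent >= READ_AVAILABILITY_OBJECTIVE_PERCENT
    )
    return results
```

Blob reads have no latency objective and manifest writes have no availability objective, so neither appears in the result.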