SLO/Docker-registry

Status: approved

Organizational

Service

This page covers the highly available Docker registry hosted at docker-registry.wikimedia.org.

Teams

The Service Ops SRE team is the service owner of the Docker registry.

Architectural

Environmental dependencies

The Docker registry runs on Ganeti VMs in the eqiad and codfw data centers. The two data centers operate active/passive via discovery DNS, and within each data center PyBal load-balances traffic across two VMs.

Service dependencies

Beyond the environmental dependencies above, the Docker registry's only hard dependency is Swift, its storage backend. Redis, used as a blob cache, is a soft dependency: during a Redis outage, pulling and pushing images would be slower but would still complete successfully.

Client-facing

Clients

The Docker registry is used by Kubernetes (partly via Dragonfly) and by CI.

Request Classes

Requests are classified by HTTP method and by the API endpoint:

Manifest reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/manifests/<reference>.

Tag reads are HTTP GET requests to URL paths of the form /v2/<name>/tags/list.

Blob reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/blobs/<digest>.

Manifest writes are HTTP PUT requests to URL paths of the form /v2/<name>/manifests/<reference>.

All other requests are ineligible for the SLO. Only requests sent to the active data center are eligible for the SLO.
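The classification rules above can be sketched as a small function. This is an illustrative sketch only, not the code used to produce the SLO metrics: the function name `classify_request`, the class labels, and the `active_dc` flag are hypothetical, and the regular expressions assume that `<name>` may contain slashes while `<reference>` and `<digest>` may not.

```python
import re
from typing import Optional

# Hypothetical path patterns. <name> is matched greedily because repository
# names may contain slashes; the fixed path component anchors the match.
_PATTERNS = [
    ("manifest", re.compile(r"^/v2/(?P<name>.+)/manifests/(?P<reference>[^/]+)$")),
    ("tag", re.compile(r"^/v2/(?P<name>.+)/tags/list$")),
    ("blob", re.compile(r"^/v2/(?P<name>.+)/blobs/(?P<digest>[^/]+)$")),
]


def classify_request(method: str, path: str, active_dc: bool = True) -> Optional[str]:
    """Return the SLO request class, or None if the request is ineligible."""
    if not active_dc:
        # Only requests sent to the active data center are eligible.
        return None
    method = method.upper()
    for kind, pattern in _PATTERNS:
        if not pattern.match(path):
            continue
        if kind == "manifest":
            if method in ("GET", "HEAD"):
                return "manifest_read"
            if method == "PUT":
                return "manifest_write"
        elif kind == "tag" and method == "GET":
            return "tag_read"
        elif kind == "blob" and method in ("GET", "HEAD"):
            return "blob_read"
        # The path matched a known endpoint but the method is not covered.
        return None
    return None
```

Under these assumptions, a blob upload such as `POST /v2/<name>/blobs/uploads/` and a `HEAD` request for a tag list both fall outside the four classes and return `None`.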

Service Level Indicators (SLIs)

Latency SLI: The 95th-percentile request latency, measured on the server side.

Availability SLI: The percentage of eligible requests that receive a non-server-error response, defined as an HTTP status code below 500. (For reads, the successful response status is 200; for writes, it is 201 or 202 as appropriate. 4xx response codes indicate request errors, like an attempt to fetch a manifest that doesn't exist; for the purpose of calculating availability, these requests are also counted as successful.)

Both SLIs are computed over the Foundation-standard reporting periods: three calendar months, phased one month earlier than the fiscal quarter.
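As a rough illustration of the two SLI definitions, the sketch below computes them from raw per-request samples collected over one reporting period. How the real measurements are collected and aggregated is not specified here; the function names are hypothetical, and the nearest-rank percentile method is an assumption (other percentile definitions give slightly different values).

```python
from typing import Sequence


def p95_latency(latencies_s: Sequence[float]) -> float:
    """95th-percentile latency in seconds, using the nearest-rank method."""
    n = len(latencies_s)
    if n == 0:
        raise ValueError("no requests in the reporting period")
    ordered = sorted(latencies_s)
    # 1-based rank = ceil(0.95 * n), computed in integers to avoid float error.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]


def availability_pct(status_codes: Sequence[int]) -> float:
    """Percentage of requests whose HTTP status code is below 500."""
    if not status_codes:
        raise ValueError("no requests in the reporting period")
    # 2xx, 3xx, and 4xx responses all count as successful for availability.
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)
```

For example, `availability_pct([200, 404, 503, 202])` evaluates to 75.0, because the 404 counts as successful and only the 503 counts against availability.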

Service Level Objectives

TODO: Grafana links

The 95th-percentile latency for manifest reads will be less than 2 seconds.

The 95th-percentile latency for tag reads will be less than 2 seconds.

The 95th-percentile latency for manifest writes will be less than 3 seconds.

The availability for manifest, tag, and blob reads, measured together, will be at least 99%.

Note that not all request classes have an objective for each SLI: blob reads have no latency objective, and manifest writes have no availability objective.
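The objectives above can be expressed as a simple check against measured values for one reporting period. This is a sketch under stated assumptions: the names `evaluate_slo`, `LATENCY_OBJECTIVES_S`, and the class labels are hypothetical, and treating a period with no read traffic as not meeting the availability objective is an arbitrary choice made for the example.

```python
from typing import Dict

# Latency objectives (seconds) for the classes that have one.
LATENCY_OBJECTIVES_S = {
    "manifest_read": 2.0,
    "tag_read": 2.0,
    "manifest_write": 3.0,
}
# Availability is measured over all three read classes together.
AVAILABILITY_CLASSES = ("manifest_read", "tag_read", "blob_read")
AVAILABILITY_OBJECTIVE_PCT = 99


def evaluate_slo(
    p95_by_class: Dict[str, float],
    totals: Dict[str, int],
    server_errors: Dict[str, int],
) -> Dict[str, bool]:
    """Return, per objective, whether it was met in the reporting period."""
    results: Dict[str, bool] = {}
    for cls, limit in LATENCY_OBJECTIVES_S.items():
        # The objective is "less than", so a p95 equal to the limit fails.
        results[f"latency:{cls}"] = p95_by_class[cls] < limit
    total = sum(totals.get(c, 0) for c in AVAILABILITY_CLASSES)
    errors = sum(server_errors.get(c, 0) for c in AVAILABILITY_CLASSES)
    # "At least 99%" checked in integers: 100 * ok >= 99 * total.
    # Assumption: a period with no read traffic is reported as not met.
    results["availability:reads"] = (
        total > 0 and 100 * (total - errors) >= AVAILABILITY_OBJECTIVE_PCT * total
    )
    return results
```

With 1,000 read requests in a period, the 99% objective tolerates at most 10 server errors across manifest, tag, and blob reads combined; manifest-write errors do not count toward it.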