
Docker-registry/Runbook

From Wikitech
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

On-call guide

Swift is down or content has gone corrupt

If Swift content is gone, the registry cannot serve images or write new ones. Focus first on the Swift outage; once it is resolved, check whether the content is still there or whether every image needs to be rebuilt and republished. One way to mitigate this failure to some extent is to cache the most frequently requested registry content on the caching layer (Varnish or ATS), which would still allow pulling the most used images.

Redis is down

Redis is only used as a blob cache. If it is down, pulling and pushing images will be slower, but the registry will continue to serve content.

Registry is down in one DC, how to fail over to the passive one?

Use the common pool/depool operations. Take into account that clients must use the discovery DNS entry and that Swift replication is usually slow, so an image pushed just before the outage has probably not been replicated to the other DC yet.

Users should trigger a rebuild/repush, or cache the image locally, to avoid issues.

Cannot pull some layer/image, or it looks corrupt: how to debug problems?

Check swift replication

  • See the section "The replication between Swift codfw and eqiad is not working" below.

Replication seems synced but users/monitoring report problems pulling images

  • Check the registry logs; they should report which image/manifest/tag is failing.
  • Increase the registry log level to debug: edit the registry configuration in /etc/docker/registry/config.yml so that the log key looks like the snippet below, then restart the registry with systemctl restart docker-registry
log:
  level: debug
  • The registry will log which Swift or Redis calls are failing; from there you will need to dig further into Swift or Redis. In the event of a Swift failure, republishing the images usually helps.
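
As a rough aid for the log check above, the following sketch filters log lines down to warnings and errors. It assumes the registry writes its default text log format, in which every entry carries a level=<name> field; the function name registry_log_errors and the journalctl invocation in the comment are illustrative assumptions, not part of the existing tooling.

```shell
#!/bin/sh
# Sketch: keep only warning/error entries from registry log lines read on stdin.
# Assumption: text log format where each entry contains level=<name>.
registry_log_errors() {
    grep -E 'level=(warning|error|fatal|panic)'
}

# Hypothetical usage, assuming the docker-registry unit logs to journald:
#   journalctl -u docker-registry --since '15 min ago' | registry_log_errors
```

If the registry is configured with a JSON log formatter instead, the pattern would need to be adapted.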

The replication between Swift codfw and eqiad is not working.

The first sanity check is to execute the following:

  • SSH into a registry host in eqiad and run as root source /etc/swift/account_docker\:registry.env && swift stat docker_registry_codfw
  • SSH into a registry host in codfw and run as root source /etc/swift/account_docker\:registry.env && swift stat docker_registry_codfw
  • Compare the number of objects and the size. Replication is slow, especially after a spike of activity. Also check the Swift metrics in the docker-registry dashboard.
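
The comparison above can be sketched as a pair of small helpers that work on saved swift stat output. This assumes the swift CLI prints "Objects: N" and "Bytes: N" lines for a container; the function names and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: print "<objects> <bytes>" extracted from `swift stat <container>`
# output read on stdin. Assumes "Objects:" and "Bytes:" lines are present.
swift_stat_summary() {
    awk -F': *' '
        $1 ~ /^ *Objects$/ { objects = $2 }
        $1 ~ /^ *Bytes$/   { bytes = $2 }
        END { print objects, bytes }
    '
}

# Succeeds when two saved outputs (one per DC) report the same counts.
swift_stat_in_sync() {
    [ "$(swift_stat_summary < "$1")" = "$(swift_stat_summary < "$2")" ]
}

# Hypothetical usage, after saving the output of each DC to a file:
#   swift_stat_in_sync /tmp/stat.eqiad /tmp/stat.codfw || echo "out of sync"
```

A mismatch right after a burst of pushes is expected, so it is worth repeating the check a few times before concluding that replication is broken.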

If the numbers are not in sync, there is a problem. The Docker Registry is currently (Dec 2025) the only Swift client using the native container replication settings, that is, Swift itself takes care of replicating the objects. Every Swift host runs a service called swift-container-sync, but only three of them handle the replication of a given container. In our case, since the replication goes from codfw to eqiad, we should check whether any error is reported on those hosts. To find them quickly, you can run something like the following from the cumin hosts:

sudo cumin 'ms-be2*' 'systemctl status swift-container-sync.service' -b 10 -p 90

The most recent incident was https://phabricator.wikimedia.org/T413008, where, for reasons we did not understand, the SQLite databases used by the container-sync daemons became inconsistent between the three replicas, and we had to issue multiple swift DELETE requests to force Swift to realign them (more info in the task; the procedure is probably a one-off). If this appears to be the problem, it is best to reach out to the Swift service owners in SRE Data Persistence (e.g. by opening a Phabricator task tagged #sre-swift-storage).

FAQ

I need to delete a tag of a published image

This is not currently possible in the V2 Registry API. Republish the tag if the content is wrong, or ignore it.

I need to delete an image from the registry

You need to SSH into any registry instance and delete the objects that belong to the image from the Swift container. This should not be done unless there are good reasons for it (a security incident, for instance).

How to perform garbage collection

The registry will hold more layers than those referenced from manifests. If you want to delete orphaned blobs, log in to a registry instance and execute /usr/bin/docker-registry garbage-collect /etc/docker/registry/config.yml
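
Before deleting anything, it may help to see how many blobs a run would remove. The sketch below counts them from the output of a dry run. It assumes the installed registry version supports the upstream --dry-run flag and reports each candidate with a line containing "blob eligible for deletion"; the helper name count_eligible_blobs is hypothetical.

```shell
#!/bin/sh
# Sketch: count blobs that a garbage-collect dry run reports as deletable.
# Assumption: each candidate appears on a line containing
# "blob eligible for deletion".
count_eligible_blobs() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps that quiet.
    grep -c 'blob eligible for deletion' || true
}

# Hypothetical usage on a registry host:
#   /usr/bin/docker-registry garbage-collect --dry-run \
#       /etc/docker/registry/config.yml 2>&1 | count_eligible_blobs
```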

How to backup the backend swift container

If you need to modify the underlying Swift container, you may want to back up its content before changing replication options or anything else. This is an extremely dangerous operation and you should do it only as a last resort. When the registry writes to Swift it writes two kinds of objects, 'files' and 'segments'. They need to be backed up differently, because segments have extremely long filenames that make the swift CLI crash when using download/upload.

  • SSH into a server that has the Swift docker credentials and a lot of free disk space; Swift frontends are usually a good choice (ms-fe*)
  • Create a directory under /tmp like /tmp/backup
  • Move to /tmp/backup and run source /etc/swift/account_AUTH_docker.env; ionice -c3 swift download --skip-identical -p prefix/ --object-threads 3 --container-threads 3 SWIFT_CONTAINER. The swift download command will start downloading the content of the 'files' prefix to disk. You can run this multiple times to get updates.
  • When the download has finished, execute swift upload --skip-identical -c --object-threads 3 --segment-threads 3 SWIFT_BACKUP_CONTAINER files/. You can run this command multiple times.
  • Segments need a different approach, because their extremely long filenames make the swift CLI crash. cd into /tmp/backup and launch source /etc/swift/account_AUTH_docker.env; swift list -p segments SWIFT_SOURCE_CONTAINER > /tmp/filesa; swift list -p segments SWIFT_BACKUP_CONTAINER > /tmp/filesb; comm -23 <(sort /tmp/filesa) <(sort /tmp/filesb) | xargs -I '{}' -s 300000 bash -c "swift copy -d /SWIFT_BACKUP_CONTAINER SWIFT_SOURCE_CONTAINER '{}' && date". You can run this command multiple times to get updates.
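
The listing comparison in the segments step can be checked on its own before any copy is started. The following is a minimal sketch, assuming two files that hold plain swift list output for the source and backup containers; the function name missing_in_backup is hypothetical. It uses comm -23 so that only names present in the source listing and absent from the backup listing are printed.

```shell
#!/bin/sh
# Sketch: print object names found in the source listing ($1) but not in the
# backup listing ($2), one per line.
missing_in_backup() {
    src_sorted=$(mktemp) && bak_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted with the same collation.
    LC_ALL=C sort "$1" > "$src_sorted"
    LC_ALL=C sort "$2" > "$bak_sorted"
    # -23 suppresses lines unique to the backup and lines common to both.
    LC_ALL=C comm -23 "$src_sorted" "$bak_sorted"
    rm -f "$src_sorted" "$bak_sorted"
}

# Hypothetical usage with the listing files from the step above:
#   missing_in_backup /tmp/filesa /tmp/filesb | wc -l
```

An empty result means every listed segment already exists in the backup container.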

Known problems