Docker-registry-runbook


Summary

This page covers the docker registry HA service. If you are looking for the old registry that was hosted on darmstadtium.eqiad.wmnet, this is not the page you are looking for.

Owner: Service Ops Team
Status: ACTIVE
Dependencies: Swift, PyBal
Services that depend on this one: Kubernetes, CI


Architecture

Service Level Objectives

Service level objective 1: 95% of get manifest or tag operations will complete in less than 2s

  • Service level indicator 1: measured per DC; only the active DC is taken into account for measurements [graph link]
  • Checked every month, or at least every quarter

Service level objective 2: 95% of push manifest or tag operations will complete in less than 3s

  • Service level indicator 1: measured per DC; only the active DC is taken into account for measurements [graph link]
  • Checked every quarter or month

Service level objective 3: registry is serving content (at least in read-only mode, pulling images) 99% of the time

  • Service level indicator 1: the ratio of 5XX responses to 2XX responses is less than 1%
  • Checked every quarter or month
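
The objectives above reduce to simple arithmetic on request durations and response counts. The following is a minimal sketch of that arithmetic, not the query behind the linked graphs: the function names are hypothetical, and it assumes you have already extracted per-request durations (in seconds, one per line) and 5XX/2XX counts for the active DC.

```shell
# Hypothetical helpers, not part of the registry tooling.

# Succeeds if at least 95% of the durations read from stdin (seconds, one per
# line) are below the limit given as $1 (2 for SLO 1, 3 for SLO 2).
meets_latency_slo() {
  awk -v limit="$1" '
    { n++; if ($1 < limit) ok++ }
    END { exit !(n > 0 && ok * 100 >= n * 95) }
  '
}

# Succeeds if the 5XX count ($1) divided by the 2XX count ($2) is below 1%.
# Integer form: errors / successes < 1/100  <=>  errors * 100 < successes.
# No 2XX traffic at all is treated as not meeting the objective.
error_ratio_ok() {
  [ $(( $1 * 100 )) -lt "$2" ]
}
```

For example, `error_ratio_ok 5 1000` succeeds (0.5%), while `error_ratio_ok 20 1000` fails (2%).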

On-call guide

Swift is down or content has become corrupt

If Swift content is gone, the registry cannot serve images or write new ones. Focus first on the Swift outage; once it is resolved, check whether the content is still there or whether you need to rebuild and republish every image. One possible mitigation is to cache the registry's most frequently requested content on the caching layer (Varnish or ATS), which would still allow pulling the most used images.

Redis is down

Redis is only used as a blob cache. If it is down, pulling and pushing images will be slower, but the registry will continue to serve content.

Registry is down in one DC: how to fail over to the passive one?

Use the common pool/depool operations. Keep in mind that clients must use the discovery DNS entry, and that Swift replication is usually slow: an image pushed just before the outage has probably not been replicated to the other DC yet.

Users should trigger a rebuild/repush or cache the image locally to avoid issues.

Cannot pull some layer/image, or it looks corrupt: how to debug?

Check swift replication

  • ssh to a registry host in eqiad and, as root, run source /etc/swift/account_docker\:registry.env && swift stat docker_registry_eqiad
  • ssh to a registry host in codfw and, as root, run source /etc/swift/account_docker\:registry.env && swift stat docker_registry_codfw
  • Compare the number of objects and the total size. Replication is slow, especially after a spike of activity.
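
The comparison in the last step can be scripted once the two swift stat outputs have been saved to files. This is a minimal sketch with hypothetical function names; it assumes the swift CLI prints "Objects:" and "Bytes:" lines in its stat output, and that you have copied both outputs to the same host.

```shell
# Hypothetical helpers for comparing two saved `swift stat` outputs.

# Print the value of one field (e.g. Objects or Bytes) from `swift stat`
# output read on stdin. Field names are assumed to be followed by ": ".
stat_field() {
  awk -F': *' -v key="$1" '
    { name = $1; sub(/^ +/, "", name) }
    name == key { print $2; exit }
  '
}

# Print how far the second stat file ($2) is behind the first ($1),
# in objects and in bytes.
stat_diff() (
  for field in Objects Bytes; do
    a=$(stat_field "$field" < "$1")
    b=$(stat_field "$field" < "$2")
    echo "$field: $((a - b))"
  done
)
```

With the eqiad output saved as eqiad.stat and the codfw output as codfw.stat (both file names are placeholders), `stat_diff eqiad.stat codfw.stat` prints the object and byte difference. A difference that shrinks over repeated runs suggests replication is still catching up.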

Replication seems synced but users/monitoring report problems pulling images

  • Check the registry logs; they should report which image/manifest/tag is failing
  • Increase the log level of the registry to debug: edit the registry configuration in /etc/docker/registry/config.yml so that the log key contains the following snippet, then restart the registry with systemctl restart docker-registry
log:
  level: debug
  • The registry will then log which swift or redis calls are failing, and you will need to dig further into swift or redis from there. After a swift failure, republishing the affected images usually helps
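
With the log level at debug the output gets noisy, so a small filter for error lines can help. This is a sketch only: the function name is hypothetical, and it assumes the registry writes its default text log format, in which every line carries a level=<name> field.

```shell
# Hypothetical helper: keep only error-level (or worse) lines from registry
# log output read on stdin. Assumes lines contain a level=<name> field.
registry_errors() {
  grep -E 'level=(error|fatal|panic)'
}
```

If the service logs to the systemd journal (an assumption, not confirmed on this page), it could be used as journalctl -u docker-registry | registry_errors.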

FAQ

I need to delete a tag of a published image

This is not currently possible with the V2 Registry API. If the content is wrong, republish the tag; otherwise ignore it.

I need to delete an image from the registry

You need to ssh to any registry instance and delete the objects that belong to the image from the swift container. Do this only when there is a good reason (a security incident, for instance).

How to perform garbage collection

Over time the registry accumulates layers that are no longer referenced by any manifest. To delete these orphaned blobs, log in to a registry instance and run /usr/bin/docker-registry garbage-collect /etc/docker/registry/config.yml
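
Before deleting anything it can be useful to see what garbage collection would remove. The sketch below wraps the command above in two hypothetical functions; it assumes the packaged docker-registry binary is the upstream registry and supports the upstream --dry-run flag of garbage-collect. Upstream also recommends running garbage collection only while the registry is read-only or stopped, since blobs uploaded during the run may be deleted.

```shell
# Paths are the ones used on this page; both can be overridden from the
# environment. The function names are hypothetical.
REGISTRY_BIN=${REGISTRY_BIN:-/usr/bin/docker-registry}
REGISTRY_CONFIG=${REGISTRY_CONFIG:-/etc/docker/registry/config.yml}

# Report what garbage collection would delete, without deleting anything.
# Assumes the binary accepts the upstream --dry-run flag.
gc_dry_run() {
  "$REGISTRY_BIN" garbage-collect --dry-run "$REGISTRY_CONFIG"
}

# Actually delete blobs that no manifest references.
gc_run() {
  "$REGISTRY_BIN" garbage-collect "$REGISTRY_CONFIG"
}
```

Run gc_dry_run first and review its output, then gc_run once the list looks reasonable.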

How to back up the backend swift container

If you need to modify the underlying swift container (replication options or anything else), you may want to back up its content first. Modifying the container is extremely dangerous and should be a last resort. When the registry writes to swift it writes two kinds of objects, 'files' and 'segments'. They need to be backed up differently, because segments have extremely long filenames that make the swift CLI crash on download/upload.

  • ssh into a server that has the swift docker credentials and a lot of free disk space; swift frontends are usually a good choice (ms-fe*)
  • Create a directory under /tmp like /tmp/backup
  • Move to /tmp/backup and run source /etc/swift/account_AUTH_docker.env; ionice -c3 swift download --skip-identical -p files/ --object-threads 3 --container-threads 3 SWIFT_CONTAINER. This downloads the content under the 'files' prefix to disk. You can run it multiple times to pick up updates.
  • When the download has finished, run swift upload --skip-identical -c --object-threads 3 --segment-threads 3 SWIFT_BACKUP_CONTAINER files/. You can run this command multiple times.
  • Segments need different handling, because their extremely long filenames make the swift CLI crash. cd into /tmp/backup and run source /etc/swift/account_AUTH_docker.env; swift list -p segments SWIFT_SOURCE_CONTAINER > /tmp/filesa; swift list -p segments SWIFT_BACKUP_CONTAINER > /tmp/filesb; comm -23 <(sort /tmp/filesa) <(sort /tmp/filesb) | xargs -I '{}' -s 300000 bash -c "swift copy -d /SWIFT_BACKUP_CONTAINER SWIFT_SOURCE_CONTAINER '{}' && date". This copies, server side, only the segments that exist in the source container but not yet in the backup. You can run this command multiple times to get updates.
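
The segment step relies on a set difference between two object listings. The following sketch isolates that part so it can be checked before any copy is started; the function name is hypothetical, and the two arguments are listing files such as /tmp/filesa and /tmp/filesb from the step above.

```shell
# Hypothetical helper: print object names that appear in the source listing
# ($1) but not in the backup listing ($2). Each file holds one object name
# per line, as written by `swift list`.
missing_objects() (
  src_sorted=$(mktemp)
  dst_sorted=$(mktemp)
  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort "$1" > "$src_sorted"
  LC_ALL=C sort "$2" > "$dst_sorted"
  # -2 drops lines unique to the backup, -3 drops lines present in both,
  # leaving only the names that still have to be copied.
  LC_ALL=C comm -23 "$src_sorted" "$dst_sorted"
  rm -f "$src_sorted" "$dst_sorted"
)
```

For example, missing_objects /tmp/filesa /tmp/filesb | wc -l shows how many segments remain to be copied; when it prints 0, the segment backup is up to date with the listing.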

Known problems