Incident documentation/20190716-docker-registry

From Wikitech

document status: in-review

Summary

Some swift containers were deleted intentionally, and the deletion had unexpected consequences on the rest of the docker-registry. As a result some layers went missing from the registry; at the very least, the ones listed in https://phabricator.wikimedia.org/T228196.

The root cause is the intentional deletion of the docker_registry_eqiad swift container in eqiad. This container was configured to synchronize its content with the docker_registry_codfw container in codfw. When it was deleted, the container-to-container synchronization triggered a spike of DELETEs. To illustrate the issue, let's follow a concrete missing layer, 0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2
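For context, container-to-container synchronization is configured on the swift container itself. The sketch below shows how such a pairing is typically set up with the standard swift CLI; the realm, account and key names are hypothetical placeholders, not WMF's actual configuration.

```shell
# Hypothetical sketch of swift container sync setup (all names are placeholders).
# Point the eqiad container at its codfw peer; both sides must share a sync key.
swift post docker_registry_eqiad \
    -t '//realm_name/codfw/AUTH_docker/docker_registry_codfw' \
    -k 'shared-secret-key'
# Once paired, object PUTs and DELETEs on one side are replayed on the other,
# which is why deleting objects in eqiad cascaded as DELETEs against codfw.
```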

If we download all the swift logs from 07/16 to 07/18 we can track what happened. First, the number of actions recorded in the logs that involve that layer:


swift_activity_0717:45

swift_activity_0718:7

swift_activity_0719:0

Number of DELETEs:

swift_activity_0717:38

swift_activity_0718:0

swift_activity_0719:0

Number of PUTs:

swift_activity_0717:7

swift_activity_0718:7

swift_activity_0719:0

The DELETEs fall into two timeframes: one that started when the swift container was deleted, and another performed by swift several hours later:

grep '.*DELETE.*0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2.*' swift_activity_071* | cut -f3 -d' ' | sort -u

14:33:07

21:01:21

22:28:20

22:28:21

22:28:24

22:28:28

22:28:33

22:28:38
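The counts and timestamps above come from plain grep over the downloaded logs. Below is a self-contained sketch of the same analysis, run against two fabricated log lines; the real swift_activity_07* files follow swift's proxy-server log format, which these lines only approximate.

```shell
# Fabricated sample log (the format only approximates swift's real logs).
LAYER=0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2
cat > /tmp/swift_activity_0717 <<EOF
Jul 17 14:33:07 proxy-server DELETE /v1/AUTH_x/docker_registry/${LAYER} 204
Jul 17 22:28:20 proxy-server PUT /v1/AUTH_x/docker_registry/${LAYER} 201
EOF
# Per-file counts, as in the tables above
grep -c "$LAYER" /tmp/swift_activity_0717          # all actions: 2
grep -c "DELETE.*$LAYER" /tmp/swift_activity_0717  # DELETEs: 1
grep -c "PUT.*$LAYER" /tmp/swift_activity_0717     # PUTs: 1
# Unique timestamps of the DELETEs (field 3 in this fabricated format)
grep "DELETE.*$LAYER" /tmp/swift_activity_0717 | cut -f3 -d' ' | sort -u
```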

The varying number of PUTs corresponds to the several attempts to recover layers from backup.

Impact

Some CI jobs failed; other than that, there was no known impact. We were lucky that no deploys were done in k8s during the period, otherwise production services would have been affected.

Detection

A human report was our detection method in this case; it is not yet clear how we could have caught this automatically.

Timeline

All times in UTC. Date format is DD/MM/YYYY. Detection happened in #wikimedia-operations.

  • 16/07/2019 14:26 deleted the docker_registry_eqiad container on the eqiad swift cluster, and the docker_registry and docker_registry_codfw containers on codfw
  • 16/07/2019 ~16:40: report from tarrow about "filesystem layer verification failed for digest" for many images from docker-registry.wikimedia.org
  • 16/07/2019 20:00 releng triggered a republish of releng images
  • 16/07/2019 22:45 found a backup on ms-fe2005; uploading only the blobs should regenerate the old images
  • 16/07/2019 23:00 swift upload ended.
  • 17/07/2019 00:04 rebuild of releng images completed
  • 17/07/2019 07:24 reports of images not working again.
  • 17/07/2019 09:00 reuploaded layers from backup.
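The re-upload steps in the timeline would look roughly like the following; the container name matches the one above, but the local backup path is a placeholder, so treat this as a sketch rather than the exact commands used.

```shell
# Hypothetical sketch of the recovery re-upload (local path is a placeholder).
# From the frontend holding the backup, push the blob objects back into the
# registry container; --changed skips objects whose checksum already matches.
cd /srv/registry-backup            # placeholder backup location
swift upload --changed docker_registry_codfw blobs/
```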


Conclusions

When manipulating swift containers that use container-to-container synchronization we should be extremely cautious, as the consequences can last for hours if not days.

List of improvements:


  • We need better monitoring of container-to-container synchronization in swift; it would be useful to have a metric around failures of the synchronization process for operations done on the docker-registry swift containers.
  • We need to improve our docker image rebuild process for disaster recovery; the image rebuild took several hours.
  • We need to improve the docker registry documentation to include more runbooks and procedures for better diagnostics.
  • We need to rethink our golden images approach: the moment one golden image is truncated, almost all images are affected.
  • Keep a backup of the swift container in our backup system.

What went well?

  • Cached images on CI and kubernetes nodes helped avoid impact for end users.
  • Incident response?

What went poorly?

  • Lack of monitoring in swift container-to-container synchronization.
  • When rebuilding releng docker images there was a fear of inadvertent software upgrades (so it is no longer a rebuild but a new image).
  • Rebuilding process is slow.
  • No page was triggered, as monitoring checks the manifest but does not pull an image.
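The deeper check monitoring was missing is, at its core, digest verification of the blob bytes themselves: the Docker Registry HTTP API v2 serves layer blobs at /v2/&lt;name&gt;/blobs/&lt;digest&gt;, so a probe can download a layer and compare its sha256 to the digest from the manifest. A minimal self-contained sketch of just the verification step, using a local file in place of a downloaded blob:

```shell
# Sketch of layer verification: a fake local "blob" stands in for a layer
# fetched from /v2/<name>/blobs/<digest>. A truncated blob would not match.
printf 'layer-bytes' > /tmp/blob
EXPECTED="sha256:$(printf 'layer-bytes' | sha256sum | cut -f1 -d' ')"
ACTUAL="sha256:$(sha256sum < /tmp/blob | cut -f1 -d' ')"
if [ "$ACTUAL" = "$EXPECTED" ]; then
    echo "layer OK"
else
    echo "layer truncated or missing"
fi
```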

Where did we get lucky?

  • Having a backup of the docker registry container on a swift frontend greatly mitigated the incident, as we were able to re-upload missing layers and fix truncated images.

Links to relevant documentation

Actionables

  • Sync boron state of /srv/production-images with repo. [ done ]
  • File some bugs against docker-pkg.
  • Educate about pinning packages in docker-pkg templates; this will help a lot when rebuilding templates.
  • Make a bacula recipe for backing up the docker_registry_codfw swift container. [create phab task]
  • Get metrics about swift replication. [pending, create phab task]