User:JMeybohm/Docker-Registry-Stresstest

Potential bottlenecks

  • Swift is active/active in both DCs
  • Registry is only active in codfw
  • We do not have separate endpoints for read-write (docker-rw) and read-only (docker-ro) access to the registry
  • After pushing from CI, we could potentially wait until the images have been replicated in Swift
  • Swift <-> docker-registry: Probably fine, but it would be better if the registry could read from the DC-local Swift
  • docker-registry <-> docker clients: Potentially bad. The registry runs on ganeti with a shared 1GbE link, and the same link is also used to pull the images from Swift.
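
To get a feeling for why the shared 1GbE link is a concern, here is a back-of-the-envelope sketch. It only computes a lower bound (link fully available to pulls, no protocol overhead, no concurrent Swift reads on the same link). The image size used below is a made-up placeholder; the real size of the image is not stated in these notes.

```python
def min_pull_seconds(image_bytes, clients, link_bits_per_s=1e9):
    """Lower bound on the time for `clients` parallel pulls of the same
    image through one shared link of `link_bits_per_s` (default: 1GbE)."""
    total_bits = image_bytes * 8 * clients
    return total_bits / link_bits_per_s


# ASSUMPTION: hypothetical 2 GiB of compressed layers, for illustration only.
IMAGE_BYTES = 2 * 1024**3

for n in (1, 5, 19):
    print(f"{n:2d} parallel clients: at least {min_pull_seconds(IMAGE_BYTES, n):.0f} s")
```

The bound grows linearly with the number of clients, which is why caching on the registry nodes and keeping pulls inside a row look attractive.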

Actions?

  • We could potentially have one docker-registry per rack row, so docker-registry traffic would not leave rows (as scap proxy does)
  • Create a read-only docker registry discovery record that points to both DCs
  • Can we cache (more/at all) with nginx on the docker-registry nodes? DONE
  • What about client_body_buffer_size while pushing?


Tests

I'm running 3 sequential image pulls on each k8s node in codfw (see pulltiming.sh below), going from 1 host to 19 hosts in parallel.

Graphs recorded for the test runs (images not reproduced here):

  • Network per registry node (local nginx cache & dragonfly), 73 nodes
  • Network per registry node (local nginx cache & dragonfly)
  • Network per registry node (without local nginx cache)
  • Network per registry node (with local nginx cache)

Test steps

# cumin1001
HOSTS="kubernetes[2001-2017].codfw.wmnet,kubestage[2001-2002].codfw.wmnet"
# Copy the test script to all nodes
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --copy /home/jayme/pulltiming.sh --dest /home/jayme/

# Run the test on 1, 2, ..., 19 hosts in parallel; the run on n hosts is labelled "p<n>"
for n in $(seq 1 19); do
    hosts=$(nodeset --pick="$n" -f "$HOSTS")
    sudo cumin --force "$hosts" "/home/jayme/pulltiming.sh p${n}"
done

# Grab results from nodes
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --rcopy /home/jayme/pulltiming --dest /home/jayme/pulltiming/
sudo chown jayme:wikidev -R pulltiming


# local
ssh cumin1001.eqiad.wmnet "tar cfz - pulltiming" | tar xfz -
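
For reference, the host expression above expands to 19 hosts, and the whole sweep adds up to quite a few pulls. The sketch below is a simplified stand-in for what `nodeset` does, written only to show the arithmetic; `expand` is a hypothetical helper that handles a single numeric range per pattern and is not part of ClusterShell.

```python
import re

HOSTS = "kubernetes[2001-2017].codfw.wmnet,kubestage[2001-2002].codfw.wmnet"


def expand(nodeset):
    """Expand comma-separated 'prefix[lo-hi]suffix' patterns into host names.

    Simplified on purpose: one numeric range per pattern, no comma lists
    inside the brackets.
    """
    hosts = []
    for pattern in nodeset.split(","):
        m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
        if m is None:
            hosts.append(pattern)
            continue
        prefix, lo, hi, suffix = m.groups()
        width = len(lo)  # keep zero padding, if any
        for i in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{i:0{width}d}{suffix}")
    return hosts


all_hosts = expand(HOSTS)
pulls_per_host = 3  # default number of iterations in pulltiming.sh
# Step n of the sweep runs on n hosts, each doing 3 sequential pulls.
total_pulls = sum(n * pulls_per_host for n in range(1, len(all_hosts) + 1))
print(len(all_hosts), "hosts,", total_pulls, "pulls in total")
```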

pulltiming.sh

#!/bin/bash

REPO=docker-registry.discovery.wmnet
IMAGE=restricted/mediawiki-multiversion
TAG=2021-05-14-185433-publish

test_name=$1
iterations=${2:-3}
repo_uri="https://${REPO}"
# Craft the AuthConfig object needed to authenticate to docker-registry.discovery.wmnet
config_json=/var/lib/kubelet/config.json
if sudo test -r "/root/.docker/config.json"; then
    config_json=/root/.docker/config.json
fi
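# "// empty" makes jq print nothing (instead of the string "null") if no credentials are stored
basicauth=$(sudo cat "${config_json}" | jq -r ".auths.\"${repo_uri}\".auth // empty" | base64 -d)
if [ -z "$basicauth" ]; then
    echo "Credentials for docker registry not found, aborting"
    exit 1
fi
# Split "user:password" at the first colon only, so passwords containing ':' or spaces survive
IFS=: read -r auth_user auth_pass <<< "$basicauth"
# Let jq build the JSON so that special characters in the credentials are escaped properly
auth=$(jq -cn --arg u "$auth_user" --arg p "$auth_pass" --arg s "$repo_uri" \
    '{username: $u, password: $p, serveraddress: $s}' | base64 -w 0)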

cd /home/jayme || exit 1
mkdir -p ./pulltiming/

if [ -n "$test_name" ]; then
    outfile_base="${HOSTNAME}_${test_name}_$(date +%s)"
else
    outfile_base="${HOSTNAME}_$(date +%s)"
fi
for idx in $(seq "$iterations"); do
    outfile="${outfile_base}_${idx}"
    sudo docker rmi "${REPO}/${IMAGE}:${TAG}" > /dev/null 2>&1
    sudo curl -s --unix-socket /var/run/docker.sock -XPOST \
        -d "fromImage=${REPO}/${IMAGE}&tag=${TAG}" \
        -H "X-Registry-Auth: ${auth}" \
        'http://docker/v1.18/images/create' | \
            jq -c --unbuffered '. + {time: now}' > "./pulltiming/${outfile}.json"
done

chown jayme:wikidev -R ./pulltiming/

Script to parse/process the data (pulltiming.py)

https://phabricator.wikimedia.org/P15954
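
The paste is not reproduced here. As a rough illustration only (this is not the pulltiming.py from the paste), the files written by pulltiming.sh could be processed along these lines. Each file contains one JSON object per line, as streamed by the Docker API, with a `time` field added by jq. The sketch assumes that failures show up as an `error` key in the stream and that host names contain no underscore; `parse_name` and `pull_duration` are hypothetical helper names.

```python
import json
import re

# <hostname>[_<test_name>]_<epoch>_<idx>.json, as written by pulltiming.sh
FILENAME_RE = re.compile(
    r"(?P<host>.+?)_(?:(?P<test>[^_]+)_)?(?P<start>\d+)_(?P<idx>\d+)\.json"
)


def parse_name(filename):
    """Split a result file name into host, test name, start epoch and index."""
    m = FILENAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "host": m["host"],
        "test": m["test"],  # None if the script ran without a test name
        "start": int(m["start"]),
        "idx": int(m["idx"]),
    }


def pull_duration(lines):
    """Seconds between the first and the last progress event of one pull."""
    times = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "error" in event:  # ASSUMPTION: failed pulls report an "error" key
            raise RuntimeError(event["error"])
        times.append(event["time"])
    if not times:
        raise ValueError("no events found")
    return max(times) - min(times)


# Made-up sample events, shaped like the jq output of pulltiming.sh
sample = [
    '{"status": "Pulling from restricted/mediawiki-multiversion", "time": 100.0}',
    '{"status": "Downloading", "id": "abc", "time": 101.5}',
    '{"status": "Pull complete", "id": "abc", "time": 112.25}',
]
print(parse_name("kubernetes2001_p5_1621000000_2.json"), pull_duration(sample))
```

Note that this slightly underestimates the real pull time, since the clock only starts with the first event received from the Docker daemon and not with the request itself.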