User:JMeybohm (WMF)/Docker-Registry-Stresstest
Potential bottlenecks
- Swift is active/active in both DCs
- Registry is only active in codfw
- We do not have different endpoints for docker-rw and docker-ro registry
- We could potentially wait for the images to be replicated in Swift after pushing from CI
- Swift <-> docker-registry: Probably fine, but it would be better if the registry could read from Swift in its local DC
- docker-registry <-> docker clients: Potentially bad. The registry VMs share a 1GbE link on Ganeti, and that same link is also used to pull the images from Swift.
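To get a feeling for how quickly a shared 1GbE link saturates, here is a rough back-of-the-envelope sketch. The function name `min_pull_seconds` and the 2000 MB image size are made up for illustration (the real image size is not recorded on this page), and the estimate ignores protocol overhead, caching and layer reuse, so it is a lower bound only.

```bash
#!/bin/bash
# Lower bound (in seconds) for N clients each pulling one image through a
# single shared link. Hypothetical helper, not part of the test setup.
# Usage: min_pull_seconds <image_size_MB> <parallel_clients> [link_Mbit_per_s]
min_pull_seconds() {
  local image_mb=$1 clients=$2 link_mbit=${3:-1000}
  # MB -> Mbit is a factor of 8; integer division rounds down.
  echo $(( image_mb * 8 * clients / link_mbit ))
}

# Assumed 2000 MB image, 19 clients, one 1GbE link:
min_pull_seconds 2000 19   # prints 304
```

If the registry also has to fetch the layers from Swift over the same link, the real number would be higher still.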
Actions?
- We could potentially have one docker-registry per rack row, so docker-registry traffic would not leave rows (as scap proxy does)
- Create a read-only docker registry discovery record that points to both DCs
- Can we cache (more/at all) with nginx on the docker-registry nodes? DONE
- What about client_body_buffer_size while pushing?
Tests
I'm running 3 sequential image pulls on each k8s node in codfw (see pulltiming.sh below), going from 1 host to 19 hosts in parallel.
- With 2 docker-registry nodes: https://people.wikimedia.org/~jayme/pulltiming_2_registries.html
- With 6 docker-registry nodes: https://people.wikimedia.org/~jayme/pulltiming_6_registries.html
- With 6 docker-registry nodes using local nginx cache: https://people.wikimedia.org/~jayme/pulltiming_6_registries_cached.html
  - This one has missing data points at 7, 12, 14, 18 and 19 parallel nodes
- With 2 docker-registry nodes, local nginx cache (this is now the default) and dragonfly: https://people.wikimedia.org/~jayme/pulltiming_dragonfly.html
  - Same setup, but pulling from max. 73 nodes in parallel: https://people.wikimedia.org/~jayme/pulltiming_dragonfly_73_nodes.html
Test steps
# cumin1001
HOSTS="kubernetes[2001-2017].codfw.wmnet,kubestage[2001-2002].codfw.wmnet"
# Quote the node sets so the shell does not try to glob the [..] ranges
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --copy /home/jayme/pulltiming.sh --dest /home/jayme/
# Run the test on 1, 2, ..., 19 picked hosts in parallel
for n in $(seq 1 19); do hosts=$(nodeset --pick=$n -f "$HOSTS"); sudo cumin --force "$hosts" "/home/jayme/pulltiming.sh p${n}"; done
# Grab results from nodes
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --rcopy /home/jayme/pulltiming --dest /home/jayme/pulltiming/
sudo chown jayme:wikidev -R pulltiming
# local
ssh cumin1001.eqiad.wmnet "tar cfz - pulltiming" | tar xfz -
pulltiming.sh
#!/bin/bash
REPO=docker-registry.discovery.wmnet
IMAGE=restricted/mediawiki-multiversion
TAG=2021-05-14-185433-publish
test_name=$1
iterations=${2:-3}
repo_uri="https://${REPO}"
# Craft the AuthConfig object needed to authenticate to docker-registry.discovery.wmnet
config_json=/var/lib/kubelet/config.json
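if sudo test -r "/root/.docker/config.json"; then
    config_json=/root/.docker/config.json
fi
# "// empty" keeps a missing entry from turning into the literal string "null"
basicauth=$(sudo cat "${config_json}" | jq -r --arg uri "${repo_uri}" '.auths[$uri].auth // empty' | base64 -d)
if [ -z "$basicauth" ]; then
    echo "Credentials for docker registry not found, aborting"
    exit 1
fi
# Split "user:password" at the first colon only, so passwords may contain ':' or spaces
username=${basicauth%%:*}
password=${basicauth#*:}
# Let jq build the JSON so special characters in the credentials are escaped
auth=$(jq -cn --arg u "$username" --arg p "$password" --arg s "$repo_uri" \
    '{username: $u, password: $p, serveraddress: $s}' | base64 -w 0)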
cd /home/jayme
mkdir -p ./pulltiming/
if [ -n "$test_name" ]; then
    outfile_base="${HOSTNAME}_${test_name}_$(date +%s)"
else
    outfile_base="${HOSTNAME}_$(date +%s)"
fi
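# Honour the optional second argument (defaults to 3 pulls)
for idx in $(seq "$iterations"); do
    outfile="${outfile_base}_${idx}"
    # Remove the image first so every iteration is a full pull
    sudo docker rmi "${REPO}/${IMAGE}:${TAG}" > /dev/null 2>&1
    # Pull via the Docker API and stamp every progress message with the current time
    sudo curl -s --unix-socket /var/run/docker.sock -XPOST \
        -d "fromImage=${REPO}/${IMAGE}&tag=${TAG}" \
        -H "X-Registry-Auth: ${auth}" \
        'http://docker/v1.18/images/create' | \
        jq -c --unbuffered '. + {time: now}' > "./pulltiming/${outfile}.json"
done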
chown jayme:wikidev -R ./pulltiming/
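Each result file holds one JSON object per line, each with the `time` field added by pulltiming.sh, so a pull's duration can be approximated as the difference between the largest and smallest timestamp. The following is a minimal sketch of that idea, not the tooling used to produce the graphs linked above. The function name `pull_duration` and the default results directory are assumptions. Time spent before the first progress message arrives is not captured.

```bash
#!/bin/bash
# Print the approximate duration (seconds) of one recorded pull.
# Assumes every line of the file is a JSON object with a numeric "time" field.
pull_duration() {
  jq -s 'if length == 0 then empty else (map(.time) | max - min) end' "$1"
}

# Summarise all result files below a directory (default: ./pulltiming).
results_dir=${1:-./pulltiming}
find "$results_dir" -name '*.json' 2>/dev/null | sort | while read -r f; do
  printf '%s\t%s\n' "$(basename "$f" .json)" "$(pull_duration "$f")"
done
```

Because the file names contain the hostname, the test name (p1 to p19) and the iteration index, the output can be grouped by test name to compare pull times across the different levels of parallelism.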