
Portal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes/1.27 to 1.28 notes


Working etherpad: https://etherpad.wikimedia.org/p/k8s-1.27-to-1.28-upgrade

Prepare packages

Toolsbeta

  • get list of nodes
root@toolsbeta-test-k8s-control-10:~# for node in $(kubectl get nodes -o json | jq '.items[].metadata.name' -r); do echo "* [] $node"; done
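The output doubles as a ready-made checklist; it should look roughly like this (node names illustrative):

* [] toolsbeta-test-k8s-control-10
* [] toolsbeta-test-k8s-worker-nfs-5
* [] toolsbeta-test-k8s-ingress-10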

prep

  • [x] ssh into the bastion, clone toolforge-deploy, and run the functional tests in a loop so we can detect when/if things start failing and investigate. Some failures during the upgrade are expected as pods get evicted and rescheduled, but they should go away in subsequent loops of the test run:
raymond-ndibe@local:~$ ssh toolsbeta-bastion-6.toolsbeta.eqiad1.wikimedia.cloud
raymond-ndibe@toolsbeta-bastion-6:~$ git clone https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy.git
raymond-ndibe@toolsbeta-bastion-6:~$ while true; do toolforge-deploy/utils/run_functional_tests.sh -r; done
  • [x] run prepare upgrade cookbook
cloudcumin1001:~$ sudo cookbook wmcs.toolforge.k8s.prepare_upgrade --cluster-name toolsbeta --src-version 1.27.16 --dst-version 1.28.14 --task-id T362867

control nodes

  • run upgrade node cookbook
cloudcumin1001:~$ sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name toolsbeta --hostname <control_node_name>
  • check that services start healthy
  • depool control-<x> and control-<y> via haproxy and check that control-<z> (the one specified in the cookbook) is still doing ok by re-running the functional tests
ssh toolsbeta-test-k8s-haproxy-6.toolsbeta.eqiad1.wikimedia.cloud
toolsbeta-test-k8s-haproxy-6:~$ sudo puppet agent --disable "<user> k8s upgrade"
toolsbeta-test-k8s-haproxy-6:~$ sudo vim /etc/haproxy/conf.d/k8s-api-servers.cfg
toolsbeta-test-k8s-haproxy-6:~$ sudo systemctl reload haproxy
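Depooling is done by commenting out the server entries of the two control nodes being taken out of rotation; a minimal sketch of the edit, assuming the usual HAProxy backend/server layout (check the actual file first, the backend name and addresses here are placeholders):

backend k8s-api-servers
    server toolsbeta-test-k8s-control-10 <control-10-ip>:6443 check
    # server toolsbeta-test-k8s-control-11 <control-11-ip>:6443 check
    # server toolsbeta-test-k8s-control-12 <control-12-ip>:6443 check

Puppet is disabled above precisely so this manual edit is not reverted until the nodes are repooled.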

check:

toolsbeta-test-k8s-haproxy-6:~$ echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | grep k8s-api
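The commented-out servers should disappear from the output and the remaining rows should report UP. To show just the proxy, server, and status columns of the CSV (status is assumed to be field 18, per HAProxy's documented "show stat" format; adjust if the version in use differs):

toolsbeta-test-k8s-haproxy-6:~$ echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | awk -F, '/k8s-api/ {print $1, $2, $18}'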

revert:

toolsbeta-test-k8s-haproxy-6:~$ sudo puppet agent --enable
toolsbeta-test-k8s-haproxy-6:~$ sudo run-puppet-agent
toolsbeta-test-k8s-haproxy-6:~$ sudo systemctl reload haproxy
toolsbeta-test-k8s-haproxy-6:~$ echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | grep k8s-api

Issues:

  • We might want to upgrade the pause image:
W0905 14:29:47.959951  961216 checks.go:835] detected that the sandbox image "docker-registry.tools.wmflabs.org/pause:3.1" of the container runtime is inconsistent with that used by kubeadm. It is recommended that using "registry.k8s.io/pause:3.9" as the CRI sandbox image.
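If we decide to bump it, the sandbox image is configured in the container runtime rather than in kubeadm; assuming the nodes run containerd with the standard config path, the current value can be checked on a node with something like:

toolsbeta-test-k8s-control-10:~$ sudo grep sandbox_image /etc/containerd/config.toml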

toolsbeta-test-k8s-control-10

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-11 and -12 via haproxy, check that control-10 is still doing ok

toolsbeta-test-k8s-control-11

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-12 and -10 via haproxy, check that control-11 is still doing ok

toolsbeta-test-k8s-control-12

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-10 and -11 via haproxy, check that control-12 is still doing ok


worker nodes

  • run upgrade node cookbook for each
sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name toolsbeta --hostname <worker_node_name>
  • [x] toolsbeta-test-k8s-worker-nfs-5
  • [x] toolsbeta-test-k8s-worker-nfs-7
  • [x] toolsbeta-test-k8s-worker-nfs-8
  • [x] toolsbeta-test-k8s-worker-nfs-9
  • [x] toolsbeta-test-k8s-worker-12
  • [x] toolsbeta-test-k8s-worker-13


ingress nodes

  • run upgrade node cookbook for each
sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name toolsbeta --hostname <ingress_node_name>
  • [x] toolsbeta-test-k8s-ingress-10
  • [x] toolsbeta-test-k8s-ingress-11
  • [x] toolsbeta-test-k8s-ingress-9


cleanup

  • [x] remove downtime
  • [x] revert topic change
  • [x] enable puppet on toolsbeta-test-k8s-haproxy-6.toolsbeta.eqiad1.wikimedia.cloud
    toolsbeta-test-k8s-haproxy-6:~$ sudo puppet agent --enable
    

Tools

  • get list of nodes
root@tools-k8s-control-7:~# for node in $(kubectl get nodes -o json | jq '.items[].metadata.name' -r); do echo "* [] $node"; done

prep

  • [x] ssh into the bastion, clone toolforge-deploy, and run the functional tests in a loop so we can detect when/if things start failing and investigate. Some failures during the upgrade are expected as pods get evicted and rescheduled, but they should go away in subsequent loops of the test run:
    raymond-ndibe@local:~$ ssh login.toolforge.org
    raymond-ndibe@tools-bastion-13:~$ git clone https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy.git
    raymond-ndibe@tools-bastion-13:~$ while true; do toolforge-deploy/utils/run_functional_tests.sh -r; done
    
  • [x] run prepare upgrade cookbook
    ssh cloudcumin1001.eqiad.wmnet
    cloudcumin1001:~ $ sudo cookbook wmcs.toolforge.k8s.prepare_upgrade --cluster-name tools --src-version 1.27.16 --dst-version 1.28.14 --task-id T362867
    

control nodes

  • run upgrade node cookbook
cloudcumin1001:~$ sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name tools --hostname <control_node_name>
  • check that services start healthy
  • depool control-<x> and control-<y> via haproxy and check that control-<z> (the one specified in the cookbook) is still doing ok (the functional tests should still be passing)
ssh tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud
tools-k8s-haproxy-6:~$ sudo puppet agent --disable "<user> k8s upgrade"
tools-k8s-haproxy-6:~$ sudo nano /etc/haproxy/conf.d/k8s-api-servers.cfg
tools-k8s-haproxy-6:~$ sudo systemctl reload haproxy

check:

echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | grep k8s-api

revert:

tools-k8s-haproxy-6:~$ sudo puppet agent --enable
tools-k8s-haproxy-6:~$ sudo run-puppet-agent
tools-k8s-haproxy-6:~$ sudo systemctl reload haproxy

tools-k8s-control-7

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-8 and -9 via haproxy, check that control-7 is still doing ok

tools-k8s-control-8

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-7 and -9 via haproxy, check that control-8 is still doing ok

tools-k8s-control-9

  • [x] run upgrade node cookbook
  • [x] check that services start healthy
  • [x] depool control-7 and -8 via haproxy, check that control-9 is still doing ok

worker nodes

  • run the upgrade node cookbook for each; it's ok to do a couple in parallel (see the sketch after this list)
sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name tools --hostname <worker_node_name>
  • [x] tools-k8s-worker-102
  • [x] tools-k8s-worker-103
  • [x] tools-k8s-worker-105
  • [x] tools-k8s-worker-106
  • [x] tools-k8s-worker-107
  • [x] tools-k8s-worker-108
  • [x] tools-k8s-worker-nfs-1
  • [x] tools-k8s-worker-nfs-10
  • [x] tools-k8s-worker-nfs-11
  • [x] tools-k8s-worker-nfs-12
  • [x] tools-k8s-worker-nfs-13
  • [x] tools-k8s-worker-nfs-14
  • [x] tools-k8s-worker-nfs-16
  • [x] tools-k8s-worker-nfs-17
  • [x] tools-k8s-worker-nfs-19
  • [x] tools-k8s-worker-nfs-2
  • [x] tools-k8s-worker-nfs-21
  • [x] tools-k8s-worker-nfs-22
  • [x] tools-k8s-worker-nfs-23
  • [x] tools-k8s-worker-nfs-24
  • [x] tools-k8s-worker-nfs-26
  • [x] tools-k8s-worker-nfs-27
  • [x] tools-k8s-worker-nfs-3
  • [x] tools-k8s-worker-nfs-32
  • [x] tools-k8s-worker-nfs-33
  • [x] tools-k8s-worker-nfs-34
  • [x] tools-k8s-worker-nfs-35
  • [x] tools-k8s-worker-nfs-36
  • [x] tools-k8s-worker-nfs-37
  • [x] tools-k8s-worker-nfs-38
  • [x] tools-k8s-worker-nfs-39
  • [x] tools-k8s-worker-nfs-40
  • [x] tools-k8s-worker-nfs-41
  • [x] tools-k8s-worker-nfs-42
  • [x] tools-k8s-worker-nfs-43
  • [x] tools-k8s-worker-nfs-44
  • [x] tools-k8s-worker-nfs-45
  • [x] tools-k8s-worker-nfs-46
  • [x] tools-k8s-worker-nfs-47
  • [x] tools-k8s-worker-nfs-48
  • [x] tools-k8s-worker-nfs-5
  • [x] tools-k8s-worker-nfs-50
  • [x] tools-k8s-worker-nfs-53
  • [x] tools-k8s-worker-nfs-54
  • [x] tools-k8s-worker-nfs-55
  • [x] tools-k8s-worker-nfs-57
  • [x] tools-k8s-worker-nfs-58
  • [x] tools-k8s-worker-nfs-61
  • [x] tools-k8s-worker-nfs-65
  • [x] tools-k8s-worker-nfs-66
  • [x] tools-k8s-worker-nfs-67
  • [x] tools-k8s-worker-nfs-68
  • [x] tools-k8s-worker-nfs-69
  • [x] tools-k8s-worker-nfs-7
  • [x] tools-k8s-worker-nfs-70
  • [x] tools-k8s-worker-nfs-71
  • [x] tools-k8s-worker-nfs-72
  • [x] tools-k8s-worker-nfs-73
  • [x] tools-k8s-worker-nfs-74
  • [x] tools-k8s-worker-nfs-75
  • [x] tools-k8s-worker-nfs-76
  • [x] tools-k8s-worker-nfs-8
  • [x] tools-k8s-worker-nfs-9
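A sketch for working through the list a couple of nodes at a time, as mentioned above. The workers.txt file (one hostname per line) is hypothetical, and so is the assumption that the cookbook behaves well when driven non-interactively under xargs, so treat this as a starting point rather than a recipe:

cloudcumin1001:~$ # workers.txt: one tools-k8s-worker-* hostname per line (hypothetical file)
cloudcumin1001:~$ xargs -a workers.txt -P2 -I{} sudo cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T362867 --src-version 1.27.16 --dst-version 1.28.14 --cluster-name tools --hostname {}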

ingress nodes

  • [x] kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=2

run the upgrade node cookbook for each, using the same invocation as for the worker nodes with the ingress node hostname:

  • [x] tools-k8s-ingress-7
  • [x] tools-k8s-ingress-8
  • [x] tools-k8s-ingress-9
  • [x] revert afterwards: kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=3
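
To verify the revert, the controller should be back to three replicas spread over the ingress nodes:

kubectl -n ingress-nginx-gen2 get pods -o wide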

cleanup

  • [x] remove downtime
  • [x] revert topic change
  • [x] enable puppet on tools-k8s-haproxy-6.tools.eqiad1.wikimedia.cloud
    tools-k8s-haproxy-6:~$ sudo puppet agent --enable