Portal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes
This document only applies to a kubeadm-managed cluster deployed as described in Portal:Toolforge/Admin/Kubernetes/Deploying.
Prepare upgrade
Create an upgrade task
If there isn't one already, create a new upgrade task. You can use phab:T359641 as a template; fill in the sections with the new information.
For the components, make sure you check compatibility in the toolforge-deploy repo.
Kubernetes changelog
You, or someone else with a good understanding of everything that runs inside our Kubernetes cluster, should read through the upstream Kubernetes release notes and changelog for the release we're upgrading to.
Also, look at the deprecated API call dashboard for the target version. It does not tell you what is making those requests, only whether they exist. (They might be coming from inside the control plane!)
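If you want to double-check from inside the cluster, the apiserver exposes a counter for deprecated API requests. A minimal sketch, assuming admin kubectl access from a control node (the metric name comes from upstream Kubernetes, not anything specific to our setup):
# each series is labelled with the deprecated group/version/resource and the release it is removed in
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis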
Third-party components
You need to check that all of the third-party components listed in the toolforge-deploy repo are compatible with the new version we're upgrading to. If not, upgrade them to a release that is compatible with both the current and the new version.
Etcd
Also check that the etcd version we run is supported by the new Kubernetes release.
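To see what we are currently running, a quick sketch (the package name is an assumption and may differ between Debian releases):
# on one of the etcd nodes, check the installed package version
dpkg -l 'etcd*'
# compare it against the etcd versions listed as supported in the upstream release notes for the target Kubernetes version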
Managing packages
We mirror the Kubernetes APT repository to reprepro, in a component named thirdparty/kubeadm-k8s-X-YY. Generally speaking, you can copy-paste the component and update stanzas for the current version and adjust the version numbers. Remember to also update the thirdparty/helm3 component to point to the new kubeadm one.
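Once the new component and update stanzas are in place, a rough sketch of pulling the packages in on the apt host (the distribution codename here is an assumption; follow the existing reprepro workflow):
# dry run: show what the new component would pull in
reprepro checkupdate bookworm-wikimedia
# mirror the packages
reprepro update bookworm-wikimedia
# verify the expected versions are now available
reprepro ls kubeadm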
Upgrade lima-kilo
You need to update the node image used in lima-kilo.
Announce user-facing changes
Upgrade a cluster
Begin upgrade
- Run the wmcs.toolforge.k8s.prepare_upgrade cookbook:
user@cloudcumin1001:~$ sudo cookbook wmcs.toolforge.k8s.prepare_upgrade --help
usage: cookbooks.wmcs.toolforge.k8s.prepare_upgrade [-h] --cluster-name {tools,toolsbeta} [--task-id TASK_ID] [--no-dologmsg] --src-version SRC_VERSION --dst-version DST_VERSION
WMCS Toolforge Kubernetes - prepares a cluster for upgrading
Usage example:
cookbook wmcs.toolforge.k8s.prepare_upgrade \
--cluster-name toolsbeta \
--src-version 1.22.17 \
--dst-version 1.23.15
optional arguments:
-h, --help show this help message and exit
--cluster-name {tools,toolsbeta}
cluster to work on (default: None)
--task-id TASK_ID Id of the task related to this operation (ex. T123456). (default: None)
--no-dologmsg To disable dologmsg calls (no SAL messages on IRC). (default: False)
--src-version SRC_VERSION
Old version to upgrade from. (default: None)
--dst-version DST_VERSION
New version to migrate to. (default: None)
- Downtime the project on metricsinfra:
  - open https://prometheus-alerts.wmcloud.org
  - click the bell icon on the top right
  - add a filter on the project label
  - add a few hours of duration
  - add a reason
  - click save
- If this is a user-visible cluster, update the topic on the #wikimedia-cloud IRC channel from "Status: Ok" to "Status: upgrading Toolforge k8s"
Upgrade control nodes
Run the worker upgrade cookbook (wmcs.toolforge.k8s.worker.upgrade) for the first control node.
usage: cookbook [GLOBAL_ARGS] wmcs.toolforge.k8s.worker.upgrade [-h] --cluster-name {tools,toolsbeta} [--task-id TASK_ID] [--no-dologmsg] --hostname HOSTNAME --src-version SRC_VERSION --dst-version DST_VERSION
WMCS Toolforge - Upgrade a Kubernetes worker node
Usage example:
cookbook wmcs.toolforge.k8s.worker.upgrade \
--cluster-name toolsbeta \
--hostname toolsbeta-test-worker-4 \
--src-version 1.22.17 \
--dst-version 1.23.15
options:
-h, --help show this help message and exit
--cluster-name {tools,toolsbeta}
cluster to work on (default: None)
--task-id TASK_ID Id of the task related to this operation (ex. T123456). (default: None)
--no-dologmsg To disable dologmsg calls (no SAL messages on IRC). (default: False)
--hostname HOSTNAME Host name of the node to upgrade. (default: None)
--src-version SRC_VERSION
Old version to upgrade from. (default: None)
--dst-version DST_VERSION
New version to migrate to. (default: None)
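For example, upgrading the first control node might look like this (the hostname and versions are made up; substitute the real ones from your upgrade task):
user@cloudcumin1001:~$ sudo cookbook wmcs.toolforge.k8s.worker.upgrade \
    --cluster-name toolsbeta \
    --hostname toolsbeta-test-control-1 \
    --src-version 1.22.17 \
    --dst-version 1.23.15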
On the first control node, the cookbook will ask you to approve the upgrade plan. Save this output in case it's needed for later troubleshooting.
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/versions] Latest stable version: v1.15.1
[upgrade/versions] Latest version in the v1.15 series: v1.15.1
External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT CURRENT AVAILABLE
Etcd 3.2.26 3.3.10
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT CURRENT AVAILABLE
Kubelet 5 x v1.15.0 v1.15.1
Upgrade to the latest version in the v1.15 series:
COMPONENT CURRENT AVAILABLE
API Server v1.15.0 v1.15.1
Controller Manager v1.15.0 v1.15.1
Scheduler v1.15.0 v1.15.1
Kube Proxy v1.15.0 v1.15.1
CoreDNS 1.3.1 1.3.1
You can now apply the upgrade by executing the following command:
kubeadm upgrade apply v1.15.1
Note: Before you can perform this upgrade, you have to update kubeadm to v1.15.1.
Some important things to note here:
- Etcd is external, so upgrades there need to involve the packaged versions. Make sure that the version we are using (or one that we can upgrade to) is supported by the new version of Kubernetes before trying anything.
- kubeadm is deployed from packages, which need to be upgraded (including kubelet) in order to finish an upgrade; see the sketch below.
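The cookbook takes care of the package side for you; for reference, a rough sketch of what a manual control plane package upgrade looks like upstream (the versions are placeholders, and on our hosts the packages come from the reprepro component described above):
# upgrade kubeadm first, then let it upgrade the control plane components
sudo apt-get update
sudo apt-get install -y kubeadm=1.23.15-00
sudo kubeadm upgrade apply v1.23.15   # first control node; use 'kubeadm upgrade node' on the others
# then upgrade kubelet/kubectl and restart the kubelet
sudo apt-get install -y kubelet=1.23.15-00 kubectl=1.23.15-00
sudo systemctl daemon-reload
sudo systemctl restart kubelet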
Now wait a few minutes until the cookbook finishes. Check that all control plane pods (scheduler, apiserver and controller-manager) start up, are not crash-looping, and don't have any errors in their logs. See #Troubleshooting if they do.
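A quick way to check, assuming admin kubectl access (the node name below is hypothetical):
# are the static control plane pods on the upgraded node Running and not restarting?
kubectl -n kube-system get pods -o wide | grep control
# check each component's logs for errors, e.g. the apiserver:
kubectl -n kube-system logs kube-apiserver-tools-k8s-control-7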
Repeat the cookbook for the remaining control nodes, and check the logs again.
Upgrade worker nodes
Once the control nodes have been upgraded, we can upgrade the workers.
You now need to run the wmcs.toolforge.k8s.worker.upgrade cookbook for each worker node. The currently recommended way is to split the list of normal and NFS workers into two or three chunks, then make that many shell scripts that call the upgrade cookbook for each node in the chunk. Start those scripts in separate screen/tmux tabs.
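For example, a throwaway script for one chunk might look like this (hostnames and versions are made up):
#!/bin/bash
# upgrade one chunk of worker nodes, one at a time, in its own screen/tmux tab
set -e
for node in tools-k8s-worker-1 tools-k8s-worker-2 tools-k8s-worker-3; do
    sudo cookbook wmcs.toolforge.k8s.worker.upgrade \
        --cluster-name tools \
        --hostname "$node" \
        --src-version 1.22.17 \
        --dst-version 1.23.15
done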
Ingress nodes
The ingress nodes are similar to the worker nodes but they need some special treatment:
- On a Toolforge bastion, run kubectl sudo -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=2 to prevent an ingress controller from being scheduled on a regular node.
- Ingress pods take a while to evict. It should be safe to upgrade the ingress nodes in parallel with the normal worker nodes.
- When done, run kubectl sudo -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=3 to return the cluster to normal operation.
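To check where the ingress controller pods are currently scheduled before and after the scaling, something like:
kubectl sudo -n ingress-nginx-gen2 get pods -o wide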
Finishing touches
- Upgrade kubectl on bastions
- Revert the topic change on #wikimedia-cloud
- Remove Alertmanager downtime
Troubleshooting
Permission errors after control plane upgrades
Sometimes the control plane components log error messages after upgrading a control node, for example:
E0410 09:18:10.387734 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: leases.coordination.k8s.io "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
The exact cause of this is unknown. Some theories include a race condition in which the controller-manager pod starts before the api-server.
Try:
- a VM reboot
- if that didn't work, a manual restart of the affected static pod (move the manifest file out of /etc/kubernetes/manifests/, wait for the pod to disappear, then put the file back in the same place); see the sketch below
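A sketch of the manual static pod restart, using the controller-manager as an example (the manifest file name is the upstream kubeadm default):
# moving the manifest out of the watched directory makes the kubelet stop the pod
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /root/
# wait for the pod to disappear...
kubectl -n kube-system get pods | grep controller-manager
# ...then put it back and the kubelet recreates the pod
sudo mv /root/kube-controller-manager.yaml /etc/kubernetes/manifests/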