Kubernetes/Administration

Create a new cluster

Documentation for creating a new cluster is in Kubernetes/Clusters/New

Add a new service

Documentation on how to deploy a new service can be found at Kubernetes/Add_a_new_service

Remove a service

Documentation on how to remove a service can be found at Kubernetes/Remove_a_service

Managing pool status of a worker node

Use the sre.k8s.pool-depool-node cookbook; it will drain, cordon, and manage confd pool status for you.

Rebooting control plane nodes

Control plane nodes can be rebooted one by one via the sre.hosts.reboot-single cookbook: sudo cookbook sre.hosts.reboot-single -r "some reboots" --depool kubemaster4009.nocopypaste.wmnet

If you want to be very polite, first reboot the control plane nodes that are not currently the elected leader for kube-scheduler/kube-controller-manager, so that re-election does not need to happen twice. The current leaders can be seen with: kubectl -n kube-system get leases.coordination.k8s.io
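To read the current leader for a single component, the lease object can be queried directly. The following is a minimal sketch, not an established tool: the function names are made up for illustration, and it assumes the leases are named after the components (kube-scheduler, kube-controller-manager) and that the holder identity has the usual `<hostname>_<uuid>` form.

```shell
# Strip the "_<uuid>" suffix from a lease holder identity.
# Assumption: the identity looks like "<hostname>_<uuid>".
lease_holder_host() {
    printf '%s\n' "${1%%_*}"
}

# Print the host currently holding the given lease in kube-system.
# Only defined here, not called; it needs kubectl and a cluster context.
current_leader() {
    lease_holder_host "$(kubectl -n kube-system get lease "$1" -o jsonpath='{.spec.holderIdentity}')"
}

# Usage (on a host with cluster access):
#   current_leader kube-scheduler
#   current_leader kube-controller-manager
```

Reboot the hosts that are not printed first, and the printed ones last.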

Rebooting worker nodes

The way of the cookbook (recommended)

There is now a cookbook called sre.k8s.reboot-nodes that can be used to perform a rolling-restart of all worker nodes in a cluster.

For example, the following command performs a rolling reboot of the dse-k8s-eqiad worker nodes:

sudo cookbook sre.k8s.reboot-nodes --reason "rebooting to pick up new kernel" --task-id T321310 --alias dse-k8s-worker reboot

Behind the scenes it drains and cordons each node in turn, which is effectively the polite way described below.

The polite way

If you feel like being more polite, use kubectl drain. It marks the worker node as unschedulable, so that no new pods are placed on it, and moves the existing pods to other workers. Draining the node will take time. Rough numbers for draining each wikikube worker node at the end of 2019 were around 60 seconds.

# kubectl drain --ignore-daemonsets --delete-emptydir-data kubernetes1001.eqiad.wmnet
# kubectl describe pods  --all-namespaces | awk  '$1=="Node:" {print $NF}' | sort -u
kubernetes1002.eqiad.wmnet/10.64.16.75
kubernetes1003.eqiad.wmnet/10.64.32.23
kubernetes1004.eqiad.wmnet/10.64.48.52
kubernetes1005.eqiad.wmnet/10.64.0.145
kubernetes1006.eqiad.wmnet/10.64.32.18
# kubectl get nodes
NAME                         STATUS                     ROLES     AGE       VERSION
kubernetes1001.eqiad.wmnet   Ready,SchedulingDisabled   <none>    2y352d    v1.12.9
kubernetes1002.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1003.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1004.eqiad.wmnet   Ready                      <none>    559d      v1.12.9
kubernetes1005.eqiad.wmnet   Ready                      <none>    231d      v1.12.9
kubernetes1006.eqiad.wmnet   Ready                      <none>    231d      v1.12.9

When the node has been rebooted, it can be configured to reaccept pods using kubectl uncordon, e.g.

# kubectl uncordon kubernetes1001.eqiad.wmnet
# kubectl get nodes
NAME                         STATUS    ROLES     AGE       VERSION
kubernetes1001.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1002.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1003.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1004.eqiad.wmnet   Ready     <none>    559d      v1.12.9
kubernetes1005.eqiad.wmnet   Ready     <none>    231d      v1.12.9
kubernetes1006.eqiad.wmnet   Ready     <none>    231d      v1.12.9

The pods are not rebalanced automatically, i.e. the rebooted node is free of pods initially.
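After a round of reboots it is easy to leave a node cordoned by accident. A small sketch for spotting that, assuming the `kubectl get nodes` column layout shown above (node name first, status second); the function name is hypothetical:

```shell
# Read "kubectl get nodes --no-headers" output on stdin and print the names
# of nodes whose STATUS column contains SchedulingDisabled (i.e. cordoned).
cordoned_nodes() {
    awk '$2 ~ /SchedulingDisabled/ {print $1}'
}

# Usage (needs kubectl and a cluster context):
#   kubectl get nodes --no-headers | cordoned_nodes
```

An empty result means every node accepts pods again.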

The impolite way

In our environment, you can simply reboot a worker node. The platform will understand the event and respawn the pods on other nodes. However, the system does not automatically rebalance itself when a node rejoins the cluster, i.e. pods are not rescheduled onto the node after it has been rebooted.

Drain a dead node

A regular kubectl drain will try to gracefully evict Pods from the given node. Naturally this is not possible if the node in question is dead/not reachable for whatever reason. In such a case kubectl drain can be instructed to ignore the grace period and not to wait for the pods to gracefully exit (as that will never happen):

kubectl drain --delete-emptydir-data --ignore-daemonsets --grace-period=0 kubernetes2XXX.codfw.wmnet

Restarting specific components

kube-controller-manager and kube-scheduler are control plane components that run alongside the API server. In production, multiple instances of each run and hold an election via the API to determine which one is the leader. Restarting either has no grave consequences, so it is safe to do. However, both are critical in that they are required for the overall cluster to function smoothly. kube-scheduler is crucial for node failovers, pod evictions, etc., while kube-controller-manager packs multiple controller components and is critical for responding to pod failures, depools, etc.

The commands are:

sudo systemctl restart kube-controller-manager
sudo systemctl restart kube-scheduler

Restarting the API server

It's behind LVS in production, so it's fine to restart it as long as enough time is given between the restarts across the cluster.

sudo systemctl restart kube-apiserver

If you need to restart all API servers, it might be wise to start with the ones that are not currently leading the cluster (to avoid multiple leader elections). The current leader is stored in the control-plane.alpha.kubernetes.io/leader annotation of the kube-scheduler endpoint:

kubectl -n kube-system describe ep kube-scheduler

Switch the active staging cluster (eqiad<->codfw)

We do have one staging cluster per DC, mostly to separate staging of kubernetes and components from staging of the services running on top of it. To provide staging services during work on one of the clusters, we can (manually) switch between the DCs:

Switch staging.svc.eqiad.wmnet to point to the new active k8s cluster (we should have a better solution/DNS name for this at some point)
- https://gerrit.wikimedia.org/r/c/operations/dns/+/884900
Switch the definition of "staging" on the deployment servers, CI etc.:
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/884905
- ```
sudo cumin -b 6 'O:releases or O:ci::master or O:gitlab_runner or O:deployment_server::kubernetes' 'run-puppet-agent -q'
```

Switch the ingress dns discovery

# This will NOT trigger confd to change the DNS admin state as it will cause a validation error
sudo confctl --object-type discovery select "name=codfw,dnsdisc=k8s-ingress-staging" set/pooled=true
# Depool DNS discovery records on the old dc, confd will apply the change
sudo confctl --object-type discovery select "name=eqiad,dnsdisc=k8s-ingress-staging" set/pooled=false

Make sure all service deployments are up to date after the switch (e.g. deploy them all)
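The order of the two confctl calls matters: the new DC is pooled before the old one is depooled. A sketch that only prints the commands for review, with the DC names as parameters; the function name is made up for illustration, and the dnsdisc name is the one used above:

```shell
# Print (do not run) the two confctl commands that move the staging ingress
# discovery record from one DC to another, pooling the new DC first.
# Arguments: <new_dc> <old_dc>
print_ingress_switch() {
    printf 'sudo confctl --object-type discovery select "name=%s,dnsdisc=k8s-ingress-staging" set/pooled=true\n' "$1"
    printf 'sudo confctl --object-type discovery select "name=%s,dnsdisc=k8s-ingress-staging" set/pooled=false\n' "$2"
}

# Example: switching back from codfw to eqiad
#   print_ingress_switch eqiad codfw
```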

Managing pods, jobs and cronjobs

Commands should be run from the deployment servers (at the time of this writing deploy1002).

You need to set the correct context, for example:

kube_env <your service> eqiad

Other choices are codfw and staging.

The management command is called kubectl. You may find some more inspiration on kubectl commands at Kubernetes/kubectl_Cheat_Sheet

Listing cronjobs, jobs and pods

kubectl get cronjobs -n <namespace>
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>
kubectl get pods -n <namespace> -o wide

Note: -o wide also shows which node each pod is running on.
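To see how pods are spread across nodes (for example, to confirm that a freshly rebooted node is still empty), the node names can be counted. A minimal sketch; the function name is hypothetical, and it relies on `custom-columns` to print only the node name of each pod:

```shell
# Read one node name per line on stdin and print "<node> <pod count>",
# sorted by node name. Pods that are not scheduled yet show up as "<none>".
pods_per_node() {
    awk 'NF {count[$1]++} END {for (n in count) print n, count[n]}' | sort
}

# Usage (needs kubectl and a cluster context):
#   kubectl get pods -n <namespace> --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```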

Deleting a job

kubectl delete job <job id>

Updating the docker image run by a CronJob

The relationship between the resources is the following:

Cronjob --spawns--> Job(s) --spawns--> Pod(s)

Note: Technically speaking, it's a tight control loop that lives in kube-controller-manager that does the spawning part, but adding that to the above would make this more confusing.

Under normal conditions, a docker image version will be updated when a new deploy happens. The cronjob will have the new version. However, jobs already created by the CronJob will not be stopped until they have run to completion.

When the job finishes, the cronjob will create new job(s), which in turn will create new pod(s).

Depending on how the CronJob's schedule relates to the job's run time, there might be a window of time where, despite the new deployment, the old job is still running.

Deleting the kubernetes pod created by the job itself will NOT work, i.e. the job will still exist and it will create a new pod (which will still have the old image).

So, when dealing with a long-running Kubernetes Job, delete the Job created by the CronJob instead of its pod. The next Job the CronJob creates will use the new image.

phab:T280076 is an example where this was needed.
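To find which Jobs belong to a given CronJob before deleting one, the owner reference of each Job can be listed. A sketch under the assumption that each Job created by a CronJob carries that CronJob as its first owner reference; the function name is made up for illustration:

```shell
# Read "<job name> <owner name>" pairs on stdin and print the names of the
# Jobs whose owner is the given CronJob.
# Argument: <cronjob name>
jobs_of_cronjob() {
    awk -v owner="$1" '$2 == owner {print $1}'
}

# Usage (needs kubectl and a cluster context):
#   kubectl get jobs -n <namespace> --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name' \
#     | jobs_of_cronjob <cronjob name>
#   kubectl delete job -n <namespace> <job id>
```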

Recreate pods (of deployments, daemonsets, statefulsets, ...)

Pods which are backed by workload controllers (such as Deployments or Daemonsets) can be easily recreated, without the need to manually delete them, using `kubectl rollout`. This will make sure that the update strategy specified for the set of pods, as well as disruption budgets etc., are properly honored.

To restart all pods of a specific Deployment/Daemonset:

kubectl -n NAMESPACE rollout restart [deployment|daemonset|statefulset|...] NAME

You may also restart all Pods of all Deployments/Daemonsets in a specific namespace just by omitting the name. The command returns immediately (i.e. it does not wait for the process to complete) and Kubernetes does the actual rolling restart in the background for you. In order to restart workloads across multiple namespaces, one can use something like:

kubectl get ns -l app.kubernetes.io/managed-by=Helm -o jsonpath='{.items[*].metadata.name}' | xargs -L1 -d ' ' kubectl rollout restart deployment -n

This works with or without label filters. The label filter above ensures that, for example, workloads in pre-defined namespaces (like kube-system) do not get restarted.
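Because `rollout restart` returns immediately, it can be paired with `kubectl rollout status` when a script needs to wait for the restart to finish. A minimal sketch; the function name, the `KUBECTL` override variable and the 10 minute timeout are assumptions chosen for illustration:

```shell
# Restart one workload and block until the rollout finishes or times out.
# Arguments: <namespace> <kind> <name>, e.g. mynamespace deployment myapp
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
restart_and_wait() {
    ${KUBECTL:-kubectl} rollout restart "$2" "$3" -n "$1" &&
        ${KUBECTL:-kubectl} rollout status "$2" "$3" -n "$1" --timeout=10m
}

# Usage (needs kubectl and a cluster context):
#   restart_and_wait NAMESPACE deployment NAME
```

The function returns non-zero if either the restart or the wait fails, so it can be chained in a loop over several workloads.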

Running a rolling restart of a Helmfile service

To rolling-restart a service described by a Helmfile, you don't need to use kubectl; instead, run

cd /srv/deployment-charts/helmfile.d/services/${SERVICE?}
helmfile -e ${CLUSTER?} --state-values-set roll_restart=1 sync

Tips and tricks

Isolate a pod from traffic and deployments

In some cases, you may want to investigate the behavior of a single pod that is acting weird, while restarting the rest of them to recover the service. It is possible to do so by removing from the pod the labels that both Kubernetes Service resources as well as Deployments/ReplicaSets use to select the pods. Get the name of the pod in whatever way you can and

$ kubectl edit pod <pod_name>

remove under metadata.labels the labels that are used by the Service and the Deployment. Usually, just removing app will suffice. The label selector can be found using

kubectl get svc -o "jsonpath={.items[].spec.selector}"

You can now debug the pod with whatever tools you prefer without fear of traffic flowing to it or deployments killing it. Don't forget to manually delete it after you are done.

$ kubectl delete pod <pod_name>
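As an alternative to editing the pod interactively, `kubectl label` removes a label when the key is given with a trailing dash. A sketch, assuming the selector labels are already known (app in the usual case); the function name and the `KUBECTL` override variable are made up for illustration:

```shell
# Remove the given label keys from a pod so that Services and ReplicaSets
# selecting on them stop matching it. A trailing "-" on the key tells
# "kubectl label" to remove that label.
# Arguments: <pod name> <label key>...
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
isolate_pod() {
    pod="$1"
    shift
    for key in "$@"; do
        ${KUBECTL:-kubectl} label pod "$pod" "${key}-"
    done
}

# Usage (needs kubectl and a cluster context):
#   isolate_pod <pod_name> app
```

Once the labels are gone, the owning ReplicaSet no longer counts the pod and starts a replacement, so the isolated pod still has to be deleted by hand afterwards.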