Portal:Toolforge/Admin/Deploying k8s


This page contains information on how to deploy kubernetes for our Toolforge setup. This refers to the basic building blocks of the cluster itself (etcd, control and worker nodes, etc.) and not to end-user apps running inside kubernetes.

general considerations

Please take the following into account when building a cluster using these instructions.

etcd nodes

A working etcd cluster is the starting point for a working k8s deployment. All other k8s components require it.

The role for the VM should be role::wmcs::toolforge::k8s::etcd.

Typical hiera configuration looks like:

profile::etcd::cluster_bootstrap: false
profile::toolforge::k8s::control_nodes:
- tools-k8s-control-1.tools.eqiad.wmflabs
- tools-k8s-control-2.tools.eqiad.wmflabs
- tools-k8s-control-3.tools.eqiad.wmflabs
profile::toolforge::k8s::etcd_nodes:
- tools-k8s-etcd-1.tools.eqiad.wmflabs
- tools-k8s-etcd-2.tools.eqiad.wmflabs
- tools-k8s-etcd-3.tools.eqiad.wmflabs

When bootstrapping a brand-new etcd cluster, profile::etcd::cluster_bootstrap should be set to true.

A basic cluster health-check command:

user@tools-k8s-etcd-1:~$ sudo etcdctl --endpoints https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379 --key-file /var/lib/puppet/ssl/private_keys/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem --cert-file /var/lib/puppet/ssl/certs/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem cluster-health
member 67a7255628c1f89f is healthy: got healthy result from https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379
member 822c4bd670e96cb1 is healthy: got healthy result from https://tools-k8s-etcd-5.tools.eqiad.wmflabs:2379
member cacc7abd354d7bbf is healthy: got healthy result from https://tools-k8s-etcd-6.tools.eqiad.wmflabs:2379
cluster is healthy

See if etcd is actually storing data:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem  get / --prefix --keys-only | wc -l
290

Delete all data in etcd (warning!), for a fresh k8s start:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem del "" --from-key=true
145

Manually add a new member to the etcd cluster:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem member add tools-k8s-etcd-2.tools.eqiad.wmflabs --peer-urls="https://tools-k8s-etcd-2.tools.eqiad.wmflabs:2380"
Member bf6c18ddf5414879 added to cluster a883bf14478abd33

ETCD_NAME="tools-k8s-etcd-2.tools.eqiad.wmflabs"
ETCD_INITIAL_CLUSTER="tools-k8s-etcd-1.tools.eqiad.wmflabs=https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2380,tools-k8s-etcd-2.tools.eqiad.wmflabs=https://tools-k8s-etcd-2.tools.eqiad.wmflabs:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
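
To verify the membership change afterwards, listing the members should work (a sketch reusing the same endpoint and puppet cert flags as the commands above):

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem member list
[...]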

NOTE: the etcd service uses puppet certs.

NOTE: these VMs use internal firewalling with ferm. Rules won't pick up DNS changes automatically, so after creating or destroying VMs you might want to force a restart of the firewall with something like:

user@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:toolsbeta name:tools-k8s-etcd-.*}' 'systemctl restart ferm'

front proxy (haproxy)

The kubernetes front proxy serves both the k8s API (tcp/6443) and the ingress (tcp/30000). It is one of the key components of kubernetes networking and ingress. We use haproxy for this, in a cold-standby setup: there should be a couple of VMs, but only one is actively serving traffic at any given time.

There is a DNS name, k8s.tools.eqiad1.wikimedia.cloud, that should point to the active VM. No floating IP is involved.
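
A quick way to check which VM the name currently resolves to, and that the proxy is forwarding to a live api-server, is something like this (a rough sketch; /healthz may answer with ok or an authorization error depending on api-server settings, and either response means haproxy reached a backend):

user@cloud-cumin-01:~$ dig +short k8s.tools.eqiad1.wikimedia.cloud
user@cloud-cumin-01:~$ curl -k https://k8s.tools.eqiad1.wikimedia.cloud:6443/healthz
[...]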

The puppet role for the VMs is role::wmcs::toolforge::k8s::haproxy and a typical hiera configuration looks like:

profile::toolforge::k8s::apiserver_port: 6443
profile::toolforge::k8s::control_nodes:
- tools-k8s-control-1.tools.eqiad.wmflabs
- tools-k8s-control-2.tools.eqiad.wmflabs
- tools-k8s-control-3.tools.eqiad.wmflabs
profile::toolforge::k8s::ingress_port: 30000
profile::toolforge::k8s::worker_nodes:
- tools-k8s-worker-1.tools.eqiad.wmflabs
- tools-k8s-worker-2.tools.eqiad.wmflabs

NOTE: in the case of toolsbeta, the VMs need a security group that allows connectivity between the front proxy (in tools) and haproxy (in toolsbeta). This security group is called k8s-dynamicproxy-to-haproxy and its TCP ports should match those in hiera.
NOTE: during the initial bootstrap of the k8s cluster, the FQDN k8s.tools.eqiad1.wikimedia.cloud needs to point to the first control node, since otherwise haproxy won't see any active backend and kubeadm will fail.
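
On the active haproxy VM itself, a quick check that both frontend ports are listening (a sketch; the hostname in the prompt is illustrative):

user@tools-k8s-haproxy-1:~$ sudo ss -tlnp | grep -E '6443|30000'
[...]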

control nodes

The control nodes are the servers on which the key internal components of kubernetes run, such as the api-server, scheduler, controller-manager, etc.
There should be 3 control nodes, each a VM with at least 2 CPUs and no swap.

The puppet role for the VMs is role::wmcs::toolforge::k8s::control.

Typical hiera configuration:

profile::toolforge::k8s::apiserver_fqdn: k8s.tools.eqiad1.wikimedia.cloud
profile::toolforge::k8s::etcd_nodes:
- tools-k8s-etcd-1.tools.eqiad.wmflabs
- tools-k8s-etcd-2.tools.eqiad.wmflabs
- tools-k8s-etcd-3.tools.eqiad.wmflabs
profile::toolforge::k8s::node_token: m7uakr.ern5lmlpv7gnkacw
swap_partition: false

NOTE: when creating or deleting control nodes, you might want to restart the firewall on the etcd nodes.
NOTE: you should reboot the control node VM after the initial puppet run, to make sure iptables alternatives are taken into account by docker and kube-proxy.
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.

bootstrap

With bootstrap we refer to the process of creating the k8s cluster from scratch. In this particular case, there are no control nodes yet. You are installing the first one.

In this initial situation, the FQDN k8s.tools.eqiad1.wikimedia.cloud should point to the initial control node, since haproxy won't proxy anything to the yet-to-be-ready api-server.
Also, make sure the etcd cluster is totally fresh and clean, i.e., it doesn't store anything from previous clusters.

On the first control node, run the following commands:

root@tools-k8s-control-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@tools-k8s-control-1:~# mkdir -p $HOME/.kube
root@tools-k8s-control-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml 
podsecuritypolicy.policy/privileged-psp created
clusterrole.rbac.authorization.k8s.io/privileged-psp created
rolebinding.rbac.authorization.k8s.io/kube-system-psp created
podsecuritypolicy.policy/default created
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/toolforge-tool-role.yaml
[...]

After this, the cluster has been bootstrapped and has a single control node. This should work:

root@tools-k8s-control-1:~# kubectl get nodes
NAME                           STATUS   ROLES    AGE     VERSION
tools-k8s-control-1            Ready    master   3m26s   v1.15.1
root@tools-k8s-control-1:~# kubectl get pods --all-namespaces
NAMESPACE     NAME                                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-9cjml               1/1     Running   0          2m12s
kube-system   calico-node-g4hr7                                      1/1     Running   0          2m12s
kube-system   coredns-5c98db65d4-5wgmh                               1/1     Running   0          2m16s
kube-system   coredns-5c98db65d4-5xmnt                               1/1     Running   0          2m16s
kube-system   kube-apiserver-tools-k8s-control-1                     1/1     Running   0          96s
kube-system   kube-controller-manager-tools-k8s-control-1            1/1     Running   0          114s
kube-system   kube-proxy-7d48c                                       1/1     Running   0          2m15s
kube-system   kube-scheduler-tools-k8s-control-1                     1/1     Running   0          106s

existing cluster

Once the first control node is bootstrapped, we consider the cluster to exist. However, this cluster is designed to have 3 control nodes.
Add the additional control nodes following these steps.

First you need to obtain some data from a pre-existing control node:

root@tools-k8s-control-1:~# grep token: /etc/kubernetes/kubeadm-init.yaml
- token: "m7uakr.ern5lmlpv7gnkacw"
root@tools-k8s-control-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
2a673bbc603c0135b9ada19b862d92c46338e90798b74b04e7e7968078c78de9
root@tools-k8s-control-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
44550243d244837e17ae866e318e5d49e7db978c3a68b71216f541ca6dd18704

Then, in the new control node:

root@tools-k8s-control-2:~# kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token ${TOKEN_OUTPUT} --discovery-token-ca-cert-hash sha256:${OPENSSL_OUTPUT} --control-plane --certificate-key ${KUBEADM_OUTPUT}
root@tools-k8s-control-2:~# mkdir -p $HOME/.kube
root@tools-k8s-control-2:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config

NOTE: pay special attention to FQDNs and connectivity. You may need to restart ferm on the etcd nodes after updating the hiera keys before you can add more control nodes to an existing cluster.
NOTE: in case the token expires, you can generate a new one on an existing control node (see the sketch after these notes).
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.
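
If the token has expired, a fresh one can be created on any existing control node. A minimal sketch (kubeadm token create and kubeadm token list are standard kubeadm subcommands; the latter shows current tokens and their expiry):

root@tools-k8s-control-1:~# kubeadm token create
[...]
root@tools-k8s-control-1:~# kubeadm token list
[...]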

The complete cluster should show 3 control nodes and the corresponding pods in the kube-system namespace:

root@tools-k8s-control-2:~# kubectl get pods --all-namespaces
NAMESPACE     NAME                                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-9cjml               1/1     Running   0          117m
kube-system   calico-node-dfbqd                                      1/1     Running   0          109m
kube-system   calico-node-g4hr7                                      1/1     Running   0          117m
kube-system   calico-node-q5phv                                      1/1     Running   0          108m
kube-system   coredns-5c98db65d4-5wgmh                               1/1     Running   0          117m
kube-system   coredns-5c98db65d4-5xmnt                               1/1     Running   0          117m
kube-system   kube-apiserver-tools-k8s-control-1                     1/1     Running   0          116m
kube-system   kube-apiserver-tools-k8s-control-2                     1/1     Running   0          109m
kube-system   kube-apiserver-tools-k8s-control-3                     1/1     Running   0          108m
kube-system   kube-controller-manager-tools-k8s-control-1            1/1     Running   0          117m
kube-system   kube-controller-manager-tools-k8s-control-2            1/1     Running   0          109m
kube-system   kube-controller-manager-tools-k8s-control-3            1/1     Running   0          108m
kube-system   kube-proxy-7d48c                                       1/1     Running   0          117m
kube-system   kube-proxy-ft8zw                                       1/1     Running   0          109m
kube-system   kube-proxy-fx9sp                                       1/1     Running   0          108m
kube-system   kube-scheduler-tools-k8s-control-1                     1/1     Running   0          117m
kube-system   kube-scheduler-tools-k8s-control-2                     1/1     Running   0          109m
kube-system   kube-scheduler-tools-k8s-control-3                     1/1     Running   0          108m
root@tools-k8s-control-2:~# kubectl get nodes
NAME                           STATUS   ROLES    AGE    VERSION
tools-k8s-control-1            Ready    master   123m   v1.15.1
tools-k8s-control-2            Ready    master   112m   v1.15.1
tools-k8s-control-3            Ready    master   111m   v1.15.1

NOTE: you might want to make sure the FQDN k8s.tools.eqiad1.wikimedia.cloud is pointing to the active haproxy node, since you now have api-servers responding in the haproxy backends.

reconfiguring control plane elements after deployment

Kubeadm doesn't directly reconfigure existing nodes except, potentially, during upgrades. Therefore a change to the init file won't do much for a cluster that is already built. To change some element of the control plane, such as kube-apiserver command-line arguments, you will want to update:

  1. The ConfigMap in the kube-system namespace called kubeadm-config. It can be altered with a command like
    root@tools-k8s-control-2:~# kubectl edit cm -n kube-system kubeadm-config
    
  2. The manifest for the control plane element you are altering, e.g. adding a command line argument for kube-apiserver by editing /etc/kubernetes/manifests/kube-apiserver.yaml, which will automatically restart that component (a quick way to verify the restart is sketched after the notes below).

Updating the kubeadm-config ConfigMap as well should prevent kubeadm from overwriting your manual changes later (e.g. during an upgrade).

NOTE: Remember to change the manifest files on all control plane nodes.
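
After editing a manifest in step 2, a quick way to confirm that the kubelet restarted the corresponding static pod is to check its age (a sketch; component=kube-apiserver is the label kubeadm sets on its static pods, adjust for other components):

root@tools-k8s-control-1:~# kubectl get pods -n kube-system -l component=kube-apiserver -o wide
[...]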

worker nodes

Worker nodes should be created as VM instances with a minimum of 2 CPUs and Debian Buster as the operating system.

The puppet role for them is role::wmcs::toolforge::k8s::worker. No special hiera configuration is required, other than:

swap_partition: false

NOTE: you should reboot the worker node VM after the initial puppet run, to make sure iptables alternatives are taken into account by docker, kube-proxy and calico.
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.

To join a worker node to the cluster, first get a couple of values from a control node:

root@tools-k8s-control-1:~# grep token: /etc/kubernetes/kubeadm-init.yaml
- token: "m7uakr.ern5lmlpv7gnkacw"
root@tools-k8s-control-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
44550243d244837e17ae866e318e5d49e7db978c3a68b71216f541ca6dd18704

And then run kubeadm:

root@tools-k8s-worker-1:~# kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token ${TOKEN_OUTPUT} --discovery-token-ca-cert-hash sha256:${OPENSSL_OUTPUT}

After this, you should see the new worker node being reported:

root@tools-k8s-control-2:~# kubectl get nodes
NAME                           STATUS     ROLES    AGE    VERSION
tools-k8s-control-1            Ready      master   162m   v1.15.1
tools-k8s-control-2            Ready      master   151m   v1.15.1
tools-k8s-control-3            Ready      master   150m   v1.15.1
tools-k8s-worker-1             Ready      <none>   53s    v1.15.1
tools-k8s-worker-2             NotReady   <none>   20s    v1.15.1

NOTE: you should add the new VMs to the profile::toolforge::k8s::worker_nodes hiera key for haproxy nodes.
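
After updating that hiera key, puppet needs to run on the haproxy nodes before the new workers appear as backends. Something like this should confirm it (a sketch; the hostname in the prompt is illustrative and the haproxy config location is an assumption about how puppet lays it out on those hosts):

user@tools-k8s-haproxy-1:~$ sudo run-puppet-agent
user@tools-k8s-haproxy-1:~$ grep -r tools-k8s-worker /etc/haproxy/
[...]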

other components

Once the basic components are deployed (etcd, haproxy, control and worker nodes), other components should be deployed as well.

ingress

Load a couple of yaml files:

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/psp/nginx-ingress-psp.yaml 
clusterrolebinding.rbac.authorization.k8s.io/nginx-ingress-psp created

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/nginx-ingress.yaml 
namespace/ingress-nginx created
configmap/nginx-configuration created
configmap/tcp-services created
configmap/udp-services created
serviceaccount/nginx-ingress created
clusterrole.rbac.authorization.k8s.io/nginx-ingress created
role.rbac.authorization.k8s.io/nginx-ingress created
rolebinding.rbac.authorization.k8s.io/nginx-ingress created
clusterrolebinding.rbac.authorization.k8s.io/nginx-ingress created
deployment.apps/nginx-ingress created
service/ingress-nginx created

This should also be done whenever the yaml files are updated, since puppet won't load them into the cluster automatically.

The nginx-ingress pod should be running shortly after:

root@tools-k8s-control-2:~# kubectl get pods -n ingress-nginx
NAME                            READY   STATUS    RESTARTS   AGE
nginx-ingress-95c8858c9-qqlgd   1/1     Running   0          2d21h

first tool: fourohfour

This should be one of the first tools deployed, since it handles 404 situations for webservices. The kubernetes service provided by this tool is set as the default backend for nginx-ingress.

TODO: describe how to deploy it.
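
To see how the default backend is currently wired, inspecting the arguments of the nginx-ingress deployment is usually enough (a sketch; whether a --default-backend-service argument or a catch-all ingress object is used depends on the nginx-ingress.yaml that puppet ships):

root@tools-k8s-control-1:~# kubectl get deployment nginx-ingress -n ingress-nginx -o jsonpath='{.spec.template.spec.containers[0].args}'
[...]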

custom admission controllers

Custom admission controllers are webhooks in the k8s API that perform extended checks before an API action is completed. This allows us to enforce certain configurations in Toolforge.

registry admission

This custom admission controller ensures that pods created in the cluster use docker images from our internal docker registry.

Source code for this admission controller is at https://gerrit.wikimedia.org/r/admin/projects/labs/tools/registry-admission-webhook

TODO: how do we deploy it?

ingress admission

This custom admission controller ensures that ingress objects in the cluster have a minimal valid configuration. Ingress objects can be arbitrarily created by Toolforge users, and arbitrary routing information can cause disruption to other webservices running in the cluster.

A couple of things this controller enforces:

  • only the toolforge.org or tools.wmflabs.org domains are used
  • only service backends that belong to the same namespace as the ingress object are used

Source code for this admission controller is in https://gerrit.wikimedia.org/r/admin/projects/cloud/toolforge/ingress-admission-controller

The canonical instructions for deploying are on the README.md at the repo, and changes to those instructions may appear there first. A general summary follows:

  1. Build the container image locally and copy it to the docker-builder host (currently tools-docker-builder-06.tools.eqiad.wmflabs). The version of docker there does not support builder containers yet, so the image should be built locally with the appropriate tag
    $ docker build . -t docker-registry.tools.wmflabs.org/ingress-admission:latest
    
    and then copied over by saving it to a tarball and using scp to get it onto the docker-builder host
    $ docker save -o saved_image.tar docker-registry.tools.wmflabs.org/ingress-admission:latest
    
    and load it into docker there
    root@tools-docker-builder-06:~# docker load -i /home/bstorm/saved_image.tar
    
  2. Push the image to the internal repo
    root@tools-docker-builder-06:~# docker push docker-registry.tools.wmflabs.org/ingress-admission:latest
    
  3. On a control plane node, with a checkout of the repo somewhere (a home directory is fine), as root or a Kubernetes admin user, run
    root@tools-k8s-control-1:# ./get-cert.sh
    
  4. Then run
    root@tools-k8s-control-1:# ./ca-bundle.sh
    
    which will insert the right ca-bundle into the service.yaml manifest.
  5. Now run
    root@tools-k8s-control-1:# kubectl create -f service.yaml
    
    to launch it in the cluster (a quick verification sketch follows).
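
Once service.yaml is applied, a quick sanity check is to confirm that the webhook is registered with the API and that its pod is running (a sketch; the namespace in the second command is an assumption, use whatever namespace service.yaml actually creates):

root@tools-k8s-control-1:~# kubectl get validatingwebhookconfigurations
[...]
root@tools-k8s-control-1:~# kubectl get pods -n ingress-admission
[...]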

prometheus metrics

We have an external prometheus server (i.e., prometheus is not running inside the k8s cluster). This server is usually tools-prometheus-01.eqiad.wmflabs or another VM with the same name pattern.

On the k8s cluster side, all that is required is:

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/prometheus_metrics.yaml
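
If you want to see which objects that manifest defines before touching the cluster, a client-side dry run works (a sketch; --dry-run here is the older boolean form used by this kubectl version, newer versions expect --dry-run=client):

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/prometheus_metrics.yaml --dry-run
[...]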

See also