Portal:Toolforge/Admin/Deploying k8s


This page contains information on how to deploy kubernetes for our Toolforge setup. This refers to the basic building blocks of the cluster itself (etcd, control and worker nodes, etc.) and not to end-user apps running inside kubernetes.

general considerations

Please take the following into account when building a cluster using these instructions.

etcd nodes

A working etcd cluster is the starting point for a working k8s deployment. All other k8s components require it.

The role for the VM should be role::wmcs::toolforge::k8s::etcd.

Typical hiera configuration looks like:

profile::etcd::cluster_bootstrap: false
profile::toolforge::k8s::control_nodes:
- tools-k8s-control-1.tools.eqiad.wmflabs
- tools-k8s-control-2.tools.eqiad.wmflabs
- tools-k8s-control-3.tools.eqiad.wmflabs
profile::toolforge::k8s::etcd_nodes:
- tools-k8s-etcd-1.tools.eqiad.wmflabs
- tools-k8s-etcd-2.tools.eqiad.wmflabs
- tools-k8s-etcd-3.tools.eqiad.wmflabs

When bootstrapping a brand-new etcd cluster, profile::etcd::cluster_bootstrap should be set to true.

A basic cluster health-check command:

user@tools-k8s-etcd-1:~$ sudo etcdctl --endpoints https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379 --key-file /var/lib/puppet/ssl/private_keys/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem --cert-file /var/lib/puppet/ssl/certs/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem cluster-health
member 67a7255628c1f89f is healthy: got healthy result from https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379
member 822c4bd670e96cb1 is healthy: got healthy result from https://tools-k8s-etcd-5.tools.eqiad.wmflabs:2379
member cacc7abd354d7bbf is healthy: got healthy result from https://tools-k8s-etcd-6.tools.eqiad.wmflabs:2379
cluster is healthy

See if etcd is actually storing data:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-4.tools.eqiad.wmflabs.pem  get / --prefix --keys-only | wc -l
290

Delete all data in etcd (warning!), for a fresh k8s start:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem del "" --from-key=true
145

Manually add a new member to the etcd cluster:

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem member add tools-k8s-etcd-2.tools.eqiad.wmflabs --peer-urls="https://tools-k8s-etcd-2.tools.eqiad.wmflabs:2380"
Member bf6c18ddf5414879 added to cluster a883bf14478abd33

ETCD_NAME="tools-k8s-etcd-2.tools.eqiad.wmflabs"
ETCD_INITIAL_CLUSTER="tools-k8s-etcd-1.tools.eqiad.wmflabs=https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2380,tools-k8s-etcd-2.tools.eqiad.wmflabs=https://tools-k8s-etcd-2.tools.eqiad.wmflabs:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
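
To verify the membership change afterwards, listing the members should work (a sketch reusing the same endpoint and puppet cert flags as the commands above):

user@tools-k8s-etcd-1:~$ sudo ETCDCTL_API=3 etcdctl --endpoints https://tools-k8s-etcd-1.tools.eqiad.wmflabs:2379 --key=/var/lib/puppet/ssl/private_keys/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem --cert=/var/lib/puppet/ssl/certs/tools-k8s-etcd-1.tools.eqiad.wmflabs.pem member list
[...]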

NOTE: the etcd service uses puppet certs.

NOTE: these VMs use internal firewalling with ferm. Rules won't pick up DNS changes automatically, so after creating or destroying VMs you might want to force a restart of the firewall with something like:

user@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:toolsbeta name:tools-k8s-etcd-.*}' 'systemctl restart ferm'

front proxy (haproxy)

The kubernetes front proxy serves both the k8s API (tcp/6443) and the ingress (tcp/30000). It is one of the key components of kubernetes networking and ingress. We use haproxy for this, in a cold-standby setup: there should be a couple of VMs, but only one is actively serving traffic at any given time.

There is a DNS name, k8s.tools.eqiad1.wikimedia.cloud, that should point to the active VM. No floating IP is involved.
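
A quick way to check which VM the name currently resolves to, and that the proxy is forwarding to a live api-server, is something like this (a rough sketch; /healthz may answer with ok or an authorization error depending on api-server settings, and either response means haproxy reached a backend):

user@cloud-cumin-01:~$ dig +short k8s.tools.eqiad1.wikimedia.cloud
user@cloud-cumin-01:~$ curl -k https://k8s.tools.eqiad1.wikimedia.cloud:6443/healthz
[...]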

The puppet role for the VMs is role::wmcs::toolforge::k8s::haproxy and a typical hiera configuration looks like:

profile::toolforge::k8s::apiserver_port: 6443
profile::toolforge::k8s::control_nodes:
- tools-k8s-control-1.tools.eqiad.wmflabs
- tools-k8s-control-2.tools.eqiad.wmflabs
- tools-k8s-control-3.tools.eqiad.wmflabs
profile::toolforge::k8s::ingress_port: 30000
profile::toolforge::k8s::worker_nodes:
- tools-k8s-worker-1.tools.eqiad.wmflabs
- tools-k8s-worker-2.tools.eqiad.wmflabs

NOTE: in the case of toolsbeta, the VMs need a security group that allows connectivity between the front proxy (in tools) and haproxy (in toolsbeta). This security group is called k8s-dynamicproxy-to-haproxy and its TCP ports should match those in hiera.
NOTE: during the initial bootstrap of the k8s cluster, the FQDN k8s.tools.eqiad1.wikimedia.cloud needs to point to the first control node, since otherwise haproxy won't see any active backend and kubeadm will fail.
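
On the active haproxy VM itself, a quick check that both frontend ports are listening (a sketch; the hostname in the prompt is illustrative):

user@tools-k8s-haproxy-1:~$ sudo ss -tlnp | grep -E '6443|30000'
[...]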

control nodes

The control nodes are the servers on which the key internal components of kubernetes run, such as the api-server, scheduler, controller-manager, etc.
There should be 3 control nodes, each a VM with at least 2 CPUs and no swap.

The puppet role for the VMs is role::wmcs::toolforge::k8s::control.

Typical hiera configuration:

profile::toolforge::k8s::apiserver_fqdn: k8s.tools.eqiad1.wikimedia.cloud
profile::toolforge::k8s::etcd_nodes:
- tools-k8s-etcd-1.tools.eqiad.wmflabs
- tools-k8s-etcd-2.tools.eqiad.wmflabs
- tools-k8s-etcd-3.tools.eqiad.wmflabs
profile::toolforge::k8s::node_token: m7uakr.ern5lmlpv7gnkacw
swap_partition: false

NOTE: when creating or deleting control nodes, you might want to restart the firewall on the etcd nodes.
NOTE: you should reboot the control node VM after the initial puppet run, to make sure iptables alternatives are taken into account by docker and kube-proxy.
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.

bootstrap

With bootstrap we refer to the process of creating the k8s cluster from scratch. In this particular case, there are no control nodes yet. You are installing the first one.

In this initial situation, the FQDN k8s.tools.eqiad1.wikimedia.cloud should point to the initial control node, since haproxy won't proxy anything to the yet-to-be-ready api-server.
Also, make sure the etcd cluster is totally fresh and clean, i.e., it doesn't store anything from previous clusters.

On the first control node, run the following commands:

root@tools-k8s-control-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@tools-k8s-control-1:~# mkdir -p $HOME/.kube
root@tools-k8s-control-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml 
podsecuritypolicy.policy/privileged-psp created
clusterrole.rbac.authorization.k8s.io/privileged-psp created
rolebinding.rbac.authorization.k8s.io/kube-system-psp created
podsecuritypolicy.policy/default created
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]
root@tools-k8s-control-1:~# kubectl apply -f /etc/kubernetes/toolforge-tool-role.yaml
[...]

After this, the cluster has been bootstrapped and has a single control node. This should work:

root@tools-k8s-control-1:~# kubectl get nodes
NAME                           STATUS   ROLES    AGE     VERSION
tools-k8s-control-1            Ready    master   3m26s   v1.15.1
root@tools-k8s-control-1:~# kubectl get pods --all-namespaces
NAMESPACE     NAME                                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-9cjml               1/1     Running   0          2m12s
kube-system   calico-node-g4hr7                                      1/1     Running   0          2m12s
kube-system   coredns-5c98db65d4-5wgmh                               1/1     Running   0          2m16s
kube-system   coredns-5c98db65d4-5xmnt                               1/1     Running   0          2m16s
kube-system   kube-apiserver-tools-k8s-control-1                     1/1     Running   0          96s
kube-system   kube-controller-manager-tools-k8s-control-1            1/1     Running   0          114s
kube-system   kube-proxy-7d48c                                       1/1     Running   0          2m15s
kube-system   kube-scheduler-tools-k8s-control-1                     1/1     Running   0          106s

existing cluster

Once the first control node is bootstrapped, we consider the cluster to exist. However, this cluster is designed to have 3 control nodes.
Add the additional control nodes following these steps.

First you need to obtain some data from a pre-existing control node:

root@tools-k8s-control-1:~# grep token: /etc/kubernetes/kubeadm-init.yaml
- token: "m7uakr.ern5lmlpv7gnkacw"
root@tools-k8s-control-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
2a673bbc603c0135b9ada19b862d92c46338e90798b74b04e7e7968078c78de9
root@tools-k8s-control-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
44550243d244837e17ae866e318e5d49e7db978c3a68b71216f541ca6dd18704

Then, in the new control node:

root@tools-k8s-control-2:~# kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token ${TOKEN_OUTPUT} --discovery-token-ca-cert-hash sha256:${OPENSSL_OUTPUT} --control-plane --certificate-key ${KUBEADM_OUTPUT}
root@tools-k8s-control-2:~# mkdir -p $HOME/.kube
root@tools-k8s-control-2:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config

NOTE: pay special attention to FQDNs and connectivity. You may need to restart ferm on the etcd nodes after updating the hiera keys before you can add more control nodes to an existing cluster.
NOTE: in case the token expires, you can generate a new one on an existing control node (see the sketch after these notes).
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.
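
If the token has expired, a fresh one can be created on any existing control node. A minimal sketch (kubeadm token create and kubeadm token list are standard kubeadm subcommands; the latter shows current tokens and their expiry):

root@tools-k8s-control-1:~# kubeadm token create
[...]
root@tools-k8s-control-1:~# kubeadm token list
[...]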

The complete cluster should show 3 control nodes and the corresponding pods in the kube-system namespace:

root@tools-k8s-control-2:~# kubectl get pods --all-namespaces
NAMESPACE     NAME                                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-9cjml               1/1     Running   0          117m
kube-system   calico-node-dfbqd                                      1/1     Running   0          109m
kube-system   calico-node-g4hr7                                      1/1     Running   0          117m
kube-system   calico-node-q5phv                                      1/1     Running   0          108m
kube-system   coredns-5c98db65d4-5wgmh                               1/1     Running   0          117m
kube-system   coredns-5c98db65d4-5xmnt                               1/1     Running   0          117m
kube-system   kube-apiserver-tools-k8s-control-1                     1/1     Running   0          116m
kube-system   kube-apiserver-tools-k8s-control-2                     1/1     Running   0          109m
kube-system   kube-apiserver-tools-k8s-control-3                     1/1     Running   0          108m
kube-system   kube-controller-manager-tools-k8s-control-1            1/1     Running   0          117m
kube-system   kube-controller-manager-tools-k8s-control-2            1/1     Running   0          109m
kube-system   kube-controller-manager-tools-k8s-control-3            1/1     Running   0          108m
kube-system   kube-proxy-7d48c                                       1/1     Running   0          117m
kube-system   kube-proxy-ft8zw                                       1/1     Running   0          109m
kube-system   kube-proxy-fx9sp                                       1/1     Running   0          108m
kube-system   kube-scheduler-tools-k8s-control-1                     1/1     Running   0          117m
kube-system   kube-scheduler-tools-k8s-control-2                     1/1     Running   0          109m
kube-system   kube-scheduler-tools-k8s-control-3                     1/1     Running   0          108m
root@tools-k8s-control-2:~# kubectl get nodes
NAME                           STATUS   ROLES    AGE    VERSION
tools-k8s-control-1            Ready    master   123m   v1.15.1
tools-k8s-control-2            Ready    master   112m   v1.15.1
tools-k8s-control-3            Ready    master   111m   v1.15.1

NOTE: you might want to make sure the FQDN k8s.tools.eqiad1.wikimedia.cloud is pointing to the active haproxy node, since you now have api-servers responding in the haproxy backends.

reconfiguring control plane elements after deployment

Kubeadm doesn't directly reconfigure existing nodes except, potentially, during upgrades. Therefore a change to the init file won't do much for a cluster that is already built. To change some element of the control plane, such as kube-apiserver command-line arguments, you will want to update:

  1. The ConfigMap in the kube-system namespace called kubeadm-config. It can be altered with a command like
    root@tools-k8s-control-2:~# kubectl edit cm -n kube-system kubeadm-config
    
  2. The manifest for the control plane element you are altering, e.g. adding a command line argument for kube-apiserver by editing /etc/kubernetes/manifests/kube-apiserver.yaml, which will automatically restart that component (a quick way to verify the restart is sketched after the notes below).

Updating the kubeadm-config ConfigMap as well should prevent kubeadm from overwriting your manual changes later (e.g. during an upgrade).

NOTE: Remember to change the manifest files on all control plane nodes.
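
After editing a manifest in step 2, a quick way to confirm that the kubelet restarted the corresponding static pod is to check its age (a sketch; component=kube-apiserver is the label kubeadm sets on its static pods, adjust for other components):

root@tools-k8s-control-1:~# kubectl get pods -n kube-system -l component=kube-apiserver -o wide
[...]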

worker nodes

Worker nodes should be created as VM instances with a minimum of 2 CPUs and Debian Buster as the operating system.

The puppet role for them is role::wmcs::toolforge::k8s::worker. No special hiera configuration is required, other than:

swap_partition: false

NOTE: you should reboot the worker node VM after the initial puppet run, to make sure iptables alternatives are taken into account by docker, kube-proxy and calico.
NOTE: control and worker nodes require the tools-new-k8s-full-connectivity neutron security group.

To join a worker node to the cluster, first get a couple of values from a control node:

root@tools-k8s-control-1:~# grep token: /etc/kubernetes/kubeadm-init.yaml
- token: "m7uakr.ern5lmlpv7gnkacw"
root@tools-k8s-control-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
44550243d244837e17ae866e318e5d49e7db978c3a68b71216f541ca6dd18704

And then run kubeadm:

root@tools-k8s-worker-1:~# kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token ${TOKEN_OUTPUT} --discovery-token-ca-cert-hash sha256:${OPENSSL_OUTPUT}

After this, you should see the new worker node being reported:

root@tools-k8s-control-2:~# kubectl get nodes
NAME                           STATUS     ROLES    AGE    VERSION
tools-k8s-control-1            Ready      master   162m   v1.15.1
tools-k8s-control-2            Ready      master   151m   v1.15.1
tools-k8s-control-3            Ready      master   150m   v1.15.1
tools-k8s-worker-1             Ready      <none>   53s    v1.15.1
tools-k8s-worker-2             NotReady   <none>   20s    v1.15.1

NOTE: you should add the new VMs to the profile::toolforge::k8s::worker_nodes hiera key for haproxy nodes.
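
After updating that hiera key, puppet needs to run on the haproxy nodes before the new workers appear as backends. Something like this should confirm it (a sketch; the hostname in the prompt is illustrative and the haproxy config location is an assumption about how puppet lays it out on those hosts):

user@tools-k8s-haproxy-1:~$ sudo run-puppet-agent
user@tools-k8s-haproxy-1:~$ grep -r tools-k8s-worker /etc/haproxy/
[...]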

other components

Once the basic components are deployed (etcd, haproxy, control and worker nodes), other components should be deployed as well.

ingress

Load a couple of yaml files:

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/psp/nginx-ingress-psp.yaml 
clusterrolebinding.rbac.authorization.k8s.io/nginx-ingress-psp created

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/nginx-ingress.yaml 
namespace/ingress-nginx created
configmap/nginx-configuration created
configmap/tcp-services created
configmap/udp-services created
serviceaccount/nginx-ingress created
clusterrole.rbac.authorization.k8s.io/nginx-ingress created
role.rbac.authorization.k8s.io/nginx-ingress created
rolebinding.rbac.authorization.k8s.io/nginx-ingress created
clusterrolebinding.rbac.authorization.k8s.io/nginx-ingress created
deployment.apps/nginx-ingress created
service/ingress-nginx created

This should also be done whenever the yaml files are updated, since puppet won't load them into the cluster automatically.

The nginx-ingress pod should be running shortly after:

root@tools-k8s-control-2:~# kubectl get pods -n ingress-nginx
NAME                            READY   STATUS    RESTARTS   AGE
nginx-ingress-95c8858c9-qqlgd   1/1     Running   0          2d21h

first tool: fourohfour

This should be one of the first tools deployed, since it handles 404 situations for webservices. The kubernetes service provided by this tool is set as the default backend for nginx-ingress.

TODO: describe how to deploy it.
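
To see how the default backend is currently wired, inspecting the arguments of the nginx-ingress deployment is usually enough (a sketch; whether a --default-backend-service argument or a catch-all ingress object is used depends on the nginx-ingress.yaml that puppet ships):

root@tools-k8s-control-1:~# kubectl get deployment nginx-ingress -n ingress-nginx -o jsonpath='{.spec.template.spec.containers[0].args}'
[...]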

custom admission controllers

Custom admission controllers are webhooks in the k8s API that perform extended checks before an API action is completed. This allows us to enforce certain configurations in Toolforge.

registry admission

This custom admission controller ensures that pods created in the cluster use docker images from our internal docker registry.

Source code for this admission controller is at https://gerrit.wikimedia.org/r/admin/projects/labs/tools/registry-admission-webhook

TODO: how do we deploy it?

ingress admission

This custom admission controller ensures that ingress objects in the cluster have a minimal valid configuration. Ingress objects can be arbitrarily created by Toolforge users, and arbitrary routing information can cause disruption to other webservices running in the cluster.

A couple of things this controller enforces:

  • only the toolforge.org or tools.wmflabs.org domains are used
  • only service backends that belong to the same namespace as the ingress object are used

Source code for this admission controller is in https://gerrit.wikimedia.org/r/admin/projects/cloud/toolforge/ingress-admission-controller

The canonical instructions for deploying are on the README.md at the repo, and changes to those instructions may appear there first. A general summary follows:

  1. Build the container image locally and copy it to the docker-builder host (currently tools-docker-builder-06.tools.eqiad.wmflabs). The version of docker there does not support builder containers yet, so the image should be built locally with the appropriate tag
    $ docker build . -t docker-registry.tools.wmflabs.org/ingress-admission:latest
    
    and then copied over by saving it to a tarball and using scp to get it onto the docker-builder host
    $ docker save -o saved_image.tar docker-registry.tools.wmflabs.org/ingress-admission:latest
    
    and load it into docker there
    root@tools-docker-builder-06:~# docker load -i /home/bstorm/saved_image.tar
    
  2. Push the image to the internal repo
    root@tools-docker-builder-06:~# docker push docker-registry.tools.wmflabs.org/ingress-admission:latest
    
  3. On a control plane node, with a checkout of the repo somewhere (a home directory is fine), as root or a Kubernetes admin user, run
    root@tools-k8s-control-1:# ./get-cert.sh
    
  4. Then run
    root@tools-k8s-control-1:# ./ca-bundle.sh
    
    which will insert the right ca-bundle into the service.yaml manifest.
  5. Now run
    root@tools-k8s-control-1:# kubectl create -f service.yaml
    
    to launch it in the cluster (a quick verification sketch follows).
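
Once service.yaml is applied, a quick sanity check is to confirm that the webhook is registered with the API and that its pod is running (a sketch; the namespace in the second command is an assumption, use whatever namespace service.yaml actually creates):

root@tools-k8s-control-1:~# kubectl get validatingwebhookconfigurations
[...]
root@tools-k8s-control-1:~# kubectl get pods -n ingress-admission
[...]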

prometheus metrics

We have an external prometheus server (i.e., prometheus is not running inside the k8s cluster). This server is usually tools-prometheus-01.eqiad.wmflabs or another VM with the same name pattern.

On the k8s cluster side, all that is required is:

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/prometheus_metrics.yaml
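
If you want to see which objects that manifest defines before touching the cluster, a client-side dry run works (a sketch; --dry-run here is the older boolean form used by this kubectl version, newer versions expect --dry-run=client):

root@tools-k8s-control-2:~# kubectl apply -f /etc/kubernetes/prometheus_metrics.yaml --dry-run
[...]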

See also