This guide assumes you have a basic understanding of the various kubernetes components. If you don't, please refer to https://kubernetes.io/docs/concepts/overview/components/

This guide is written for WMF SREs; it is NOT meant to be followed by non-SRE people.

Please have a kickoff meeting with the Kubernetes Special Interest Group (SIG) before spinning up a new cluster. They will help with answering questions you might have regarding the process. Contact point: kubernetes-sig@lists.wikimedia.org

Intro

This is a guide for setting up a new cluster, or reinitializing an existing one, from scratch (or almost from scratch) using the already present wikimedia infrastructure. A quick primer:

A vanilla kubernetes is made up of the following components:

etcd
Control plane
- kube-apiserver
- kube-controller-manager
- kube-scheduler
Node
- kube-proxy
- kubelet

Note that upstream documents also refer to another control-plane component, namely cloud-controller-manager. We don't run cloud-controller-manager as we are not in a cloud.

In our infrastructure the 3 control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) are assumed to be collocated on the same servers and to talk over localhost. Kubelet and kube-proxy are assumed to be collocated on every kubernetes node. etcd is assumed to run on 3 dedicated nodes that are different from all the others. Those assumptions might be revisited at some point; these docs will be updated when that happens.

Our services/main cluster also uses calico as CNI (container networking interface) and helm as a deployment tool. Those are covered as well in the networking and deployment sections.

Versions

Kubernetes versioning is important and brutal. You might want to have a peek at our kubernetes components upgrade policy: Kubernetes/Kubernetes_Infrastructure_upgrade_policy

This guide currently covers kubernetes 1.23 with calico 3.23

Prerequisites

Make sure you accept the restrictions about the versions above.
Allocate IP spaces for your cluster.
- Calculate the maximum number of pods you want to support and use a subnet calculator (e.g. sipcalc) to figure out what IPv4 subnet you require. For example, for 100 pods, 128 pod IPs should be ok, so a /25 is enough. If you plan on a maximum of 1000 pods, you need 4 /24s (256 IPs each), so a /22. Allocate them as active in Netbox. We can always add more pools later, but with IPv4 it's better to keep things tidy. Don't forget IPv6: allocate a /64. It should be enough regardless of the number of pods and will allow for growth.
- Calculate the maximum number of services you want to have. It will obviously be smaller than the number of pods; unless you plan to expose more than 250 services, a /24 should be more than enough. Allocate it in Netbox. Don't forget IPv6: allocate a /64. It should be enough regardless of growth.
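
The sizing arithmetic above can be sketched with Python's standard ipaddress module, as an alternative to a subnet calculator. This is only an illustration: the function name ipv4_prefix_for and the 10.0.0.0/22 range are placeholders invented for the example, not values to allocate.

```python
import ipaddress


def ipv4_prefix_for(addresses: int) -> int:
    """Return the longest IPv4 prefix length whose block holds `addresses` IPs."""
    if addresses < 1:
        raise ValueError("addresses must be a positive integer")
    # (n - 1).bit_length() is the number of host bits needed to cover n addresses.
    host_bits = (addresses - 1).bit_length()
    if host_bits > 32:
        raise ValueError("does not fit in IPv4")
    return 32 - host_bits


# Placeholder range for illustration only; use the range you allocated in Netbox.
pod_pool = ipaddress.ip_network("10.0.0.0/22")

print(ipv4_prefix_for(100))    # prefix length needed for 100 pods
print(ipv4_prefix_for(1000))   # prefix length needed for 1000 pods
print(pod_pool.num_addresses)  # total addresses in the placeholder /22
for subnet in pod_pool.subnets(new_prefix=24):
    print(subnet)              # the /24s that make up the /22
```

The helper only rounds up to the next power of two; it does not account for any addresses you may want to keep in reserve.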

helmfile.d structure

We use helmfile extensively for all deployments, including creating all the cluster configuration.

Clone "https://gerrit.wikimedia.org/r/operations/deployment-charts" and navigate to the helmfile.d/admin_ng/values hierarchy. There is one directory per cluster. Copy one of those and amend it to fit your cluster.

This change may be merged at any time prior to bootstrapping the cluster.

There are at least three files that WILL require alteration:

calico-values.yaml
coredns-values.yaml
cfssl-issuer-values.yaml

File calico-values.yaml

BGPConfiguration:
  asNumber: 64602
  nodeToNodeMeshEnabled: false

IPPools:
  # These are the IP spaces you reserved for the cluster. It of course varies per DC
  ipv4-1:
    cidr: "myipv4/24"
  ipv6:
    cidr: "myipv6/64"

File coredns-values.yaml

service:
  # This is the cluster level IP that coredns will listen on. It MUST be in the service ip range you reserved previously and it MUST NOT be the very first one (.1) as that is internally used by kubernetes
  clusterIP: X.Y.Z.W
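
The two constraints on clusterIP stated in the comment above (inside the service IP range, and not the very first address) can be checked before sending the patch. The following is a minimal sketch using Python's ipaddress module; the function name and the 192.0.2.0/24 documentation range are assumptions made for the example.

```python
import ipaddress


def valid_coredns_ip(cluster_ip: str, service_cidr: str) -> bool:
    """Check a candidate CoreDNS clusterIP against the reserved service range."""
    ip = ipaddress.ip_address(cluster_ip)
    net = ipaddress.ip_network(service_cidr)
    # The very first address (.1 in a /24) is used internally by kubernetes.
    first_address = net.network_address + 1
    return ip in net and ip != first_address


# 192.0.2.0/24 is a documentation range used here as a stand-in
# for the service IP range reserved in Netbox.
print(valid_coredns_ip("192.0.2.3", "192.0.2.0/24"))
print(valid_coredns_ip("192.0.2.1", "192.0.2.0/24"))
```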

File cfssl-issuer-values.yaml

  # This is a reference to a new signing profile for the discovery intermediate CA. It will be used with cfssl-signer to issue certificates to pods.
  issuers:
    discovery:
      profile: <myclustername>

Note: You must follow the PKI/CA_Operations#Adding_a_new_signing_profile steps to add the above signing profile to the discovery intermediate CA.

Components

PKI intermediates

Kubernetes makes heavy use of the PKI infrastructure for both encrypting and authenticating communication between its components.

For a new kubernetes cluster (or cluster group), two intermediates need to be created following the documentation at PKI/CA Operations:

<cluster (group) name>
<cluster (group) name>_front_proxy

See the config for wikikube clusters as an example:

  wikikube:
    # Main CA for the wikikube kubernetes cluster
    # https://v1-23.docs.kubernetes.io/docs/setup/best-practices/certificates/#all-certificates
    ocsp_port: <use a free port, leave some space in between>
    profiles:
      # Keys with this profile are used to sign/verify service account tokens so
      # there is no need for server or client auth.
      service-account-management:
        usages:
          - 'digital signature'
          - 'key encipherment'
  wikikube_front_proxy:
    # Separate CA for the front proxy, using the same as for client-auth won't work:
    # https://v1-23.docs.kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/#ca-reusage-and-conflicts
    # Kubernetes will only use the default profile for client auth certs
    ocsp_port: <port from above += 1>

etcd

etcd is a distributed datastore using the Raft algorithm for consensus. It is used by kubernetes to store cluster configuration as well as deployment data. In WMF it is also used for pybal, so there is some knowledge.

Depending on the criticality of your new cluster, request an odd number (the recommended value is 3) of small VMs on the phabricator vm-requests project via SRE_Team_requests#Virtual_machine_requests_(Production). Then use Ganeti to create those VMs. NOTE: for etcd on Ganeti, drbd needs to be disabled. After provisioning, follow the dedicated Etcd guide.
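
The reason for preferring an odd number of etcd members follows from Raft's majority quorum: adding one member to an odd-sized cluster raises the quorum without raising the number of failures that can be tolerated. A small sketch of that arithmetic (the function names are made up for this example):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

A 4-member cluster tolerates the same single failure as a 3-member one, which is why going from 3 to 4 VMs buys no extra resilience.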

Control-plane

Servers

The control plane houses kube-apiserver, kube-controller-manager and kube-scheduler. For this guide kube-controller-manager and kube-scheduler are assumed to talk to the localhost kube-apiserver. If more than one control-plane node exists, those 2 components will perform elections over the API to decide which instance is the main one at any given point in time (detection and failover are automatic).

Depending on how critical it is to have the control plane always working, request 1 or 2 small VMs on the phabricator vm-requests project. Then use Ganeti to create those VMs.

Puppet/hiera

In our setup puppet roles are the way we instruct hiera to do lookups, but they don't have any functionality themselves (see Puppet_coding#Organization for a primer).

Create a new role for your nodes. The best way forward is to copy role::kubernetes::staging::master and set a proper system::role description. Something like the following should be good enough:

class role::foo::master {
    include profile::base::production
    include profile::base::firewall

    # Sets up kubernetes on the machine
    include profile::kubernetes::master
    include profile::docker::engine
    include profile::kubernetes::node
    include profile::calico::kubernetes

    system::role { 'kubernetes::master':
        description => 'Kubernetes master server',
    }
}

If you are going to have more than one control plane node, add profile::lvs::realserver to the list of profiles included.

The new cluster's configuration needs to be added to hieradata/common/kubernetes.yaml. Please see modules/k8s/types/clusterconfig.pp for the parameter documentation.

If you're creating a new cluster group (e.g. new purpose cluster), add the cluster group and your cluster config inside of it.

All of the above can be done in one patch, using the puppet compiler to check the result.

LVS

This is only needed if more than one control plane node has been created.

Follow LVS#Add a new load balanced service

Node (worker)

This setup is meant to provide (and does provide) a hands-off approach to node provisioning/reprovisioning/imaging etc. That is, from the moment the node is declared ready to be put in service and the puppet role (and respective hiera) has been applied, a single re-image should suffice for the node to register with the API and be ready to receive traffic.

Notes

The setup has only been tested with the specific partman recipe present in partman/custom/kubernetes-node-overlay.cfg.
docker is meant to be used as the container runtime engine (CRE). Other runtime engines aren't currently supported.
We now use the overlayfs docker storage driver for masters and workers.
The CNI of choice is Calico and it is deployed via a Kubernetes DaemonSet. A node component runs on every node and is the one providing connectivity to pods. Failure of that component means pods have no connectivity.

General Puppet/hiera setup

In our setup puppet roles are the way we instruct hiera to do lookups, but they don't have any functionality themselves (see Puppet_coding#Organization for a primer).

Create a new role for your nodes. The best way forward is to copy role::kubernetes::staging::worker and set a proper system::role description. Something like the following should be good enough:

class role::foo::worker {
    include profile::base::production
    include profile::base::firewall

    # Sets up docker on the machine
    include profile::docker::engine
    # (Optional) Setup dfdaemon and configure docker to use it
    #include profile::dragonfly::dfdaemon
    # Setup kubernetes stuff
    include profile::kubernetes::node
    # Setup calico
    include profile::calico::kubernetes
    # (Optional) Setup LVS
    #include profile::lvs::realserver

    system::role { 'foo::worker':
        description => 'foo worker node',
    }
}

In case you expect to expose services via LVS, add profile::lvs::realserver in the list of profiles you include.

If you want to use the Dragonfly p2p layer for pulling container images, include profile::dragonfly::dfdaemon.

As with control planes, the hiera config for nodes goes into hieradata/common/kubernetes.yaml. Please consult the parameter documentation in modules/k8s/types/clusterconfig.pp.

Access to restricted docker images

If your nodes need access to restricted docker images (see T273521 for context), you have to provide credentials for the docker registry to your nodes. This can be done by adding the hiera key profile::kubernetes::node::docker_kubernetes_user_password to the file hieradata/role/common/foo/worker.yaml in the private puppet repository.

See Docker-registry#Access_control on how to find the correct password.

Because of the way docker works, you will need to ensure a puppet run on all docker registry nodes after puppet has run on the kubernetes nodes with docker registry credentials set. See 672537 for details.
sudo cumin -b 2 -s 5 'A:docker-registry' 'run-puppet-agent -q'

Adding Nodes

For adding nodes (based on the generic setup described above) please follow Kubernetes/Clusters/Add_or_remove_nodes

After the re-image, the nodes will NOT be automatically added to the cluster if you have never applied helmfile.d/admin_ng; see Kubernetes/Clusters/New#Apply RBAC rules and PSPs. You only need to do that once.

Apply RBAC rules and PSPs

If you have your helmfile.d/admin_ng ready, you can apply at least the RBAC rules and Pod Security Policies:

Note: these commands need to be run as logged-in root (just prefixing them with sudo will not work).

$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e <my_cluster> -l name=rbac-rules sync
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e <my_cluster> -l name=pod-security-policies sync

After this stage your nodes will be registered with the API, but will not be ready to receive pods until you complete the networking steps described in the following sections.

Label Kubernetes Masters

Some clusters need specific node labels to identify roles, for example to mark the master nodes that are part of the control plane:

Note: these commands need to be run as logged-in root (just prefixing them with sudo will not work). As root, you may need to run kube_env admin somecluster as well

kubectl label nodes ml-serve-ctrl1001.eqiad.wmnet node-role.kubernetes.io/control-plane=""
kubectl label nodes ml-serve-ctrl1002.eqiad.wmnet node-role.kubernetes.io/control-plane=""

Due to https://github.com/kubernetes/kubernetes/issues/84912#issuecomment-551362981, we cannot add the above labels to the ones set by the Kubelet when registering the node, so this step needs to be done manually when bootstrapping the cluster. Labels in the kubernetes.io namespace cannot be added to Puppet afterwards on K8s 1.16; the kubelet will complain:

--node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the specifically allowed set (beta.kubernetes.io/arch, beta.kubernetes.io/instance-type, beta.kubernetes.io/os, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone, failure-domain.kubernetes.io/region, failure-domain.kubernetes.io/zone, kubernetes.io/arch, kubernetes.io/hostname, kubernetes.io/instance-type, kubernetes.io/os)
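
To see why node-role.kubernetes.io/control-plane is rejected, the rule described in the error message above can be modelled in a few lines. This sketch is built only from the text of that message; it is not the kubelet's actual validation code, and the function name is invented for the example.

```python
# Both collections are copied from the kubelet error message quoted above.
ALLOWED_PREFIXES = ("kubelet.kubernetes.io", "node.kubernetes.io")
ALLOWED_LABELS = {
    "beta.kubernetes.io/arch",
    "beta.kubernetes.io/instance-type",
    "beta.kubernetes.io/os",
    "failure-domain.beta.kubernetes.io/region",
    "failure-domain.beta.kubernetes.io/zone",
    "failure-domain.kubernetes.io/region",
    "failure-domain.kubernetes.io/zone",
    "kubernetes.io/arch",
    "kubernetes.io/hostname",
    "kubernetes.io/instance-type",
    "kubernetes.io/os",
}


def node_label_allowed(label_key: str) -> bool:
    """Approximate the --node-labels restriction described in the message."""
    namespace = label_key.split("/", 1)[0] if "/" in label_key else ""
    restricted = namespace == "kubernetes.io" or namespace.endswith(".kubernetes.io")
    if not restricted:
        # Labels outside the kubernetes.io namespace are not covered by the message.
        return True
    if label_key in ALLOWED_LABELS:
        return True
    return any(
        namespace == prefix or namespace.endswith("." + prefix)
        for prefix in ALLOWED_PREFIXES
    )


print(node_label_allowed("node-role.kubernetes.io/control-plane"))
print(node_label_allowed("kubernetes.io/hostname"))
```

The node-role.kubernetes.io namespace is inside kubernetes.io but matches neither an allowed prefix nor an entry in the allowed set, hence the manual kubectl label step.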

The labels will be useful for NetworkPolicies, for example to identify traffic coming from the master nodes towards a certain pod (likely a webhook).

Namespaces

In the main clusters, namespaces are populated in a pretty opinionated way, with limitRanges, resourceQuotas and tillers set up per namespace.

Namespaces are created using helmfile and the main clusters (production + staging) all share them, however they are overrideable per cluster. The main key is at helmfile.d/admin_ng. An example of augmenting it is at helmfile.d/admin_ng/staging

The same structure also holds limitRanges and resourceQuotas.

Creating them is done with the following command on the deployment host:

# remember to root-login via sudo -i first
helmfile -e staging-codfw -l name=namespaces sync

This means that if you don't want the main namespaces populated (which makes sense), your best bet is to skip running that command. Alternatively, override the main values for your cluster.

Networking

First of all, have a look at Network design to see how a DC (not a caching pop) is cabled network-wise. It will help you get an idea of what you are going to be doing in this section.

What we are going to do in this section is have the nodes talk BGP to the cr*-<site> core routers (aka the juniper routers) or the server's top of rack switch and vice versa (it's a bidirectional protocol).

Network side

Please reach out to Netops if you need any help.

Reserve BGP AS numbers in Netbox.
Send a patch to add an entry in this HOSTNAMES_TO_GROUPS dictionary https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#16
Send a patch to add a matching file in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/includes/bgp/
Once reviewed by Netops, deploy them using Homer (at this point it should be a NOOP)
Follow the Kubernetes/Clusters/Add or remove nodes#Step 1: Add node to BGP steps.
Don't worry about BGP alerts, the important bit is doing this step and the next one one after the other (to establish eBGP sessions).

Calico node/controllers

Now you can deploy all calico components

At this stage you can probably deploy the entire helmfile.d structure in 1 go but since RBAC/PSPs are already covered above we are going to just mention calico here.

# remember to root-login via sudo -i first
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng$ helmfile -e <my_cluster> -l name=calico-crds sync
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng$ helmfile -e <my_cluster> -l name=calico sync

Note: if the second command times out, you may need to sync the namespaces first.

There are dependencies between the 2, so you don't really need to go release by release at this level, but for clarity:

The CRDs (Custom Resource Definitions) are calico's way of storing its data in the Kubernetes API.
The calico release itself will setup a calico-node pod in every node with hostNetwork: true (that is it will not have its own IP address but rather share it with the host), 1 calico typha pod and 1 calico kube-controllers pod.

If this succeeds, you are almost ready to deploy workloads, but have a look at the 2 rather crucial cluster tools below.

To check if calico works:

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kube_env admin ml-serve-eqiad
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl -n kube-system get deployment,daemonset
NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           2m29s
deployment.apps/calico-typha              1/1     1            1           2m29s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   4         4         4       4            4           kubernetes.io/os=linux   2m29s
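
If you prefer a scripted check over reading the table, the same information is available from kubectl -n kube-system get deployment,daemonset -o json. The sketch below compares ready and desired counts using the standard Deployment and DaemonSet status fields; the function name and the embedded sample document are made up for illustration, and the sample deliberately shows one calico-node pod that is not ready.

```python
import json


def not_ready(kubectl_list: dict) -> list:
    """Return a description of every Deployment/DaemonSet with missing pods."""
    pending = []
    for item in kubectl_list.get("items", []):
        kind = item["kind"]
        name = item["metadata"]["name"]
        status = item.get("status", {})
        if kind == "Deployment":
            desired = item.get("spec", {}).get("replicas", 1)
            ready = status.get("readyReplicas", 0)
        elif kind == "DaemonSet":
            desired = status.get("desiredNumberScheduled", 0)
            ready = status.get("numberReady", 0)
        else:
            continue
        if ready < desired:
            pending.append(f"{kind}/{name}: {ready}/{desired} ready")
    return pending


# Hand-written sample standing in for real `kubectl ... -o json` output.
SAMPLE = """
{"items": [
  {"kind": "Deployment", "metadata": {"name": "calico-typha"},
   "spec": {"replicas": 1}, "status": {"readyReplicas": 1}},
  {"kind": "DaemonSet", "metadata": {"name": "calico-node"},
   "status": {"desiredNumberScheduled": 4, "numberReady": 3}}
]}
"""

for line in not_ready(json.loads(SAMPLE)):
    print(line)
```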

And if you want to check on the routers, ssh to one of them (like cr1-eqiad.wikimedia.org) and run the following:

$ show bgp neighbor
[..]
  Description: ml-serve1002             
  Group: Kubemlserve4          Routing-Instance: master
  Forwarding routing-instance: master  
  Type: External    State: Established    Flags: <Sync>
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
[..]

You should see an established session for all the k8s workers of your cluster.
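
With many workers, scanning the router output by eye gets tedious. The following rough sketch pairs each Description line with the State on the lines that follow it. It assumes the output layout shown in the excerpt above and has not been validated against full router output; the function names are invented for the example.

```python
import re

DESCRIPTION_RE = re.compile(r"^\s*Description:\s*(\S+)")
# Match "State:" but not the "Last State:" field on the following line.
STATE_RE = re.compile(r"(?<!Last )State:\s*(\S+)")


def bgp_states(output: str) -> dict:
    """Map each neighbor description to its session state."""
    states = {}
    current = None
    for line in output.splitlines():
        description = DESCRIPTION_RE.search(line)
        if description:
            current = description.group(1)
            continue
        state = STATE_RE.search(line)
        if state and current is not None:
            states[current] = state.group(1)
            current = None
    return states


def not_established(output: str) -> list:
    """List neighbors whose session is in any state other than Established."""
    return [name for name, state in bgp_states(output).items() if state != "Established"]


# Excerpt copied from the example output above.
SAMPLE = """\
  Description: ml-serve1002
  Group: Kubemlserve4          Routing-Instance: master
  Forwarding routing-instance: master
  Type: External    State: Established    Flags: <Sync>
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
"""

print(bgp_states(SAMPLE))
print(not_established(SAMPLE))
```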

Puppet

If you are adding a new cluster, you'll probably need to add the configuration needed to update firewall rules for various services (for example, to allow the new pod IPs to contact services in production). Follow https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933 to figure out what to add.

Cluster tools

CoreDNS

CoreDNS is the deployment and service that provides outgoing DNS resolution to pods as well as internal DNS discovery. In our setup it is, on purpose, NOT used by deployments that are hostNetwork: true (e.g. calico-node).

Assuming you populated helmfile.d/admin_ng/values/<cluster>/, it can be deployed with:

helmfile -e <mycluster> -l name=coredns sync

To check that everything is up and running as expected:

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kube_env admin ml-serve-eqiad
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl -n kube-system get deployment,daemonset
NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           15m
deployment.apps/calico-typha              1/1     1            1           15m
deployment.apps/coredns                   4/4     4            4           49s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   4         4         4       4            4           kubernetes.io/os=linux   15m

After the deployment of the coredns pods, you are free to merge a change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/673985 to configure the coredns service IP on all kubelets/workers.

Eventrouter

Eventrouter aggregates kubernetes events and sends them to logstash.

Deploy it on the deployment host with:

# remember to root-login via sudo -i first
helmfile -e <mycluster> -l name=eventrouter sync

To check that everything is up and running as expected, see what's done for coredns.

Istio

For Istio Ingress gateways, the istio_gateways feature toggle needs to be set to true in helmfile.d/admin_ng/helmfile.yaml. That will deploy the required Namespace and NetworkPolicies.

For Istio Sidecar proxies (TLS mesh etc..), the istio_sidecar_proxy feature toggle needs to be set to true in helmfile.d/admin_ng/helmfile.yaml. That will deploy the required NetworkPolicies.

Istio itself needs to be deployed afterwards using istioctl, see Kubernetes/Ingress.

# remember to root-login via sudo -i first
# Sync networkpolicies first
helmfile -e <my-cluster> -l name=istio-gateways-networkpolicies sync

# If your cluster uses the proxy sidecars / mesh
helmfile -e <my-cluster> -l name=istio-proxy-settings sync

# Then use istioctl to deploy the Istio configs
kube_env admin <my-cluster>
istioctl-X.X manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/<your-cluster>/config.yaml

cert-manager

For automatic certificate management, enable the feature toggle install_cert_manager in helmfile.d/admin_ng/helmfile.yaml. Please also read Kubernetes/cert-manager for additional requirements in PKI infrastructure, private puppet etc.

The namespace_certificates feature toggle can be used together with Istio-Ingressgateway to create a default Certificate per Namespace (see Kubernetes/cert-manager#Istio-Ingressgateway)

# remember to root-login via sudo -i first
helmfile -e <mycluster> -l name=cert-manager-networkpolicies sync
helmfile -e <mycluster> -l name=cert-manager sync
helmfile -e <mycluster> -l name=cfssl-issuer-crds sync
helmfile -e <mycluster> -l name=cfssl-issuer sync

# If Istio TLS certificates are issued from cert-manager
helmfile -e <mycluster> -l name=namespace-certificates sync

Prometheus

Before enabling scraping you will need to create the LVM volumes manually (see LVM creation below).

Prometheus talks to the api and discovers the API server, nodes, pods, endpoints and services. In WMF we only scrape the API server, the nodes and the pods. We have 2 nodes per DC doing the scraping. Those will need to be properly configured to scrape the new cluster.

The configuration is done in the prometheus key (see modules/k8s/types/clusterconfig/prometheus.pp for the parameter documentation) of hieradata/common/kubernetes.yaml.

You just need to specify a port (count up, puppet will fail if it's already used) and the names of your puppet roles for nodes and control planes (for service discovery).

After running puppet on the prometheus nodes, please verify on them that the new systemd units are working as expected. Once done, you can follow up with:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/674279 (This requires a reload for apache2 on the prometheus nodes to pick up the new config. Please sync with Observability before doing anything).
https://gerrit.wikimedia.org/r/c/operations/puppet/+/674313

The first code review also needs a puppet run on the grafana nodes to pick up the new config. Once done, you should be able to see the new cluster in the Kubernetes Grafana dashboards!

LVM creation

This is unfortunately currently manual, requiring the creation of lvm volumes on multiple prometheus nodes (O:prometheus in cumin):

prometheus100[5,6].eqiad.wmnet
prometheus200[5,6].codfw.wmnet

Please follow up with a member of the Observability team first to let them know what you are doing, so they are aware.

Edit modules/prometheus/files/provision-fs.sh in puppet to add your instance and its initial volume size, merge, run puppet and then the script on the hosts above.