Kubernetes/Deployments

This page is strictly about day to day deployments, not first deployment of a service. Look in Deployment pipeline for first deployments.

Deployments on kubernetes happen using helmfile.

Deploying with helmfile

Code deployment/configuration changes

Note that both new code deployments as well as configuration changes are considered a deployment!

Clone deployment-charts repo.
Using your editor modify under the helmfile.d folder of the service you want to modify. As an example, myservice deployment lives under deployment-charts/helmfile.d/services/myservice. Most of the changes are usually made on the values.yaml and the values-*.yaml files to tune the deployment parameters.
If you need to update or add a secret like a password or a certificate ask an SRE to commit it into the private puppet repo do not commit secrets in deployment-charts repo.
Make a CR and after a successful review merge it. Note: SRE may offer +1 to your patch and that is sufficient to self-merge and deploy (see the notes about deployment changes in https://www.mediawiki.org/wiki/Gerrit/Privilege_policy#Merging_without_review)
After merge, log in to a deployment server, there is a cron (1 minute) that will update the /srv/deployment-charts directory with the contents from git.
Go to /srv/deployment-charts/helmfile.d/services/${SERVICE} where SERVICE is the name of your service i.e myservice.
execute helmfile -e ${CLUSTER} -i apply --context 5, where $CLUSTER is the k8s cluster you're operating on, currently one of staging, eqiad and codfw. This will show the changes that will be applied on the cluster and prompt you to confirm. The --context 5 flag allows for a more compact diff (by default it will display the whole rendered resources). It will then materialize the previous diff in the cluster and will log the change into SAL.
Consider that the diff generated by helmfile may contain sensitve information like passwords and API keys. Use caution when sharing the output.
all done!

In case there are multiple releases of your service in the same helmfile, you can use the --selector name=RELEASE_NAME option, e.g. helmfile -e $CLUSTER --selector name=test -i apply --context 5.

Release breaking changes

In some cases (TODO: add a list of cases?), changes to your chart will not be able to be applied by helmfile. In these cases, you will need to destroy and recreate the deployments, which involves doing a little DNS pooling dance.

The following instructions require production root access to complete. Only Wikimedia SREs or equivalent users can follow this process.

Depool your service from codfw sudo cookbook sre.discovery.service-route depool codfw service-foo
Watch your dashboards and wait for traffic to die out
Destroy your services codfw deployments cd /srv/deployment-charts/helmfile.d/services/service-foo; helmfile -e codfw -i destroy
Wait for a bit. helmfile destroyreturns before all actions are done on the kubernetes cluste. You will encounter the following error or similar if you recreate the deployments too soon: Error: release production failed, and has been uninstalled due to atomic being set: Service "service-foo-production-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: $service-port: provided port is already allocated
Recreate your deployment in codfw: helmfile -e codfw -i apply --context 5
Repool your service in codfw: sudo cookbook sre.discovery.service-route pool codfw service-foo
Watch traffic come back to codfw, then GOTO1 for eqiad

Seeing the current status

This is done using helmfile

Change directory to /srv/deployment-charts/helmfile.d/services/${SERVICE} on a deployment server
Unless you are mid un-applied changes the current values files should reflect the deployed values
You can check for unapplied changes with: helmfile -e $CLUSTER diff --context 5 (again, the --context option allows you to tune the amount of context surrounding your changes)
You can see the status with helmfile -e $CLUSTER status

Rolling back changes

If you need to roll back a change because something went wrong:

Revert the git commit to the deployment-charts repo
Merge the revert (with review if needed)
Wait one minute for the cron job to pull the change to the deployment server
Change directory to /srv/deployment-charts/helmfile.d/services/${SERVICE} where SERVICE is the name of your service
execute helmfile -e $CLUSTER diff --context 5
execute helmfile -e $CLUSTER apply

Rolling back in an emergency

If you can't wait the one minute, or the cron job to update from git fails etc. then it is possible to manually roll back using helm. This is discouraged over using helmfile though.

The following instructions require production root access to complete. Only Wikimedia SREs or equivalent users can follow this process.

Find the revision to roll back to

sudo -i
kube_env admin $CLUSTER; helm3 -n $SERVICE history $RELEASE
Find the revision to roll back to

e.g. perhaps the penultimate one

REVISION        UPDATED                         STATUS          CHART           DESCRIPTION     
1               Tue Jun 18 08:39:20 2019        SUPERSEDED      termbox-0.0.2   Install complete
2               Wed Jun 19 08:20:42 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
3               Wed Jun 19 10:33:34 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
4               Tue Jul  9 14:21:39 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete

Rollback with (still sudo -i): kube_env admin $CLUSTER; helm3 rollback -n $SERVICE $RELEASE 3

Rolling restart

If you want to force all PODs of your deployment to restart, you can use the roll_restart parameter during deployment with helmfile:

helmfile -e $CLUSTER --state-values-set roll_restart=1 sync

Undeploy/delete a release

You may undeploy/delete your service completely using:

helmfile -e $CLUSTER destroy

If you want to undeploy/delete just a specific release of your service, use a selector like:

helmfile -e $CLUSTER --selector name=$RELEASE_NAME destroy

Advanced use cases: using kubeconfig

If you need to use kubeconfig (for a port-forward or to get logs for debugging) you can execute kube_env $SERVICE $CLUSTER; kubectl COMMAND, e.g. kube_env myservice staging; kubectl logs POD_NAME -c CONTAINER_NAME for logs.

Advanced use cases: using helm

Sometimes you might need to use helm, this is completely discouraged use it only at your own risk and in emergencies. It assumes that you know what you are doing using helm.

kube_env <service> <cluster>
helm <command>

Example:

akosiaris@deploy1002:~$ kube_env mathoid eqiad
akosiaris@deploy1002:~$ helm list
NAME      	REVISION	UPDATED                 	STATUS  	CHART         	APP VERSION	NAMESPACE
production	1       	Tue Mar 23 10:37:50 2021	DEPLOYED	mathoid-0.0.35	           	mathoid   
akosiaris@deploy1002:~$ helm status
Error: release name is required
akosiaris@deploy1002:~$ helm status production
LAST DEPLOYED: Tue Mar 23 10:37:50 2021
NAMESPACE: mathoid
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME                                    DATA  AGE
config-production                       1     26d
mathoid-production-envoy-config-volume  1     26d
mathoid-production-tls-proxy-certs      2     26d
production-metrics-config               1     26d
==> v1/Deployment
NAME                READY  UP-TO-DATE  AVAILABLE  AGE
mathoid-production  30/30  30          30         26d
==> v1/NetworkPolicy
NAME                POD-SELECTOR                    AGE
mathoid-production  app=mathoid,release=production  26d 
==> v1/Pod(related)
NAME                                 READY  STATUS   RESTARTS  AGE
mathoid-production-64787b97c5-24pzw  3/3    Running  0         26d
...
mathoid-production-64787b97c5-z74n2  3/3    Running  0         26d
==> v1/Service
NAME                            TYPE      CLUSTER-IP    EXTERNAL-IP  PORT(S)          AGE
mathoid-production              NodePort  10.64.72.227  <none>       10044:10042/TCP  26d
mathoid-production-tls-service  NodePort  10.64.72.35   <none>       4001:4001/TCP    26d

When `helmfile apply` Does Nothing

In T347521, an application was in a state where kubectl get pod and kubectl get deploy showed no resources, but helmfile apply did nothing. Looking with kubectl get networkpolicy, we were able to see that the application was in a partially-deployed state. Running helmfile destroy and helmfile apply was enough to recover the application.

`helmfile destroy` without root permissions

If you need to destroy a release, you can use kube_env to become the deploy user (might require global root/SRE-level access, please update if so). kube_env ${namespace}-deploy ${env} ˆ. This will allow you to destroy a release with the minimum amount of permissions.

Cheatsheet

Deploying a change

+2 your change on https://gerrit.wikimedia.org/g/operations/deployment-charts and wait for Jenkins to merge it.
Login to the active deployment server: $ ssh deployment.eqiad.wmnet

Apply the helm chart to all 3 clusters:

$ cd /srv/deployment-charts/helmfile.d/services/${SERVICE}
$ helmfile -e staging -i apply --context 5
$ helmfile -e eqiad -i apply --context 5
$ helmfile -e codfw -i apply --context 5

Rendering the helmfile file itself

Usually, a chart helmfile.yaml file includes some templating logic, which can, like any logic, induce bugs. To inspect the fully rendered version of the helmfile, you can use helmfile -e $CLUSTER build.

rouberol@deploy1002:/srv/deployment-charts/helmfile.d/dse-k8s-services/superset-next$ helmfile -e dse-k8s-eqiad build
helmfile.yaml: basePath=.
---
#  Source: helmfile.yaml

filepath: helmfile.yaml
helmBinary: helm3
environments:
  dse-k8s-eqiad:
    values:
    - releases:
      - staging
    missingFileHandler: Warn
helmDefaults:
  tillerless: false
  args:
  - --kubeconfig
  - /etc/kubernetes/superset-next-deploy-dse-k8s-eqiad.config
  verify: false
  devel: false
  wait: false
  timeout: 600
  recreatePods: false
  force: false
  atomic: true
...

Rendering the final values overlay

In the helmfile, we define an order of precedence for values files, from lowest to highest. These values are then "overlaid" on top of each other into a single value file, which is then passed to the chart for rendering. Sometimes, we make typos, or a value isn't indented enough, which results in chart rendering issues (maybe the wrong value is passed to the chart). If you're facing such a bug, you can use the helmfile -e $CLUSTER write-values command to generate the final overlaid values file, that you can then inspect for mistakes.

brouberol@deploy1002:/srv/eployment-charts/helmfile.d/dse-k8s-services/superset-next$ sudo helmfile -e dse-k8s-eqiad write-values
helmfile.yaml: basePath=.
skipping missing values file matching "values.yaml"
Writing values file helmfile-638f63f0/staging.yaml # one file per release
helmfile.yaml: basePath=.
brouberol@deploy1002:/srv/deployment-charts/helmfile.d/dse-k8s-services/superset-next$ sudo cat helmfile-638f63f0/staging.yaml | head -n 20
_dbstore_egress_ports:
  ports:
  - port: 3320
    protocol: tcp
  - port: 3350
    protocol: tcp
_mariadb_egress_ports:
  ports:
  - port: 3306
    protocol: tcp
app:
  version: ea12de870b81a7c735701ae0e23d0f416fb2bfc9-production-backend
assets:
  version: ea12de870b81a7c735701ae0e23d0f416fb2bfc9-production-frontend
common_images:
  httpd:
    exporter: prometheus-apache-exporter:0.0.3-20231015
  kerberos:
    version: 45377f59c5bdf8bae1b967c49ee29a144c5cba44-production
  mcrouter:
...

Note that this will create a directory containing a flat values file per helmfile release. Don't forget to remove it afterwards, as /srv/deployment-charts is a git repository.

Rendering a chart

To render a chart using its associated helmfile, you can simply run helmfile -e $CLUSTER template.

brouberol@deploy1002:/srv/deployment-charts/helmfile.d/dse-k8s-services/superset-next$ helmfile -e dse-k8s-eqiad template | head -n 20
helmfile.yaml: basePath=.
skipping missing values file matching "values.yaml"
Templating release=staging, chart=wmf-stable/superset
---
# Source: superset/templates/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: superset-staging
  labels:
    app: superset
    chart: superset-0.0.19
    release: staging
    heritage: Helm
spec:
  podSelector:
    matchLabels:
      app: superset
      release: staging
  policyTypes:
    - Egress
    - Ingress
  ingress:
helmfile.yaml: basePath=.