Kubernetes/Deployments

This page is strictly about day-to-day deployments, not the first deployment of a service. See Deployment pipeline for first deployments.

Deployments on Kubernetes happen using helmfile.

Deploying with helmfile

Code deployment/configuration changes

Note that new code deployments as well as configuration changes both count as deployments!

  1. Clone deployment-charts repo.
  2. Using your editor, modify the files under the helmfile.d folder of the service you want to change. As an example, the myservice deployment lives under deployment-charts/helmfile.d/services/myservice. Most changes are made to the values.yaml and values-*.yaml files to tune the deployment parameters.
  3. If you need to update or add a secret like a password or a certificate, ask an SRE to commit it into the private puppet repo; do not commit secrets to the deployment-charts repo.
  4. Make a CR and, after a successful review, merge it. Note: an SRE may offer a +1 to your patch, and that is sufficient to self-merge and deploy (see the notes about deployment changes in https://www.mediawiki.org/wiki/Gerrit/Privilege_policy#Merging_without_review)
  5. After merging, log in to a deployment server; a cron job (running every minute) updates the /srv/deployment-charts directory with the contents from git.
  6. Go to /srv/deployment-charts/helmfile.d/services/${SERVICE}, where SERVICE is the name of your service, e.g. myservice.
  7. Execute helmfile -e ${CLUSTER} -i apply --context 5, where $CLUSTER is the k8s cluster you're operating on, currently one of staging, eqiad and codfw. This will show the changes that will be applied to the cluster and prompt you to confirm. The --context 5 flag produces a more compact diff (by default the whole rendered resources are displayed). helmfile will then materialize the diff in the cluster and log the change to SAL.
    Consider that the diff generated by helmfile may contain sensitive information like passwords and API keys. Use caution when sharing the output.
  8. All done!

In case there are multiple releases of your service in the same helmfile, you can use the --selector name=RELEASE_NAME option, e.g. helmfile -e $CLUSTER --selector name=test -i apply --context 5.
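
As a worked example, a configuration change for a hypothetical myservice (the service name is illustrative), applied to all three clusters, would look like:

    $ cd /srv/deployment-charts/helmfile.d/services/myservice
    $ helmfile -e staging -i apply --context 5   # review the diff, then confirm the prompt
    $ helmfile -e eqiad -i apply --context 5
    $ helmfile -e codfw -i apply --context 5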

Release breaking changes

In some cases (TODO: add a list of cases?), changes to your chart cannot be applied by helmfile. In these cases you will need to destroy and recreate the deployments, which involves doing a little DNS pooling dance.

The following instructions require production root access to complete. Only Wikimedia SREs or equivalent users can follow this process.
  1. Depool your service from codfw: sudo cookbook sre.discovery.service-route depool codfw service-foo
  2. Watch your dashboards and wait for traffic to die out
  3. Destroy your service's codfw deployments: cd /srv/deployment-charts/helmfile.d/services/service-foo; helmfile -e codfw -i destroy
  4. Wait for a bit. helmfile destroy returns before all actions are done on the kubernetes cluster. You will encounter the following error or similar if you recreate the deployments too soon: Error: release production failed, and has been uninstalled due to atomic being set: Service "service-foo-production-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: $service-port: provided port is already allocated
  5. Recreate your deployment in codfw: helmfile -e codfw -i apply --context 5
  6. Repool your service in codfw: sudo cookbook sre.discovery.service-route pool codfw service-foo
  7. Watch traffic come back to codfw, then repeat from step 1 for eqiad (a consolidated sketch of the whole sequence follows below)
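
The whole dance for one datacenter, condensed into commands, using the hypothetical service-foo from the steps above:

    $ sudo cookbook sre.discovery.service-route depool codfw service-foo
    # wait for traffic to drain, watching your dashboards
    $ cd /srv/deployment-charts/helmfile.d/services/service-foo
    $ helmfile -e codfw -i destroy
    # wait a bit for the kubernetes resources to actually be removed
    $ helmfile -e codfw -i apply --context 5
    $ sudo cookbook sre.discovery.service-route pool codfw service-foo
    # once traffic is back, repeat the same steps for eqiad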

Seeing the current status

This is also done using helmfile.

  1. Change directory to /srv/deployment-charts/helmfile.d/services/${SERVICE} on a deployment server
  2. Unless you have un-applied changes in flight, the current values files should reflect the deployed values
  3. You can check for unapplied changes with: helmfile -e $CLUSTER diff --context 5 (again, the --context option allows you to tune the amount of context surrounding your changes)
  4. You can see the status with helmfile -e $CLUSTER status
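
For example, to check a hypothetical myservice in eqiad:

    $ cd /srv/deployment-charts/helmfile.d/services/myservice
    $ helmfile -e eqiad diff --context 5
    $ helmfile -e eqiad status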

Rolling back changes

If you need to roll back a change because something went wrong:

  1. Revert the git commit to the deployment-charts repo
  2. Merge the revert (with review if needed)
  3. Wait one minute for the cron job to pull the change to the deployment server
  4. Change directory to /srv/deployment-charts/helmfile.d/services/${SERVICE} where SERVICE is the name of your service
  5. Execute helmfile -e $CLUSTER diff --context 5 to review what the revert will change
  6. Execute helmfile -e $CLUSTER apply
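
Assuming the revert has been merged and pulled to the deployment server, rolling back a hypothetical myservice in eqiad looks like:

    $ cd /srv/deployment-charts/helmfile.d/services/myservice
    $ helmfile -e eqiad diff --context 5
    # confirm the diff corresponds to the revert, then
    $ helmfile -e eqiad apply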

Rolling back in an emergency

If you can't wait the one minute, or the cron job that updates from git fails, etc., then it is possible to manually roll back using helm. This is discouraged in favor of using helmfile, though.

The following instructions require production root access to complete. Only Wikimedia SREs or equivalent users can follow this process.
  1. Find the revision to roll back to
    1. sudo -i
    2. kube_env admin $CLUSTER; helm3 -n $SERVICE history $RELEASE
    3. Pick the revision to roll back to from the history output, e.g. the penultimate one:
      REVISION        UPDATED                         STATUS          CHART           DESCRIPTION     
      1               Tue Jun 18 08:39:20 2019        SUPERSEDED      termbox-0.0.2   Install complete
      2               Wed Jun 19 08:20:42 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      3               Wed Jun 19 10:33:34 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      4               Tue Jul  9 14:21:39 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      
  2. Roll back with (still as root from sudo -i): kube_env admin $CLUSTER; helm3 rollback -n $SERVICE $RELEASE 3
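
Putting the above together for the termbox history example (cluster, release name and revision number are illustrative):

    sudo -i
    kube_env admin eqiad
    helm3 -n termbox history production
    helm3 rollback -n termbox production 3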

Rolling restart

If you want to force all pods of your deployment to restart, you can use the roll_restart parameter during deployment with helmfile:

helmfile -e $CLUSTER --state-values-set roll_restart=1 sync
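
For example, to roll-restart a hypothetical myservice in codfw:

    $ cd /srv/deployment-charts/helmfile.d/services/myservice
    $ helmfile -e codfw --state-values-set roll_restart=1 sync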

Undeploy/delete a release

You may undeploy/delete your service completely using:

helmfile -e $CLUSTER destroy

If you want to undeploy/delete just a specific release of your service, use a selector like:

helmfile -e $CLUSTER --selector name=$RELEASE_NAME destroy

Advanced use cases: using kubeconfig

If you need to use kubeconfig (for a port-forward or to get logs for debugging), you can execute kube_env $SERVICE $CLUSTER; kubectl COMMAND, e.g. kube_env myservice staging; kubectl logs POD_NAME -c CONTAINER_NAME for logs.
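
For instance, a port-forward for local debugging might look like the following (the pod name and ports are illustrative; kubectl port-forward is a standard kubectl command):

    $ kube_env myservice staging
    $ kubectl get pods
    $ kubectl port-forward myservice-production-64787b97c5-24pzw 8080:8080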

Advanced use cases: using helm

Sometimes you might need to use helm directly. This is strongly discouraged; use it only at your own risk and in emergencies, and it assumes that you know what you are doing with helm.

  • kube_env <service> <cluster>
  • helm <command>

Example:

akosiaris@deploy1002:~$ kube_env mathoid eqiad
akosiaris@deploy1002:~$ helm list
NAME      	REVISION	UPDATED                 	STATUS  	CHART         	APP VERSION	NAMESPACE
production	1       	Tue Mar 23 10:37:50 2021	DEPLOYED	mathoid-0.0.35	           	mathoid   
akosiaris@deploy1002:~$ helm status
Error: release name is required
akosiaris@deploy1002:~$ helm status production
LAST DEPLOYED: Tue Mar 23 10:37:50 2021
NAMESPACE: mathoid
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME                                    DATA  AGE
config-production                       1     26d
mathoid-production-envoy-config-volume  1     26d
mathoid-production-tls-proxy-certs      2     26d
production-metrics-config               1     26d
==> v1/Deployment
NAME                READY  UP-TO-DATE  AVAILABLE  AGE
mathoid-production  30/30  30          30         26d
==> v1/NetworkPolicy
NAME                POD-SELECTOR                    AGE
mathoid-production  app=mathoid,release=production  26d 
==> v1/Pod(related)
NAME                                 READY  STATUS   RESTARTS  AGE
mathoid-production-64787b97c5-24pzw  3/3    Running  0         26d
...
mathoid-production-64787b97c5-z74n2  3/3    Running  0         26d
==> v1/Service
NAME                            TYPE      CLUSTER-IP    EXTERNAL-IP  PORT(S)          AGE
mathoid-production              NodePort  10.64.72.227  <none>       10044:10042/TCP  26d
mathoid-production-tls-service  NodePort  10.64.72.35   <none>       4001:4001/TCP    26d

When `helmfile apply` Does Nothing

In T347521, an application was in a state where `kubectl get pod` and `kubectl get deploy` showed no resources, but `helmfile apply` did nothing. Looking with `kubectl get networkpolicy`, we were able to see that the application was in a partially-deployed state. Running `helmfile destroy` and `helmfile apply` was enough to recover the application.
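
A sketch of that diagnosis and recovery, for a hypothetical myservice in eqiad:

    $ kube_env myservice eqiad
    $ kubectl get pod            # no resources found
    $ kubectl get deploy         # no resources found
    $ kubectl get networkpolicy  # leftover resources: partially-deployed state
    $ cd /srv/deployment-charts/helmfile.d/services/myservice
    $ helmfile -e eqiad -i destroy
    $ helmfile -e eqiad -i apply --context 5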


Cheatsheet

Deploying a change

  1. +2 your change on https://gerrit.wikimedia.org/g/operations/deployment-charts and wait for Jenkins to merge it.
  2. Log in to the active deployment server: $ ssh deployment.eqiad.wmnet
  3. Apply the helm chart to all 3 clusters:
    $ cd /srv/deployment-charts/helmfile.d/services/${SERVICE}
    $ helmfile -e staging -i apply --context 5
    $ helmfile -e eqiad -i apply --context 5
    $ helmfile -e codfw -i apply --context 5
    

See also