Kubernetes/Enabling TLS

From Wikitech

We use envoy to provide TLS termination for services. It is installed as a sidecar in each pod and functions as a reverse proxy to the app. We intend to use it at some point as an initiation point for TLS as well, but that's down the road.

Add support to the chart

TLS support already exists; it has been split off from the charts and lives in common_templates/. Feel free to look at other charts and copy their approach. The basics are:

  • Symlink from the chart to those (helm package will resolve them correctly)
  • Amend the chart to use them
  • Add a values file to the .fixtures/ directory, so CI can test the chart with TLS enabled
  • Define a proper "upstream_timeout" for envoy to use. Current default is 60s
  • Use the most recent image version (https://docker-registry.wikimedia.org/envoy/tags/)
  • Choose a new TCP port.
    • Update the Service ports to reflect it.
    • Make it configurable so we can change it without touching the chart
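
As a rough illustration of the checklist above, the following Python sketch inspects an already-parsed values fixture for the TLS-related settings. The key layout used here (tls.enabled, tls.public_port, tls.upstream_timeout nested under a single tls section) is an assumption made for this example; check the templates in common_templates/ for the names the chart really uses.

```python
from typing import Any


def check_tls_values(values: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a parsed values fixture.

    The key names below are hypothetical; adjust them to the chart.
    """
    tls = values.get("tls")
    if not isinstance(tls, dict):
        return ["no 'tls' section found"]

    problems: list[str] = []
    if tls.get("enabled") is not True:
        problems.append("tls.enabled is not true, so CI will not exercise TLS")

    port = tls.get("public_port")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("tls.public_port is not a valid TCP port")

    if "upstream_timeout" not in tls:
        problems.append("tls.upstream_timeout is unset, so the 60s default applies")
    return problems
```

An empty list means the fixture covers the three points checked here; it says nothing about whether the chart templates actually consume those values.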

Create and place certificates

  • Patch the helm chart to add the relevant stanzas. Remember to package the chart and reindex before merging your patch
  • Assuming you've guarded the TLS addition, do a noop deployment to verify you didn't change something fundamental
  • For staging deployments, certificates for staging.svc.eqiad.wmnet and staging.svc.codfw.wmnet are provided by default. You may of course override them if you need to
  • Add the relevant production certificate to puppet's private repo:
    • edit /srv/private/modules/secret/secrets/certificates/certificate.manifests.d/kube_services.certs.yaml and add a stanza for your service. It should closely mimic the existing ones and should at least have the following alt_names:
      • $SERVICE_NAME.discovery.wmnet
      • $SERVICE_NAME.svc.codfw.wmnet
      • $SERVICE_NAME.svc.eqiad.wmnet
      • $SERVICE_NAME-main-tls-service.$NAMESPACE.svc.cluster.local
    • DO NOT SET A PASSWORD. Using a password results in an encrypted key file, which envoyproxy can't use.
    • run cergen -c "$SERVICE_NAME.*" --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d to see whether the right certificates would be generated; then run it again with --generate added to create the certificate
    • ONLY IF YOU SET A KEY PASSWORD, do the following: we need the unencrypted key, so create it with openssl ec -in modules/secret/secrets/certificates/$CERT_NAME/$CERT_NAME.key.private.pem -out modules/secret/secrets/certificates/$CERT_NAME/$CERT_NAME.key.private.unencrypted.pem. You will be prompted for the password (the one you set up in cergen)
    • Commit all the generated files to git
    • edit /srv/private/hieradata/role/common/deployment_server/kubernetes.yaml to add it to the appropriate place there, for all production environments:
            tls: &blubberoid_certs
                # NOTE: If you set a password, use the $CERT_NAME.key.private.unencrypted.pem file you created instead.
                key: "secret(certificates/$CERT_NAME/$CERT_NAME.key.private.pem)"
                cert: "secret(certificates/$CERT_NAME/$CERT_NAME.crt.pem)"
            tls: *blubberoid_certs
    • commit all your changes
  • Run puppet on the deployment hosts and verify that the data written to /etc/helmfile-defaults/private/$SERVICE_NAME/{staging,eqiad,codfw}.yaml is correct
  • Add the rest of the configuration for TLS enablement in deployment-charts under helmfile.d/services/$SERVICE_NAME/values*.yaml
  • Happy helming!
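
Two of the steps above are easy to get wrong: the list of alt_names and the requirement that the key file is not password-protected. The Python sketch below is one possible way to double-check both before committing. The function names are made up for this example, and the encryption check is a heuristic based on the PEM headers OpenSSL writes, not a full parse of the key.

```python
def expected_alt_names(service: str, namespace: str) -> list[str]:
    """Build the minimum set of alt_names listed above for one service."""
    return [
        f"{service}.discovery.wmnet",
        f"{service}.svc.codfw.wmnet",
        f"{service}.svc.eqiad.wmnet",
        f"{service}-main-tls-service.{namespace}.svc.cluster.local",
    ]


def pem_key_is_encrypted(pem_text: str) -> bool:
    """Heuristically detect a password-protected PEM private key.

    PKCS#8 encrypted keys use an 'ENCRYPTED PRIVATE KEY' label, while
    traditional OpenSSL keys carry a 'Proc-Type: 4,ENCRYPTED' header.
    """
    return (
        "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text
        or "Proc-Type: 4,ENCRYPTED" in pem_text
    )
```

If pem_key_is_encrypted returns True for the contents of $CERT_NAME.key.private.pem, envoy will not be able to load that file and the unencrypted copy described above is needed.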

Deploy the new chart version that has TLS support

Running helmfile sync/apply in all of the clusters (staging, codfw, eqiad) should cover this. The documentation could use some love, but we already have Deployments on kubernetes.

Enable the TLS support

Create a Gerrit change that switches tls.enabled to true, perhaps one cluster at a time, and deploy it.
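
Once TLS is switched on, one way to confirm that the sidecar serves the expected certificate is to complete a TLS handshake and list the DNS names in the certificate. The following is a minimal sketch using only the Python standard library; the host, port and CA bundle path passed to it are placeholders you have to supply, and nothing here is specific to envoy.

```python
import socket
import ssl


def dns_names(cert: dict) -> list[str]:
    """Extract DNS subjectAltName entries from an ssl getpeercert() dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def fetch_dns_names(host: str, port: int, cafile: str, timeout: float = 5.0) -> list[str]:
    """Connect over TLS, verify the chain against cafile, and return the DNS SANs."""
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname enables SNI and hostname verification.
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            return dns_names(tls_sock.getpeercert())
```

Calling fetch_dns_names with the service hostname, the newly chosen TLS port and the path to the CA bundle that signed the certificate either returns the served names or raises ssl.SSLCertVerificationError when verification fails.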

Create a new LVS service for the TLS-enabled service

Follow LVS#Add_a_new_load_balanced_service to create a new LVS service on your newly chosen port, but on the same LVS IP as the previous one.

Switch traffic, aka switch configuration of dependent services to use the new LVS service

Things that might need to be changed:

Things to be mindful of:

  • CPU and memory limits of the envoy sidecar container when more traffic starts hitting the new LVS service.

Remove the old LVS service

For this we follow the inverse of the process used to create the new LVS service. There is already a runbook at LVS#Remove_a_load_balanced_service

Things to be mindful of:

  • Make sure that no traffic goes to the old service
  • Make sure downtime is scheduled for the alerts in Icinga

Decommission the non-TLS service from helm chart

The non-TLS service (template) may now be removed from the helm chart as well (freeing a nodePort).