We use envoy to provide TLS termination for services. It is installed as a sidecar in each pod and functions as a reverse proxy to the app. We intend to use it at some point as an initiation point for TLS as well, but that's down the road.
Add support to the chart
TLS support already exists; it has been split off from the individual charts and lives in common_templates/. Feel free to look at other charts and copy their approach. The basics are:
- Symlink from the chart to those (helm package will resolve them correctly)
- Amend the chart to use them
- Add a values file to the .fixtures/ directory, so CI can test the chart with TLS enabled
- Define an appropriate "upstream_timeout" for envoy to use. The current default is 60s
- Use the most recent image version (https://tools.wmflabs.org/dockerregistry/envoy/tags/)
- Choose a new TCP port.
- Update the Service ports to point to it.
- Make the port configurable so we can change it without touching the chart
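As a rough illustration of the fixture step above, the sketch below writes a TLS-enabled values file into .fixtures/. Only tls.enabled and the 60s default for upstream_timeout come from this page. The file name tls_enabled.yaml, the placement of upstream_timeout under tls, and the public_port key with its value 4443 are assumptions, so check them against common_templates/ and an existing chart before copying anything.

```shell
# Sketch: create a CI fixture that renders the chart with TLS turned on.
# Assumptions (not confirmed by this page): file name, key layout and port.
write_tls_fixture() {
  chart_dir=$1
  mkdir -p "${chart_dir}/.fixtures"
  cat > "${chart_dir}/.fixtures/tls_enabled.yaml" <<'EOF'
tls:
  enabled: true            # key name taken from this page
  upstream_timeout: "60s"  # current default; placement under tls is assumed
  public_port: 4443        # hypothetical key and port, pick your own
EOF
}

# Usage (chart path is hypothetical):
#   write_tls_fixture charts/myservice
```

Keeping the port in the values file rather than in the template is what lets it be changed later without another chart release.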
Create and place certificates
- Patch the helm chart to add the relevant stanzas. Remember to package the chart and reindex before merging your patch
- Assuming you've guarded the TLS addition, do a noop deployment to verify you didn't change something fundamental
- For staging deployments, certificates for staging.svc.eqiad.wmnet and staging.svc.codfw.wmnet are provided by default. You may of course override them if you need to
- Add the relevant production certificate to puppet's private repo:
In /srv/private/modules/secret/secrets/certificates/certificate.manifests.d/kube_services.certs.yaml, add a stanza for your service. It should closely mimic the existing ones and should at least have the following alt_names:
- DO NOT SET A PASSWORD. Using a password results in an encrypted key file, which envoyproxy can't use.
- Run cergen
cergen -c "$SERVICE_NAME.*" --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
to see if the right certificates would be generated; then run it again adding --generate to create the certificate
- ONLY IF YOU SET A KEY PASSWORD do the following: we need the unencrypted key, so create it with
openssl ec -in modules/secret/secrets/certificates/$CERT_NAME/$CERT_NAME.key.private.pem -out modules/secret/secrets/certificates/$CERT_NAME/$CERT_NAME.key.private.unencrypted.pem
You will be asked for the password (the one you set up in cergen)
- Commit all the generated files to git
- In /srv/private/hieradata/role/common/deployment_server/kubernetes.yaml, add the certificate in the appropriate place, for all production environments:
    profile::kubernetes::deployment_server_secrets::services:
      blubberoid:
        eqiad:
          tls: &blubberoid_certs
            certs:
              # NOTE: If you set a password, use the $CERT_NAME.key.private.unencrypted.pem file you created instead.
              key: "secret(certificates/$CERT_NAME/$CERT_NAME.key.private.pem)"
              cert: "secret(certificates/$CERT_NAME/$CERT_NAME.crt.pem)"
        codfw:
          tls: *blubberoid_certs
        ...
- Commit all your changes
- Run puppet on the deployment hosts, verify the data that gets written to the
- Add the rest of the configuration for tls enablement in deployment-charts under
- Happy helming!
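The cergen steps above can be sketched as a small pair of shell helpers: one that always does the dry run unless told otherwise, and one that checks whether a key file is password-protected before it is referenced in hiera. The function names run_cergen and key_is_encrypted, the CERT_BASE variable and the "generate" argument are illustrative only; the cergen invocation itself is the one given on this page, with --generate appended for the second run.

```shell
# Sketch of the cergen steps above. CERT_BASE mirrors the path used on this page.
CERT_BASE=${CERT_BASE:-/srv/private/modules/secret/secrets/certificates}

# Dry run by default; pass "generate" as the second argument to create the files.
run_cergen() {
  service_name=$1
  mode=${2:-dry-run}
  if [ "$mode" = "generate" ]; then
    cergen -c "${service_name}.*" --base-path "$CERT_BASE" \
      "$CERT_BASE/certificate.manifests.d" --generate
  else
    cergen -c "${service_name}.*" --base-path "$CERT_BASE" \
      "$CERT_BASE/certificate.manifests.d"
  fi
}

# Succeeds (exit 0) if the PEM key file looks password-protected.
# Traditional PEM keys carry a "Proc-Type: 4,ENCRYPTED" header line and
# PKCS#8 keys start with "-----BEGIN ENCRYPTED PRIVATE KEY-----".
key_is_encrypted() {
  grep -Eq '^(Proc-Type: 4,ENCRYPTED|-----BEGIN ENCRYPTED PRIVATE KEY-----)' "$1"
}

# Usage (service name is an example):
#   run_cergen blubberoid            # inspect what would be generated
#   run_cergen blubberoid generate   # actually create the certificate
#   key_is_encrypted "$CERT_BASE/$CERT_NAME/$CERT_NAME.key.private.pem" \
#     && echo "encrypted: envoy cannot use this key as-is"
```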
Deploy the new chart version that has TLS support
Running helmfile sync/apply in all of the clusters (staging, codfw, eqiad) should cover this. The documentation could use some love, but we have Deployments on kubernetes already.
Enable the TLS support
Add a gerrit change to switch tls.enabled to true, perhaps cluster by cluster, and turn it on.
Create a new LVS service for the TLS-enabled service
Follow LVS#Add_a_new_load_balanced_service to create a new LVS service on your newly chosen port, but on the same LVS IP as the previous one.
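Once the new LVS service answers on the new port, one possible sanity check is to look at the certificate it actually serves. The helper below is a minimal sketch using standard openssl subcommands; the function name show_served_cert, the host name and the port 4443 are placeholders, not values defined on this page.

```shell
# Sketch: print subject, issuer and validity dates of the certificate served
# on host:port. Host and port are placeholders for your service's values.
show_served_cert() {
  host=$1
  port=$2
  # s_client reads stdin, so feed it /dev/null to make it exit after the handshake.
  openssl s_client -connect "${host}:${port}" -servername "${host}" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer -dates
}

# Usage (hypothetical name and port):
#   show_served_cert myservice.svc.eqiad.wmnet 4443
```

If the subject or dates do not match the certificate generated earlier, the secrets wiring is the first thing to recheck.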
Switch traffic, aka switch configuration of dependent services to use the new LVS service
Things that might need to be changed:
- mediawiki-config https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config
- caching proxies configuration https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml
Things to be mindful of:
- CPU and memory limits of the envoy sidecar container when more traffic starts hitting the new LVS service.
Remove the old LVS service
This is the inverse of creating the new LVS service. There is already a runbook at LVS#Remove_a_load_balanced_service
Things to be mindful of:
- Make sure that no traffic goes to the old service
- Make sure downtime is scheduled in icinga for the affected alerts
Decommission the non-TLS service from helm chart
The non-TLS service (template) may now be removed from the helm chart as well (freeing a nodePort).