HTTPS/Unified Certificates

Wikimedia "Unified" Certificates


These are the primary multi-wildcard-SAN certificates used on the front edge of our Traffic clusters. They have a number of unique properties operationally:

  1. Highly important - these certificates terminate the bulk of all of our important live user-facing traffic.
  2. High SAN counts + Wildcards - We have all 13 canonical domains in these certs as SANs, wildcarded at the domain level and the m-dot level, as well as a few other odds and ends. In total, the current SAN count is 29, and most of those are wildcards.
  3. Broad deployment - These certs deploy to all Traffic edge nodes in all datacenters, so deployment/synchronization issues are a little trickier than smaller services with one to a handful of hosts.
  4. Redundancy - Because we use OCSP Stapling, which depends in near-realtime on the reliability of the upstream certificate providers' OCSP infrastructure, we purchase and deploy redundant copies of these certificates from two different vendors, and also obtain them from LetsEncrypt.

Certificate Vendor Deployment and Switching on Failure

We have had upstream OCSP failures affect us in the past: see Incident_documentation/20150820-OCSP and Incident_documentation/20161013-GlobalSign. Our plan for future OCSP incidents is to switch all datacenters to whichever vendor's certificates are not having OCSP issues.

Our current vendors are GlobalSign and Digicert, plus LetsEncrypt. Our standard deployment today uses the Digicert certificates in our non-US datacenters and LetsEncrypt in the US datacenters, so that both are known to be good by serving live user traffic. Certificates from all vendors are deployed to all hosts at all datacenters, and OCSP staple-fetching runs for all of them on all hosts at all times as well. Switching vendors is therefore just a matter of proxy reconfiguration.

The vendor used in each DC is set via hieradata:

   
   willikins:puppet vgutierrez$ git grep public_tls_unified_cert_vendor
   hieradata/codfw.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
   hieradata/eqiad.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
   hieradata/eqsin.yaml:public_tls_unified_cert_vendor: "digicert-2019a"
   hieradata/esams.yaml:public_tls_unified_cert_vendor: "digicert-2019a"
   hieradata/ulsfo.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
   

To switch in an emergency:

  1. Merge a puppet commit changing all of the above hieradata settings to reference the remaining functional vendor.
  2. Run the puppet agent on all cacheproxy hosts via cumin, e.g. sudo cumin A:cp 'run-puppet-agent -q'
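
The hieradata change in step 1 can be sketched as a small shell helper. This is a minimal sketch, assuming it is run from the root of a puppet repository checkout that contains the per-datacenter files shown above. The function name switch_unified_cert_vendor is hypothetical and not an existing tool, and the resulting change still has to be reviewed, committed and merged as usual.

```shell
# Hypothetical helper: set public_tls_unified_cert_vendor to one vendor
# in every hieradata YAML file that already defines the key.
switch_unified_cert_vendor() {
    local vendor="$1"
    local hieradata_dir="${2:-hieradata}"   # assumed default location
    local key='public_tls_unified_cert_vendor'
    local file

    if [ -z "$vendor" ]; then
        echo "usage: switch_unified_cert_vendor VENDOR [HIERADATA_DIR]" >&2
        return 1
    fi

    for file in "$hieradata_dir"/*.yaml; do
        # Skip files that do not define the key (or an unexpanded glob).
        grep -q "^${key}:" "$file" 2>/dev/null || continue
        # Rewrite the whole line via a temporary file, then replace the original.
        sed "s/^${key}:.*/${key}: \"${vendor}\"/" "$file" > "${file}.tmp" \
            && mv "${file}.tmp" "$file"
    done
}
```

For example, running switch_unified_cert_vendor digicert-2019a and then git diff hieradata/ shows the pending change before it is committed.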

Sometimes OCSP staleness alerts keep firing after an issue with the certificate vendor's infrastructure has already been resolved. In this case, you can manually trigger an OCSP refresh with:

   sudo -i cumin -b1 'A:cp-eqiad' "/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all"