HTTPS/Unified Certificates

From Wikitech

These are the primary multi-wildcard-SAN certificates that not only serve our Traffic clusters but also serve other Wikimedia functions such as Fundraising. They have a number of unique properties operationally:

  1. Highly important - these certificates terminate the bulk of all of our important live user-facing traffic.
  2. High SAN counts + Wildcards - All canonical domains are in these certs as SANs, wildcarded at the domain level and the m-dot level, along with a few other odds and ends. In total the current SAN count is 29, and most of those are wildcards.
  3. Broad deployment - These certs deploy to all Traffic edge nodes in all datacenters, so deployment/synchronization issues are a little trickier than for smaller services with one to a handful of hosts.
  4. Redundancy - Because we use OCSP Stapling, which depends on the upstream certificate providers' OCSP infrastructure being reliable in near-realtime, we purchase and deploy redundant copies of these certificates from two different vendors, and additionally from LetsEncrypt.

Certificate Vendor Deployment and Switching on Failure

We have had upstream OCSP failures affect us in the past: Incident_documentation/20150820-OCSP and Incident_documentation/20161013-GlobalSign. Our plan for future OCSP incidents is to switch all datacenters to whichever vendor's certificates are not having OCSP issues.

Our current vendors are Digicert and LetsEncrypt. Our standard deployment today uses the Digicert certificates in our non-US datacenters and LetsEncrypt in the US datacenters, so that both are known-good by servicing live user traffic. Certificates from all vendors are deployed to the filesystems of all edge hosts at all datacenters, and OCSP staple-fetching runs for all of them on all hosts at all times. Switching which certificate is in active use at a given edge datacenter is just a matter of proxy reconfiguration driven by hieradata:

$ git grep public_tls_unified_cert_vendor
hieradata/codfw.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
hieradata/drmrs.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/eqiad.yaml:public_tls_unified_cert_vendor: "lets-encrypt"
hieradata/eqsin.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/esams.yaml:public_tls_unified_cert_vendor: "digicert-2022"
hieradata/ulsfo.yaml:public_tls_unified_cert_vendor: "lets-encrypt"

To switch in an emergency:

  1. Merge a puppet commit changing all of the above hieradata settings to reference the remaining functional vendor.
  2. Run the puppet agent on all cacheproxy hosts via cumin, e.g. sudo cumin A:cp 'run-puppet-agent -q'
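
Step 1 above can be prepared with a small helper along the following lines. This is only a sketch, not an established tool: the function name switch_unified_vendor is hypothetical, it assumes GNU grep and sed, and it assumes it is run from the root of a puppet repository checkout where the hieradata/ files shown earlier live.

```shell
# Hypothetical helper (not part of the puppet repo): point every
# public_tls_unified_cert_vendor setting under a hieradata directory
# at a single vendor. Assumes GNU grep/sed and a vendor name without "/".
switch_unified_vendor() {
    vendor="$1"
    dir="${2:-hieradata}"
    # List the files that set the key, then rewrite the value in place.
    grep -rl '^public_tls_unified_cert_vendor:' "$dir" | while IFS= read -r f; do
        sed -i -E "s/^(public_tls_unified_cert_vendor:).*/\1 \"${vendor}\"/" "$f"
    done
}
```

After running something like switch_unified_vendor "lets-encrypt", review the result with git diff before committing and merging, then continue with step 2.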

Sometimes OCSP staleness alerts fire because of a now-resolved issue with the certificate vendor's infrastructure. In that case, for a manually-issued vendor such as Digicert, you can manually trigger an OCSP refresh with: sudo -i cumin -b1 'A:cp-eqiad' "/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all"

For LetsEncrypt certificate OCSP issues, see the Acme-chief documentation.
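
After a vendor switch or an OCSP refresh, it can be useful to confirm from the outside that a stapled OCSP response is actually being served. The sketch below is one possible check, not an existing Wikimedia tool: the helper name ocsp_staple_ok is hypothetical, and it simply looks for the success line that openssl s_client -status prints when the server sends a staple.

```shell
# Hypothetical helper: succeed only if `openssl s_client -status` output
# (read from stdin) contains a successful stapled OCSP response.
ocsp_staple_ok() {
    grep -q 'OCSP Response Status: successful'
}

# Example use (the hostname is an illustrative assumption; which edge
# datacenter answers depends on where the request is routed):
#   openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org \
#       -status </dev/null 2>/dev/null | ocsp_staple_ok && echo "staple OK"
```

This only checks that a staple is present and reports success; it does not evaluate how fresh the response is, which is what the staleness alerts cover.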

Validation

Wikimedia's domains must be validated by the issuing certificate authority before it will issue a unified certificate. Presently, Wikimedia uses email-based verification.

A bad actor who gained control of Wikimedia's DNS could impersonate WMF and alter the TXT records just as easily as they could redirect the email. Email verification is therefore not meaningfully weaker than TXT-based verification in this case. Future use of TXT records will be implemented not for security but for the convenience of the renewal process.

The following notes are not intended as a full procedure; consider them a generic reminder of some of the required steps. Many of these instructions may change, depending on the DigiCert website and new certificate types.

To validate the domains for the unified certificate via email:

  1. Notify the appropriate teams in the appropriate channels that verification emails are about to be sent.
  2. Follow the official documentation for verifying emails, noting:
    • Not all DCV administrative email addresses that are suggested should be used (e.g. admin@, webmaster@). Use hostmaster@.
    • It's possible that a new domain has not been set up for email routing. If that's the case, either create a patch setting up email routing or create a patch setting the validation TXT record, verify, then revert.
  3. Once the domains have been validated, renew the certificate using the official documentation as a guideline, noting:
    • The CSR will include the CN but not any of the SANs. The SANs will be added automatically via the web interface, which is pre-filled.
    • Use the puppet master server to generate a CSR using the existing domain keys that live under /srv/private/modules/secret/secrets/ssl:
      # openssl req -new -key <current_key>.key -out server.csr
Older keys are kept around in the private repository in case we need to revoke the certificates.
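
Before uploading the CSR, it may be worth confirming that it carries the expected CN and that its self-signature verifies. A minimal sketch, assuming the CSR was written to server.csr as in the command above; the helper name inspect_csr is hypothetical:

```shell
# Hypothetical helper: check a CSR's self-signature and print its subject.
# Uses only standard `openssl req` options (-in, -noout, -verify, -subject).
inspect_csr() {
    openssl req -in "$1" -noout -verify -subject
}

# Example:
#   inspect_csr server.csr
```

The printed subject should show the CN only; as noted above, the SANs are added through the web interface and are not expected in the CSR.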

See also