Acme-chief

From Wikitech

Acme-chief is an application that originated at the Wikimedia Hackathon 2018. It centrally requests the configured TLS certificates from ACME servers, then makes the public and private parts available to authorised API users.

See T235252 for how to set this up for a Cloud VPS project, particularly the service account creation subtask, which needs to be performed by the cloud administrators.

In production this is already set up to manage production DNS. Most people probably just need to know that the certificate configuration lives in the hieradata/role/common/acme_chief.yaml file in operations/puppet.git.

The certificates listed in acme-chief's hieradata shared_acme_certificates blocks are grouped into a maximum of 40 domains per certificate. Splitting the many domains into smaller, more numerous certificates prevents the X.509 certificates from growing to a size detrimental to our performance: it avoids potential extra round-trip delays caused by overly large certificate SAN lists.
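As an illustration of the 40-domain limit only, the sketch below splits a list of domains into groups of at most N names, one group per output line. The function name group_domains and the example domains are hypothetical; this is not how the hieradata is produced, just a way to see how many certificates a given list would need.

```shell
# Hypothetical helper: print the given domains in groups of at most MAX
# names, one group per line (each line would map to one certificate).
# Usage: group_domains MAX domain [domain ...]
group_domains() {
    gd_max="$1"
    shift
    gd_count=0
    gd_line=""
    for gd_domain in "$@"; do
        # Flush the current group once it already holds MAX names.
        if [ "$gd_count" -eq "$gd_max" ]; then
            printf '%s\n' "$gd_line"
            gd_line=""
            gd_count=0
        fi
        gd_line="${gd_line:+$gd_line }$gd_domain"
        gd_count=$((gd_count + 1))
    done
    # Print the last, possibly shorter, group.
    if [ "$gd_count" -gt 0 ]; then
        printf '%s\n' "$gd_line"
    fi
}
```

Calling group_domains 40 with 95 domain names would print three lines, holding 40, 40 and 15 names respectively.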

Monitoring

If acme-chief is having issues, you should also check the Let's Encrypt status page to make sure it isn't having an outage or maintenance.

Production environment

The acme-chief production environment is composed of one active instance and at least one passive instance. The hiera key profile::acme_chief::active flags an instance as active, while passive instances are listed in an array called profile::acme_chief::passive.

The active instance is responsible for running both the acme-chief service (acme-chief.service) and the puppet file API service (uwsgi-acme-chief.service and nginx.service). A passive instance idly runs the puppet file API service but not acme-chief.service. TLS material is synchronized between instances using the oneshot systemd service acme-chief-certs-sync.service. This service is triggered by a systemd timer every 30 minutes on the active host.
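The service layout described above suggests a quick way to tell which role a host is currently playing. The following is a minimal sketch, not part of acme-chief: classify_role is a hypothetical helper that takes the state strings printed by systemctl is-active and maps them to a role.

```shell
# Hypothetical helper: infer the role of a host from two unit states.
# Usage: classify_role ACME_CHIEF_STATE API_STATE
# Each argument is the word printed by `systemctl is-active UNIT`
# (for example "active", "inactive" or "failed").
classify_role() {
    if [ "$2" != "active" ]; then
        # The puppet file API should run on every instance.
        echo "unknown"
    elif [ "$1" = "active" ]; then
        echo "active"
    else
        echo "passive"
    fi
}

# On an acme-chief host this could be called as:
#   classify_role "$(systemctl is-active acme-chief.service)" \
#                 "$(systemctl is-active uwsgi-acme-chief.service)"
```

Only uwsgi-acme-chief.service is checked for the API side here; a fuller check would also look at nginx.service.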

Replacing the active instance

  1. Create the new Ganeti VM.
  2. Set its role to acme_chief in site.pp and make sure it is listed as a passive instance in profile::acme_chief::passive
  3. Once the new instance is up and running, arm keyholder for SSH access
  4. Run run-puppet-agent on the current active instance
  5. Trigger the TLS material sync service on the active instance — either directly on the host or via Cumin:
    $ sudo systemctl start acme-chief-certs-sync.service
  6. The new instance should now have a current copy of the TLS material in /var/lib/acme-chief/certs. You can verify any of the certificates. For example:
    $ openssl x509 -noout -dates -issuer -in /var/lib/acme-chief/certs/unified/live/ec-prime256v1.crt
  7. Disable puppet on the old active instance and on every acme-chief client:
    $ sudo -i cumin 'R:acme_chief::cert' "disable-puppet 'acmechief maintenance'"
  8. Stop acme-chief.service on the old active instance
  9. Set the new instance as active in profile::acme_chief::active and remove it from profile::acme_chief::passive
  10. Execute run-puppet-agent on the new active instance. After Puppet is done, acme-chief.service should be up and running on the new instance.
  11. Re-enable puppet on the acme-chief clients:
    $ sudo -i cumin 'R:acme_chief::cert' "enable-puppet 'acmechief maintenance'"
  12. Decommission the old instance using the sre.hosts.decommission cookbook
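For the verification in step 6, checking a single certificate can be extended to every synced certificate. The sketch below assumes the directory layout CERTS_DIR/NAME/live/KEYTYPE.crt, which is inferred only from the example path in step 6; the function names list_live_certs and check_live_certs are hypothetical.

```shell
# Hypothetical helper: print every live certificate file under CERTS_DIR,
# one path per line. Assumes the layout CERTS_DIR/NAME/live/KEYTYPE.crt.
list_live_certs() {
    for llc_crt in "$1"/*/live/*.crt; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$llc_crt" ] && printf '%s\n' "$llc_crt"
    done
    return 0
}

# Hypothetical helper: show validity dates and issuer for each certificate,
# using the same openssl invocation as in step 6.
check_live_certs() {
    list_live_certs "$1" | while IFS= read -r clc_crt; do
        echo "== $clc_crt"
        openssl x509 -noout -dates -issuer -in "$clc_crt"
    done
}
```

On the new instance this could be run as check_live_certs /var/lib/acme-chief/certs, most likely with sudo because the directory holds private key material.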

See also