Acme-chief/Cloud VPS setup

This doc is based on https://phabricator.wikimedia.org/T235252#5567838

Introduction

Acme-chief is Wikimedia's tool to integrate Let's Encrypt certificates into our puppetised services. It was originally developed by Alex Monk at the Wikimedia Hackathon in Barcelona in May 2018 based on discussions with the Wikimedia Traffic team, and basically solves these problems:

  • Issue our certificates centrally, distributing private key material and certificates as appropriate.
  • Use DNS to respond to LE challenges, enabling use of wildcards.
    • Integrate with gdnsd (production) and OpenStack Designate (Cloud VPS) to do this.
  • Generate RSA and ECDSA variants of the same certificates.
  • Expose new certificates after a staging time to allow for outdated client clocks.
  • Probably other things more relevant to production than us.

In production there's a single, simple, central setup that gets used for everything from the *.wikipedia.org 'unified' cert exposed to some clients hitting wikipedia.org and co., down to miscellaneous uses such as developer services on Gerrit and SMTP servers. In Cloud VPS, with our only-trustworthy-for-public-things central puppetmaster and multiple private Puppet setups, it gets more complicated.

Existing examples of acme-chief in Cloud VPS

This is known to be active within deployment-prep, traffic, and tools (with toolsbeta being set up).

Setting it up for your own Cloud VPS project

You will need

  • a project-local puppetmaster (i.e. your own puppetmaster, not just relying on the central one at puppetmaster.cloudinfra.wmflabs.org)
  • a friendly production root to give your DNS management service user special permissions in Keystone and safelist it for access from within the Cloud VPS address range
  • the domain for which you want to issue certs as a zone in Designate, with delegation set up correctly

Full steps

  • Create an instance named <project>-acme-chief-01 and do the usual dance with puppet to get it a signed puppet cert.
  • Create a puppet prefix config in Horizon for <project>-acme-chief with the following hiera as a template (obviously, substitute <project>):
profile::acme_chief::accounts: {}
profile::acme_chief::active: <project>-acme-chief-01.<project>.eqiad1.wikimedia.cloud
profile::acme_chief::passive: ''
profile::acme_chief::certificates: {}
shared_acme_certificates: {}
profile::acme_chief::challenges:
  dns-01:
    issuing_ca: letsencrypt.org
    ns_records:
    - ns0.openstack.eqiad1.wikimediacloud.org.
    - ns1.openstack.eqiad1.wikimediacloud.org.
    resolver_port: 53
    sync_dns_servers:
    - ignored_for_designate.
    zone_update_cmd: /usr/local/bin/acme-chief-designate-sync.py
profile::acme_chief::cloud::designate_sync_auth_url: https://openstack.eqiad1.wikimediacloud.org:25000/v3
profile::acme_chief::cloud::designate_sync_project_names: [<project>]
profile::acme_chief::cloud::designate_sync_region_name: eqiad1-r
profile::acme_chief::cloud::designate_sync_tidyup_enabled: true
profile::acme_chief::cloud::designate_sync_username: <project>-dns-manager
  • Insert the <project>-dns-manager password into puppet through a cherry-pick on your puppetmaster, by adding profile::acme_chief::cloud::designate_sync_password to hieradata/common.yaml in labs/private.
  • Apply the role::acme_chief::cloud role on the instance individually (in my experience roles in prefix/project config can be problematic) and run puppet.
  • Run the account creation script /usr/local/bin/create_acme_le_account.py
  • Insert the new account into the profile::acme_chief::accounts dict in hiera. It should look something like this:
profile::acme_chief::accounts:
  {hash}:
    directory: https://acme-v02.api.letsencrypt.org/directory
    regr: '{"body": {}, "uri": "https://acme-v02.api.letsencrypt.org/acme/acct/{number}"}'

Take {hash} from the account ID and {number} from the regr.json, both from the account creation step above. Be careful not to include the .body.key part of the regr.json.

  • Insert the regr.json and private_key.pem into the specified locations in cherry-picks on your puppetmaster.
  • Add your cert to the profile::acme_chief::certificates dict in hiera:
  mycertificate:
    CN: wikipedia.org
    SNI:
    - wikipedia.org
    - '*.wikipedia.org'
    authorized_regexes:
    - ^cp-[0-9]+\.myproject\.eqiad1\.wikimedia\.cloud$
    challenge: dns-01
  • Set acmechief_host: myproject-acme-chief-01.myproject.eqiad1.wikimedia.cloud on a project-wide basis (or at least on the instances which will be pulling certs from it)
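
The regr value in the accounts example above is a JSON string with an empty body and only the account URI. A minimal Python sketch of building that entry follows. It assumes regr.json is a JSON object with top-level "body" and "uri" keys, as the hiera example suggests. The function name, the sample account ID and the sample regr contents are made up for illustration and are not part of acme-chief.

```python
import json


def build_accounts_entry(account_id, regr, directory):
    """Return a one-item dict shaped like profile::acme_chief::accounts.

    Hypothetical helper: only the account URI is kept, so nothing under
    body (including body.key) ends up in hiera.
    """
    slim_regr = {"body": {}, "uri": regr["uri"]}
    return {
        account_id: {
            "directory": directory,
            # hiera stores regr as a JSON string, not as a nested mapping
            "regr": json.dumps(slim_regr),
        }
    }


# Made-up sample data standing in for the real regr.json contents.
sample_regr = {
    "body": {"key": {"kty": "RSA", "e": "AQAB"}, "status": "valid"},
    "uri": "https://acme-v02.api.letsencrypt.org/acme/acct/12345",
}

entry = build_accounts_entry(
    "0123456789abcdef",  # placeholder for the real account ID ({hash})
    sample_regr,
    "https://acme-v02.api.letsencrypt.org/directory",
)
print(json.dumps(entry, indent=2))
```

The printed structure can then be transcribed into hiera by hand. Dropping the body explicitly, rather than editing the file manually, makes it harder to leak the key material by accident.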

You should now be able to use the acme_chief::cert resource on your TLS termination box(es) to get a certificate, with a name matching what you have in the hiera config.
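
If a TLS termination box cannot fetch its certificate, one thing worth checking is whether its hostname matches one of the authorized_regexes in the certificate config. The Python sketch below is a local sanity check only. It assumes the regexes are meant to match the full hostname of the requesting instance, and it uses re.search, which behaves as a whole-string match for patterns anchored with ^ and $ like the one in the example. acme-chief's own matching logic may differ, and the function name is hypothetical.

```python
import re


def is_authorized(hostname, authorized_regexes):
    """Return True if hostname matches at least one of the given regexes."""
    return any(re.search(pattern, hostname) for pattern in authorized_regexes)


# Pattern copied from the certificate example in this page.
patterns = [r"^cp-[0-9]+\.myproject\.eqiad1\.wikimedia\.cloud$"]

for host in (
    "cp-1.myproject.eqiad1.wikimedia.cloud",
    "web-1.myproject.eqiad1.wikimedia.cloud",
):
    print(host, is_authorized(host, patterns))
```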

Summary

  • if the project doesn't already have one, set up a DNS management user in Keystone with observer and designateadmin permissions. More info about that is at Service accounts.
  • make acme-chief node(s)
  • add some hiera
  • add the DNS management (designateadmin) user's password as a secret
  • add the acme-chief role to your instance(s)
  • do LE account creation, commit results as a secret
  • start using it - configure a cert and point other instances in the project at your acme-chief instance

Troubleshooting

  • If a cert has strangely expired, you may have hit a known issue in acme-chief where it doesn't respond to HUP quite right. Restarting acme-chief should work. Otherwise, see Acme-chief#Monitoring
  • If client hosts complain about 'unable to get local issuer certificate' you may need to restart nginx on the acme-chief host, or restart the puppetserver.
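
To confirm whether a cert really has expired (or is about to), the expiry date can be turned into a number of days remaining. The sketch below is a hypothetical helper, not part of acme-chief. It assumes the date is in the "notAfter" format printed by openssl x509 -noout -enddate, and it relies on ssl.cert_time_to_seconds from the Python standard library.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return days until the given notAfter date (negative if expired).

    not_after is a string such as "Jan  5 09:34:43 2018 GMT"; an optional
    "notAfter=" prefix, as printed by openssl, is stripped first.
    """
    prefix = "notAfter="
    if not_after.startswith(prefix):
        not_after = not_after[len(prefix):]
    expiry = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expiry - current) / 86400


# Example date only; substitute the value from your own certificate.
remaining = days_until_expiry("notAfter=Jan  5 09:34:43 2018 GMT")
print("expired" if remaining < 0 else f"{remaining:.1f} days left")
```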