Puppet CA replacement
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.
This page was last updated in 2015 and may be outdated. Please update it if you can.
The use of Salt and the specific hostnames shown are no longer correct.
Goal
To replace this CA keypair & cert before it expires, WITHOUT breaking Puppet, Bacula, or IPsec:
subject= /CN=sockpuppet.pmtpa.wmnet
issuer= /CN=sockpuppet.pmtpa.wmnet
notAfter=May 7 17:57:30 2016 GMT
- clients: /var/lib/puppet/ssl/certs/ca.pem
- servers: /var/lib/puppet/server/ssl/certs/ca.pem
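To keep an eye on how close the deadline is, the notAfter value can be turned into a day count. The following is a minimal sketch and not part of the original procedure: days_until is a hypothetical helper name, and it assumes GNU date (for the -d option).

```shell
#!/bin/sh
# days_until NOT_AFTER [NOW]
# Hypothetical helper: prints the whole number of days between NOW
# (default: the current time) and NOT_AFTER.
# Both arguments use the format printed by `openssl x509 -enddate`,
# e.g. "May 7 17:57:30 2016 GMT". Assumes GNU date, which accepts -d.
days_until() {
    end=$(date -u -d "$1" +%s) || return 1
    if [ -n "$2" ]; then
        now=$(date -u -d "$2" +%s) || return 1
    else
        now=$(date -u +%s)
    fi
    # Integer division; a negative result means the cert expired
    # at least one full day ago.
    echo $(( (end - now) / 86400 ))
}
```

On a client this could be fed from the CA cert itself, for example: days_until "$(openssl x509 -in /var/lib/puppet/ssl/certs/ca.pem -noout -enddate | cut -d= -f2)"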
Current state
In Production, Puppet master nodes use two sets of keys+certs:
/var/lib/puppet/server/ssl/
- served by the frontend load balancer
- served by the puppet master application
- this is what agents see when they connect to the puppetmaster
- same files on palladium + strontium
/var/lib/puppet/ssl/
- client cert for the local puppet agent
- also served by the Apache backend which proxies to the application
- this is what the frontend sees when it connects to the backend
- unique files on palladium vs. strontium
Puppet Agent->Master connection sequence:
- Clients connect to hostname "puppet", as given by
server = puppet
in the [agent] section of /etc/puppet/puppet.conf, on the default port: 8140.
- DNS redirects these connection requests to Palladium via CNAME:
puppet.eqiad.wmnet is an alias for palladium.eqiad.wmnet.
puppet.esams.wmnet is an alias for palladium.eqiad.wmnet.
puppet.codfw.wmnet is an alias for palladium.eqiad.wmnet.
puppet.ulsfo.wmnet is an alias for palladium.eqiad.wmnet.
- Apache listens for HTTPS connections on palladium.eqiad.wmnet:8140 using the certs in
/var/lib/puppet/server/ssl/
- Configured by
/etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf
- Request headers are set to pass the client certificate authentication information on to the puppet master process[1]
- Requests for certificates are proxied to palladium:8141[2] (example:
curl -k https://puppet:8140/production/certificate/cp3030.esams.wmnet
)
- Reports PUT by clients are centralized by proxying to palladium:8141[3]
- File bucket requests, and "volatile" content requests should[citation needed] only be in one place, so these are also proxied to palladium:8141.
- All other requests are load-balanced between palladium:8141 and strontium:8141
- For the backend, Apache listens for HTTPS connections on palladium.eqiad.wmnet:8141 and strontium.eqiad.wmnet:8141 using the certs in
/var/lib/puppet/ssl/
- Configured by
/etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf
- The puppet master itself is a Rack application run inside Apache by the Passenger module[4]
- The puppet master application uses the certs in
/var/lib/puppet/server/ssl/
, as configured in the [master] section of /etc/puppet/puppet.conf
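The frontend's routing decision described in this sequence can be sketched as a small function. This is only an illustration: route_backend is a hypothetical name, and the endpoint names are assumptions modelled on Puppet 3 style URLs such as /production/certificate/cp3030.esams.wmnet; they are not copied from the actual Apache configuration.

```shell
#!/bin/sh
# route_backend PATH
# Hypothetical sketch: print which backend(s) a request path would go to.
# Endpoint names are assumed from Puppet 3 style URLs of the form
# /<environment>/<endpoint>/<key>, not taken from the real vhost config.
route_backend() {
    rest=${1#/}            # strip the leading slash
    rest=${rest#*/}        # strip the environment, e.g. "production"
    endpoint=${rest%%/*}   # first remaining path component
    central="palladium:8141"
    case "$endpoint" in
        certificate*|report|file_bucket_file)
            # CA traffic, reports and the file bucket live in one place
            echo "$central" ;;
        file_content|file_metadata|file_metadatas)
            case "$rest" in
                # "volatile" content is also served from one place
                "$endpoint"/volatile/*) echo "$central" ;;
                *) echo "palladium:8141 or strontium:8141" ;;
            esac ;;
        *)
            # everything else is load-balanced
            echo "palladium:8141 or strontium:8141" ;;
    esac
}
```

For example, route_backend /production/catalog/cp3030.esams.wmnet prints both backends, while the certificate request from the curl example above prints only palladium:8141.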
Bacula uses the client keypairs for TLS, and also for on-disk encryption:
- In bacula::client, exec resource "concat-bacula-keypair" copies cert + private key into bacula-keypair-${::fqdn}.pem
- Does not auto-update when the source files change, so manually move the destination file aside to trigger regeneration.
- In bacula::director, exec resource "bacula_cp_private_key" copies private key into bacula-${::fqdn}.pem
- Does not auto-update when the source files change, so manually move the destination file aside to trigger regeneration.
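The "move the destination file aside" behaviour is what one would expect from an exec that only runs while its target file is missing. The sketch below mimics that under this assumption; concat_keypair is a hypothetical shell stand-in, not the manifest's actual code.

```shell
#!/bin/sh
# concat_keypair CERT KEY DEST
# Hypothetical stand-in for an exec resource that only runs when DEST is
# missing: it builds DEST from CERT + KEY once, so later changes to the
# source files are not picked up until DEST is moved aside.
concat_keypair() {
    if [ -e "$3" ]; then
        echo "skipped: $3 already exists"
        return 0
    fi
    # Restrictive umask because DEST contains a private key.
    ( umask 077 && cat "$1" "$2" > "$3" ) && echo "created: $3"
}
```

Running it a second time after the cert has changed leaves the old keypair in place; only after DEST is renamed does the next run pick up the new cert.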
Strongswan uses the client keypairs for IPsec transport encryption:
- Strongswan requires its keys to be in /etc/ipsec.d/, and will not accept symlinks from /var/lib/puppet/ssl/, therefore the keys + certs are defined as file resources in the "strongswan" class.
- Files are auto-updated and the service is notified when the source files change. This will interrupt established transits.
Etcd uses the client keypairs for SSL:
- In etcd::ssl, keys + certs are defined as file resources
- Files are auto-updated and the service is notified when the source files change. This will interrupt etcd communication.
Procedure
All commands are to be run from Palladium, except as noted.
- Announce maintenance via email
- Announce & !log maintenance on IRC
- Schedule downtime in Icinga:
- https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=puppet+last+run
- It seems impossible to do this with a single command, as selecting all services only results in servers starting with A-M displayed in the resulting command screen. Instead, schedule downtime in multiple chunks.
- Make note of any nodes which already have maintenance for this service scheduled, as they will need to be re-set later.
- Stop puppet agents:
sudo salt -b 200 '*' cmd.run 'puppet agent --disable "puppet CA cert replacement"' >> /home/gage/puppet-ca-replacement.log
- Depool nodes running Strongswan (cp3030, cp1065)
- Copy the list of currently queued ("Scheduled") jobs in the Bacula Director so that they can be re-queued later
sudo salt 'helium.eqiad.wmnet' cmd.run "echo status director | sudo bconsole | grep Backup | awk '{print \$6}' > bacula-job-queue.`date +%Y-%m-%d_%T`" >> /home/gage/puppet-ca-replacement.log
- Stop Bacula service components
- Director:
sudo salt 'helium.eqiad.wmnet' cmd.run 'service bacula-director stop' >> /home/gage/puppet-ca-replacement.log
- Storage Daemons:
sudo salt -L 'helium.eqiad.wmnet,heze.codfw.wmnet' cmd.run 'service bacula-sd stop' >> /home/gage/puppet-ca-replacement.log
- File Daemons:
sudo salt -L 'antimony.wikimedia.org,bast1001.wikimedia.org,caesium.eqiad.wmnet,carbon.wikimedia.org,dbstore1001.eqiad.wmnet,gallium.wikimedia.org,helium.eqiad.wmnet,iron.wikimedia.org,lithium.eqiad.wmnet,mira.codfw.wmnet,palladium.eqiad.wmnet,silver.wikimedia.org,sodium.wikimedia.org,stat1002.eqiad.wmnet,stat1003.eqiad.wmnet,mwmaint1001.eqiad.wmnet,deploy1001.eqiad.wmnet,titanium.wikimedia.org,uranium.wikimedia.org,ytterbium.wikimedia.org' cmd.run 'service bacula-fd stop' >> /home/gage/puppet-ca-replacement.log
- Back up Bacula keys:
sudo salt -L 'antimony.wikimedia.org,bast1001.wikimedia.org,caesium.eqiad.wmnet,carbon.wikimedia.org,dbstore1001.eqiad.wmnet,gallium.wikimedia.org,helium.eqiad.wmnet,iron.wikimedia.org,lithium.eqiad.wmnet,mira.codfw.wmnet,palladium.eqiad.wmnet,silver.wikimedia.org,sodium.wikimedia.org,stat1002.eqiad.wmnet,stat1003.eqiad.wmnet,mwmaint1001.eqiad.wmnet,deploy1001.eqiad.wmnet,titanium.wikimedia.org,uranium.wikimedia.org,ytterbium.wikimedia.org' cmd.run 'sudo mv -v /var/lib/puppet/ssl/private_keys/bacula-keypair-`hostname -f`.pem /var/lib/puppet/ssl/private_keys/bacula-keypair-`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
sudo salt 'helium.eqiad.wmnet' cmd.run 'sudo mv -v /var/lib/puppet/ssl/private_keys/bacula-`hostname -f`.pem /var/lib/puppet/ssl/private_keys/bacula-`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
- Ensure a few minutes have passed to allow the last agent runs to complete and PUT their reports to the puppetmaster. The puppetmaster frontend should now be idle. Confirm:
sudo tail -f /var/log/apache2/puppetmaster.log &
- Keep this running, we'll use it later to confirm restart
- Stop the puppetmaster Apache 'site' on Palladium, but leave Apache running for PyBal:
sudo a2dissite 50-puppetmaster-wikimedia-org.conf && sudo service apache2 reload
- Confirm it's no longer listening on 8140 or 8141:
sudo lsof -i -n -P | grep apache | grep LISTEN | awk '{print $9}' | sort -u
- Temporarily create an autosigning whitelist[5] for all currently-signed hosts so that their new certs may be accepted:
sudo ls /var/lib/puppet/server/ssl/ca/signed | sed 's/\.pem$//' | sudo tee /etc/puppet/autosign.conf
- Back up puppet server SSL dirs:
sudo salt -L 'palladium.eqiad.wmnet,strontium.eqiad.wmnet' cmd.run 'mv -v /var/lib/puppet/server/ssl /var/lib/puppet/server/ssl.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
- Preserve inventory.txt on palladium because it is a historical record (optional):
sudo mkdir -p /var/lib/puppet/server/ssl/ca && sudo cp /var/lib/puppet/server/ssl.*/ca/inventory.txt /var/lib/puppet/server/ssl/ca/
- Prep puppet agent SSL dirs: keep RSA keypair and back up agent cert but remove CA cert, CRL and CSR:
sudo salt -b 200 '*' cmd.run 'for i in certificate_requests/`hostname -f`.pem certs/ca.pem crl.pem ; do rm /var/lib/puppet/ssl/$i; done; mv -v /var/lib/puppet/ssl/certs/`hostname -f`.pem /var/lib/puppet/ssl/certs/`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
- Sanity check: verify correct clock/date on palladium before proceeding with next step
- Regenerate the puppet CA keys & cert on palladium:
sudo puppet cert --generate `hostname -f`
- Creates CA keypair + cert, server keypair + cert, and associated files (CA key passphrase, serial, CRL, inventory.txt)[6] in
server/ssl/
- Copy server/ssl/ from palladium to strontium using a tar pipe:
- From admin's laptop:
ssh palladium.eqiad.wmnet 'sudo tar cvf - /var/lib/puppet/server/ssl' | ssh strontium.eqiad.wmnet 'sudo tar xvf - -C /'
- Fix puppet.conf on palladium + strontium by manually editing the file AND merging a change in the Git repo to remove these lines:
certname = puppet
hostcert = /var/lib/puppet/server/ssl/certs/palladium.eqiad.wmnet.pem
hostprivkey = /var/lib/puppet/server/ssl/private_keys/palladium.eqiad.wmnet.pem
- This change is needed because prior to this procedure, the hostcert and hostprivkey settings were invalid and ignored, and the cert is called puppet.pem (Subject: CN=puppet). With the newly generated server cert, the CN is the FQDN, and the "puppet" service name is supported by DNS:puppet in the X509v3 Subject Alternative Name field of the cert. hostcert and hostprivkey are now correct but no longer need to be explicitly set because they use their default values.
- The Apache backend on port 8141 uses the puppet agent certs for SSL, so puppet agent must be run on these nodes before we can start the Apache puppetmaster. To bootstrap past the circular dependency of needing to run the agent before the master is started, temporarily enable and start the standalone (webrick) puppetmaster, run the agents, then shut it down and disable it again:
sudo sed -i 's/START=no/START=yes/' /etc/default/puppetmaster && sudo service puppetmaster start
sudo salt -L 'palladium.eqiad.wmnet,strontium.eqiad.wmnet' cmd.run 'puppet agent --enable && puppet agent --onetime --no-daemonize' >> /home/gage/puppet-ca-replacement.log
sudo sed -i 's/START=yes/START=no/' /etc/default/puppetmaster && sudo service puppetmaster stop
- Now Apache has all the certs it needs to start up the real Passenger-based puppetmaster.
- Restart Apache puppetmaster 'site' & confirm success
- Ensure we're still tailing Apache's log from step 9 (the tail -f started before the Apache site was stopped):
jobs
sudo a2ensite 50-puppetmaster-wikimedia-org.conf && sudo service apache2 reload
- Confirm it's listening on 8140 and 8141:
sudo lsof -i -n -P | grep apache | grep LISTEN | awk '{print $9}' | sort -u
- Sanity check: run puppet agent & observe result on cp1065:
- From admin's laptop:
ssh cp1065.eqiad.wmnet "sudo puppet agent -tv"
- Enable puppet agents for all other servers
sudo salt -b 200 '*' cmd.run 'puppet agent --enable' >> /home/gage/puppet-ca-replacement.log
- Wait ~23 minutes for agent to run on all nodes
sleep 1380 && echo "ready to continue"
- Confirm that agents have run by counting auto-signed certs, then remove autosign.conf:
sudo wc -l /etc/puppet/autosign.conf
sudo ls /var/lib/puppet/server/ssl/ca/signed | wc -l
sudo rm /etc/puppet/autosign.conf
- The next agent invocation will:
- Copy SSL certs + keys to /etc/ipsec.d/ for Strongswan, and restart services.
- Create copies of SSL certs + keys for Bacula, and restart services.
- On Helium:
bacula-`hostname -f`.pem
- On all hosts:
bacula-keypair-`hostname -f`.pem
- Confirm:
- Successful agent runs
- Successful Bacula operation
- Successful IPsec transport establishment:
ipsec status
- Re-queue jobs in Bacula: TODO based on file created above
- Re-pool IPsec nodes cp3030 & cp1065
- Re-set maintenance for any puppet agent checks in Icinga already set before maintenance began
- Note any hosts which were offline during this maintenance, as their puppet agents will need to be updated
- Check Icinga for any remaining issues
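Finding the hosts that were offline during the maintenance amounts to comparing the autosign whitelist with the certs signed by the new CA. A minimal sketch, assuming it is run before autosign.conf is removed; signed_hosts and unsigned_hosts are hypothetical helper names.

```shell
#!/bin/sh
# signed_hosts DIR
# Print the host names of all certs in DIR, one per line, by stripping a
# trailing ".pem" (anchored, so only the file suffix is removed).
signed_hosts() {
    ls "$1" | sed 's/\.pem$//' | sort
}

# unsigned_hosts WHITELIST DIR
# Print hosts listed in WHITELIST that have no cert in DIR yet, i.e. agents
# that have not requested a new cert since the CA was replaced.
unsigned_hosts() {
    tmp_want=$(mktemp) && tmp_have=$(mktemp) || return 1
    sort -u "$1" > "$tmp_want"
    signed_hosts "$2" > "$tmp_have"
    # comm -23 keeps lines that appear only in the first (whitelist) file.
    comm -23 "$tmp_want" "$tmp_have"
    rm -f "$tmp_want" "$tmp_have"
}
```

On the puppetmaster this would be called as: unsigned_hosts /etc/puppet/autosign.conf /var/lib/puppet/server/ssl/ca/signed (reading both paths may require sudo).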
Caveats
- On puppetmasters, we set ssldir twice: in [main] we set it to /var/lib/puppet/ssl/, in [master] we override it to /var/lib/puppet/server/ssl/. That's ok, it's just complicated. However puppet config print gets confused by this, so sudo puppet config print ssldir outputs /var/lib/puppet/ssl/ instead of /var/lib/puppet/server/ssl/. Newer versions of puppet support a --section master argument, but that's not available in 3.4.3.[7] Instead we can use sudo puppet master --configprint ssldir.
- /var/lib/puppet/client/ssl/ is referenced only in modules/puppet/manifests/self/config.pp, and not found anywhere in production. It's only used for self-hosted puppetmasters in labs, and per the comments in the manifest source this directory is specified as ssldir only to avoid conflicts with previously generated puppet certificates from the normal puppet setup.
- Prior to this procedure, files specified in the [master] section don't actually exist:
gage@palladium:~$ grep .pem /etc/puppet/puppet.conf | cut -d= -f2 | xargs sudo ls -ld
ls: cannot access /var/lib/puppet/server/ssl/certs/palladium.eqiad.wmnet.pem: No such file or directory
ls: cannot access /var/lib/puppet/server/ssl/private_keys/palladium.eqiad.wmnet.pem: No such file or directory
- Instead it loads /var/lib/puppet/server/ssl/{certs,private_keys}/puppet.pem based on ssldir = /var/lib/puppet/server/ssl/ and certname = puppet
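The [main] versus [master] override can be inspected directly from the file. The sketch below is a hypothetical helper (conf_get) that reads an INI-style file and falls back to [main] when the requested section does not set the key; it only mimics the override described here and is not Puppet's own settings parser.

```shell
#!/bin/sh
# conf_get FILE SECTION KEY
# Hypothetical helper: print KEY as set in [SECTION] of an INI-style file,
# falling back to the value in [main]. Exits non-zero if neither sets it.
conf_get() {
    awk -v want="$2" -v key="$3" '
        /^[ \t]*\[.*\][ \t]*$/ {          # section header such as [master]
            sub(/^[ \t]*\[/, ""); sub(/\][ \t]*$/, "")
            section = $0
            next
        }
        {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next             # not a "key = value" line
            k = substr(line, 1, eq - 1)
            v = substr(line, eq + 1)
            gsub(/[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k != key) next
            if (section == want) found = v
            else if (section == "main") fallback = v
        }
        END {
            if (found != "") print found
            else if (fallback != "") print fallback
            else exit 1
        }
    ' "$1"
}
```

With the configuration described above, conf_get /etc/puppet/puppet.conf master ssldir would print the [master] override, while asking for the agent section would fall back to the [main] value.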
Q&A
- Shall we regenerate the CA keypair or simply regenerate the cert?
- Puppet 3.1+ uses RSA 4096 + SHA256 for 'puppet cert generate', whereas older versions and our current keypair use RSA 1024 + SHA1. Regenerating the cert from the existing key will only address the validity dates, it won't fix the key size. Therefore we should regen a new keypair.
- Are there issues with supporting this new keypair on our remaining Lucid nodes?
- Our only remaining node <12.04 is Sodium. Its version of OpenSSL, 0.9.8k-7ubuntu8.23, supports sha256 (echo test | openssl dgst -sha256), therefore no problems are anticipated.
- What happens to in-flight puppet agent catalog runs when the agent is disabled?
- Nothing. The run completes as normal, and the request to disable the agent completes without errors and affects only subsequent agent runs. Just what we want.
- How shall we handle the fact that Bacula's Director forgets its queue of scheduled jobs when it is restarted?
- Will PyBal be affected by this cert change?
- No. The URLs in /etc/pybal/pybal.conf are HTTP, not HTTPS.
- Shall the CA cert be the self-signed one generated by puppet, generated from a WMF internal CA, or chained from one of our commercial certs?
- To use anything but a self-signed cert, we would have to use puppet's external CA support, which means: "Puppet cannot automatically distribute certificates in these configurations — you must have your own complete system for issuing and distributing certificates."[8] So we will continue using the standard self-signed CA cert.
- How do we determine what hash function is used on the certs?
- We can't, it's hardcoded to SHA256.
- Are the puppet CA cert or key stored in our private git repo?
- No, but they probably should be. TODO
- Why does palladium:/var/lib/puppet/server/ssl/certs/ contain the certs for 36 random hosts from March 2 2015?
- I don't know, seems like user error of some sort. /var/lib/puppet/server/ssl/private_keys/acamar.wikimedia.org.pem shouldn't be there either. (!) But it doesn't matter because this procedure removes that mess.
- How will Labs be affected by this change?
- Should the CA host be the puppetmaster (palladium) or another host? Is there a canonical host we use for signing other keys?
- How will IPsec transports behave during this procedure?
- Expected: they will fail during the time when one agent has run and there is a resulting CA cert mismatch. Therefore we will depool nodes using IPsec before replacing certs.
- Will existing Bacula backups be restorable after the procedure is complete?
- Yes, because we will back up host certs and original CA.
- This procedure is made more complex because Puppet's SSL certs are reused for other applications which require SSL: Bacula and Strongswan. Is there a puppet module which will make key management less painful so that we can maintain independent keys for these services?
- Maybe. Based on https://forge.puppetlabs.com/tags/ssl and https://forge.puppetlabs.com/tags/openssl, the "Approved" solution is this module, but it's unclear whether it would meet our needs: https://github.com/camptocamp/puppet-openssl
- How does LDAP use certs?
- /etc/ldap/ldap.conf references /etc/ssl/certs/ca-certificates.crt, which is independent of Puppet's certs.
- How does labs use certs?
- The labs puppetmaster node, labcontrol1001.wikimedia.org, is a normal client of the prod puppetmaster, with agent certs in /var/lib/puppet/ssl/. The puppetmaster certs in /var/lib/puppet/server/ssl/ are independent and will not be affected by this change.
- Do the website certs in /etc/ssl/ have any relationship with the puppet certs?
- No.
References
- https://blkperl.github.io/replace-puppet-ca.html
- http://www.masterzen.fr/2010/11/14/puppet-ssl-explained/
- http://docs.puppetlabs.com/puppet/3.5/reference/ssl_regenerate_certificates.html
Citations
[1] https://docs.puppetlabs.com/guides/passenger.html#notes-on-ssl-verification
[2] https://docs.puppetlabs.com/guides/scaling_multiple_masters.html#option-2-proxy-certificate-traffic
[3] https://docs.puppetlabs.com/guides/scaling_multiple_masters.html#centralize-reports-inventory-service-and-catalog-searching-storeconfigs
[4] https://docs.puppetlabs.com/guides/passenger.html#create-and-enable-the-puppet-master-vhost
[5] http://docs.puppetlabs.com/puppet/3/reference/ssl_autosign.html#basic-autosigning-autosignconf
[6] https://docs.puppetlabs.com/puppet/3.8/reference/dirs_ssldir.html
[7] https://docs.puppetlabs.com/puppet/latest/reference/config_print.html
[8] http://docs.puppetlabs.com/puppet/3/reference/config_ssl_external_ca.html