Puppet CA replacement

From Wikitech
This page was last updated in 2015 and may be outdated. Please update it if you can.

The use of salt and the specific hostnames given on this page are no longer correct.

Goal

To replace this CA keypair & cert before it expires, WITHOUT breaking Puppet, Bacula, or IPsec:

subject= /CN=sockpuppet.pmtpa.wmnet
issuer= /CN=sockpuppet.pmtpa.wmnet
notAfter=May  7 17:57:30 2016 GMT
  • clients: /var/lib/puppet/ssl/certs/ca.pem
  • servers: /var/lib/puppet/server/ssl/certs/ca.pem
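
To keep an eye on how much time is left before the expiry date above, the notAfter value can be converted into a day count. This is only a minimal sketch and not part of the original procedure: the function name days_left is made up for illustration, and it assumes GNU date (for the -d option) is available.

```shell
#!/bin/sh
# days_left NOT_AFTER [NOW]
# Print the number of whole days between NOW (default: the current time)
# and NOT_AFTER. Both arguments are date strings understood by GNU date.
# Hypothetical helper for illustration; requires GNU date for -d.
days_left() {
    end=$(date -u -d "$1" +%s) || return 1
    now=$(date -u -d "${2:-now}" +%s) || return 1
    # Integer division: partial days are truncated toward zero.
    echo $(( (end - now) / 86400 ))
}
```

On a client it could be fed from the CA cert like this: days_left "$(openssl x509 -in /var/lib/puppet/ssl/certs/ca.pem -noout -enddate | cut -d= -f2)". A negative result would mean the cert has already expired.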

Current state

In Production, Puppet master nodes use two sets of keys+certs:

  • /var/lib/puppet/server/ssl/:
    • served by the frontend load balancer
    • served by the puppet master application
    • this is what agents see when they connect to the puppetmaster
    • same files on palladium + strontium
  • /var/lib/puppet/ssl/:
    • client cert for the local puppet agent
    • also served by the Apache backend which proxies to the application
    • this is what the frontend sees when it connects to the backend
    • unique files on palladium vs. strontium

Puppet Agent->Master connection sequence:

  1. Clients connect to hostname "puppet", as given by server = puppet in the [agent] section of /etc/puppet/puppet.conf, on the default port: 8140.
  2. DNS redirects these connection requests to Palladium via CNAME:
    • puppet.eqiad.wmnet is an alias for palladium.eqiad.wmnet.
    • puppet.esams.wmnet is an alias for palladium.eqiad.wmnet.
    • puppet.codfw.wmnet is an alias for palladium.eqiad.wmnet.
    • puppet.ulsfo.wmnet is an alias for palladium.eqiad.wmnet.
  3. Apache listens for HTTPS connections on palladium.eqiad.wmnet:8140 using the certs in /var/lib/puppet/server/ssl/
    • Configured by /etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf
    • Request headers are set to pass the client certificate authentication information on to the puppet master process[1]
    • Requests for certificates are proxied to palladium:8141[2] (example: curl -k https://puppet:8140/production/certificate/cp3030.esams.wmnet)
    • Reports PUT by clients are centralized by proxying to palladium:8141[3]
    • File bucket requests, and "volatile" content requests should[citation needed] only be in one place, so these are also proxied to palladium:8141.
    • All other requests are load-balanced between palladium:8141 and strontium:8141
  4. For the backend, Apache listens for HTTPS connections on palladium.eqiad.wmnet:8141 and strontium.eqiad.wmnet:8141 using the certs in /var/lib/puppet/ssl/
    • Configured by /etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf
    • The puppet master itself is a Rack application run inside Apache by the Passenger module[4]
  5. The puppet master application uses the certs in /var/lib/puppet/server/ssl/, as configured in the [master] section of /etc/puppet/puppet.conf
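
The proxying rules in step 3 can be summarised as a small decision function. This is a sketch for orientation only: the function name route_request is hypothetical, and the URL patterns are assumptions modelled on the /{environment}/{indirection}/{key} layout seen in the curl example above. The authoritative rules are in the Apache site config and may differ.

```shell
#!/bin/sh
# route_request PATH
# Print the backend(s) that the frontend on port 8140 would proxy PATH to,
# following the rules listed in step 3. The patterns below are assumptions,
# not copied from the real Apache configuration.
route_request() {
    case "$1" in
        /*/certificate*/*)     echo "palladium:8141" ;;  # certificate requests
        /*/report/*)           echo "palladium:8141" ;;  # agent reports
        /*/file_bucket_file/*) echo "palladium:8141" ;;  # file bucket
        /*/volatile/*)         echo "palladium:8141" ;;  # "volatile" content
        *)                     echo "palladium:8141 strontium:8141" ;;  # load-balanced
    esac
}
```

For example, route_request /production/certificate/cp3030.esams.wmnet prints only palladium:8141, while a catalog request prints both backends.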

Bacula uses the client keypairs for TLS, and also for on-disk encryption:

  • In bacula::client, exec resource "concat-bacula-keypair" copies cert + private key into bacula-keypair-${::fqdn}.pem
    • Does not auto-update when the source files change, so manually move the destination file aside to trigger regeneration.
  • In bacula::director, exec resource "bacula_cp_private_key" copies private key into bacula-${::fqdn}.pem
    • Does not auto-update when the source files change, so manually move the destination file aside to trigger regeneration.
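
The "does not auto-update" behaviour is easier to see in code. The sketch below imitates an exec that only runs when its destination file is missing; it is not the actual Puppet resource. The function name build_keypair is hypothetical, and the cert-then-key order is an assumption.

```shell
#!/bin/sh
# build_keypair CERT KEY DEST
# Concatenate CERT and KEY into DEST, but only if DEST does not exist yet.
# This mimics an exec guarded by the existence of its output file: once
# DEST is present, later changes to CERT or KEY are never picked up.
# Illustration only; the concatenation order is an assumption.
build_keypair() {
    if [ -e "$3" ]; then
        echo "exists, skipping: $3"
        return 0
    fi
    # Restrictive umask because DEST contains a private key.
    ( umask 077 && cat "$1" "$2" > "$3" )
}
```

Because the guard only checks for the destination file, moving that file aside (as the procedure below does) is what forces the next agent run to rebuild it from the new cert.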

Strongswan uses the client keypairs for IPsec transport encryption:

  • Strongswan requires its keys to be in /etc/ipsec.d/, and will not accept symlinks from /var/lib/puppet/ssl/, therefore the keys + certs are defined as file resources in the "strongswan" class.
    • Files are auto-updated and the service is notified when the source files change. This will interrupt established transits.

Etcd uses the client keypairs for SSL:

  • In etcd::ssl, keys + certs are defined as file resources
    • Files are auto-updated and the service is notified when the source files change. This will interrupt etcd communication.

Procedure

All commands are to be run from Palladium, except as noted.

  1. Announce maintenance via email
  2. Announce & !log maintenance on IRC
  3. Schedule downtime in Icinga
  4. Stop puppet agents:
    • sudo salt -b 200 '*' cmd.run 'puppet agent --disable "puppet CA cert replacement"' >> /home/gage/puppet-ca-replacement.log
  5. Depool nodes running Strongswan (cp3030, cp1065)
  6. Copy the list of currently queued ("Scheduled") jobs in the Bacula Director so that they can be re-queued later
    • sudo salt 'helium.eqiad.wmnet' cmd.run "echo status director | sudo bconsole | grep Backup | awk '{print \$6}' > bacula-job-queue.`date +%Y-%m-%d_%T`" >> /home/gage/puppet-ca-replacement.log
  7. Stop Bacula service components
    • Director: sudo salt 'helium.eqiad.wmnet' cmd.run 'service bacula-director stop' >> /home/gage/puppet-ca-replacement.log
    • Storage Daemons: sudo salt -L 'helium.eqiad.wmnet,heze.codfw.wmnet' cmd.run 'service bacula-sd stop' >> /home/gage/puppet-ca-replacement.log
    • File Daemons: sudo salt -L 'antimony.wikimedia.org,bast1001.wikimedia.org,caesium.eqiad.wmnet,carbon.wikimedia.org,dbstore1001.eqiad.wmnet,gallium.wikimedia.org,helium.eqiad.wmnet,iron.wikimedia.org,lithium.eqiad.wmnet,mira.codfw.wmnet,palladium.eqiad.wmnet,silver.wikimedia.org,sodium.wikimedia.org,stat1002.eqiad.wmnet,stat1003.eqiad.wmnet,mwmaint1001.eqiad.wmnet,deploy1001.eqiad.wmnet,titanium.wikimedia.org,uranium.wikimedia.org,ytterbium.wikimedia.org' cmd.run 'service bacula-fd stop' >> /home/gage/puppet-ca-replacement.log
  8. Back up Bacula keys:
    • sudo salt -L 'antimony.wikimedia.org,bast1001.wikimedia.org,caesium.eqiad.wmnet,carbon.wikimedia.org,dbstore1001.eqiad.wmnet,gallium.wikimedia.org,helium.eqiad.wmnet,iron.wikimedia.org,lithium.eqiad.wmnet,mira.codfw.wmnet,palladium.eqiad.wmnet,silver.wikimedia.org,sodium.wikimedia.org,stat1002.eqiad.wmnet,stat1003.eqiad.wmnet,mwmaint1001.eqiad.wmnet,deploy1001.eqiad.wmnet,titanium.wikimedia.org,uranium.wikimedia.org,ytterbium.wikimedia.org' cmd.run 'sudo mv -v /var/lib/puppet/ssl/private_keys/bacula-keypair-`hostname -f`.pem /var/lib/puppet/ssl/private_keys/bacula-keypair-`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
    • sudo salt 'helium.eqiad.wmnet' cmd.run 'sudo mv -v /var/lib/puppet/ssl/private_keys/bacula-`hostname -f`.pem /var/lib/puppet/ssl/private_keys/bacula-`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
  9. Ensure a few minutes have passed to allow the last agent runs to complete and PUT their reports to the puppetmaster. The puppetmaster frontend should now be idle. Confirm:
    • sudo tail -f /var/log/apache2/puppetmaster.log &
    • Keep this running; we'll use it later to confirm the restart
  10. Stop the puppetmaster Apache 'site' on Palladium, but leave Apache running for PyBal:
    • sudo a2dissite 50-puppetmaster-wikimedia-org.conf && sudo service apache2 reload
    • Confirm it's no longer listening on 8140 or 8141: sudo lsof -i -n -P | grep apache | grep LISTEN | awk '{print $9}' | sort -u
  11. Temporarily create an autosigning whitelist[5] for all currently-signed hosts so that their new certs may be accepted:
    • sudo ls /var/lib/puppet/server/ssl/ca/signed | sed 's/\.pem$//' | sudo tee /etc/puppet/autosign.conf
  12. Back up puppet server SSL dirs:
    • sudo salt -L 'palladium.eqiad.wmnet,strontium.eqiad.wmnet' cmd.run 'mv -v /var/lib/puppet/server/ssl /var/lib/puppet/server/ssl.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
  13. Preserve inventory.txt on palladium because it is a historical record (optional):
    • sudo mkdir -p /var/lib/puppet/server/ssl/ca && sudo cp /var/lib/puppet/server/ssl.*/ca/inventory.txt /var/lib/puppet/server/ssl/ca/
  14. Prep puppet agent SSL dirs: keep RSA keypair and back up agent cert but remove CA cert, CRL and CSR:
    • sudo salt -b 200 '*' cmd.run 'for i in certificate_requests/`hostname -f`.pem certs/ca.pem crl.pem ; do rm /var/lib/puppet/ssl/$i; done; mv -v /var/lib/puppet/ssl/certs/`hostname -f`.pem /var/lib/puppet/ssl/certs/`hostname -f`.pem.`date +%Y-%m-%d_%T`' >> /home/gage/puppet-ca-replacement.log
  15. Sanity check: verify correct clock/date on palladium before proceeding with next step
  16. Regenerate the puppet CA keys & cert on palladium:
    • sudo puppet cert --generate `hostname -f`
    • Creates CA keypair + cert, server keypair + cert, and associated files (CA key passphrase, serial, CRL, inventory.txt)[6] in server/ssl/
  17. Copy server/ssl/ from palladium to strontium using a tar pipe:
    • From admin's laptop: ssh palladium.eqiad.wmnet 'sudo tar cvf - /var/lib/puppet/server/ssl' | ssh strontium.eqiad.wmnet 'sudo tar xvf - -C /'
  18. Fix puppet.conf on palladium + strontium by manually editing the file AND merging a change in the Git repo to remove these lines:
    • certname = puppet
    • hostcert = /var/lib/puppet/server/ssl/certs/palladium.eqiad.wmnet.pem
    • hostprivkey = /var/lib/puppet/server/ssl/private_keys/palladium.eqiad.wmnet.pem
      • This change is needed because prior to this procedure, the hostcert and hostprivkey settings were invalid and ignored, and the cert is called puppet.pem (Subject: CN=puppet). With the newly generated server cert, the CN is the FQDN, and the "puppet" service name is supported by DNS:puppet in the X509v3 Subject Alternative Name field of the cert. hostcert and hostprivkey are now correct but no longer need to be explicitly set because they use their default values.
  19. The Apache backend on port 8141 uses the puppet agent certs for SSL, so puppet agent must be run on these nodes before we can start the Apache puppetmaster. To bootstrap past the circular dependency of needing to run the agent before the master is started, temporarily enable and start the standalone (webrick) puppetmaster, run the agents, then shut it down and disable it again:
    • sudo sed -i 's/START=no/START=yes/' /etc/default/puppetmaster && sudo service puppetmaster start
    • sudo salt -L 'palladium.eqiad.wmnet,strontium.eqiad.wmnet' cmd.run 'puppet agent --enable && puppet agent --onetime --no-daemonize' >> /home/gage/puppet-ca-replacement.log
    • sudo sed -i 's/START=yes/START=no/' /etc/default/puppetmaster && sudo service puppetmaster stop
    • Now Apache has all the certs it needs to start up the real Passenger-based puppetmaster.
  20. Restart Apache puppetmaster 'site' & confirm success
    • Ensure we're still tailing Apache's log from step 9
    • sudo a2ensite 50-puppetmaster-wikimedia-org.conf && sudo service apache2 reload
    • Confirm it's listening on 8140 and 8141: sudo lsof -i -n -P | grep apache | grep LISTEN | awk '{print $9}' | sort -u
  21. Sanity check: run puppet agent & observe result on cp1065:
    • From admin's laptop: ssh cp1065.eqiad.wmnet "sudo puppet agent -tv"
  22. Enable puppet agents for all other servers
    • sudo salt -b 200 '*' cmd.run 'puppet agent --enable' >> /home/gage/puppet-ca-replacement.log
  23. Wait ~23 minutes for agent to run on all nodes
    • sleep 1380 && echo "ready to continue"
  24. Confirm that agents have run by counting auto-signed certs, then remove autosign.conf:
    • sudo wc -l /etc/puppet/autosign.conf
    • sudo ls /var/lib/puppet/server/ssl/ca/signed | wc -l
    • sudo rm /etc/puppet/autosign.conf
  25. The next agent invocation will:
    • Copy SSL certs + keys to /etc/ipsec.d/ for Strongswan, and restart services.
    • Create copies of SSL certs + keys for Bacula, and restart services.
      • On Helium: bacula-`hostname -f`.pem
      • On all hosts: bacula-keypair-`hostname -f`.pem
  26. Confirm:
    • Successful agent runs
    • Successful Bacula operation
    • Successful IPsec transport establishment: ipsec status
  27. Re-queue jobs in Bacula: TODO based on file created above
  28. Re-pool IPsec nodes cp3030 & cp1065
  29. Restore any Icinga maintenance settings for puppet agent checks that were already in place before this maintenance began
  30. Note any hosts which were offline during this maintenance, as their puppet agents will need to be updated
  31. Check Icinga for any remaining issues
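
Steps 11, 24 and 30 all come down to comparing the autosign whitelist with the certs the new CA has actually signed. Rather than only comparing line counts, the two lists can be diffed to name the hosts that have not yet checked in. This is a sketch under assumptions: the function names signed_hosts and pending_hosts are hypothetical, and it expects one hostname per line in the whitelist, as produced in step 11.

```shell
#!/bin/sh
# signed_hosts DIR
# Print one hostname per signed cert (*.pem) found in DIR, sorted.
signed_hosts() {
    for f in "$1"/*.pem; do
        [ -e "$f" ] || continue      # directory may hold no certs yet
        basename "$f" .pem
    done | sort
}

# pending_hosts WHITELIST DIR
# Print hosts listed in WHITELIST that have no signed cert in DIR,
# i.e. agents that have not yet requested a cert from the new CA.
# Hypothetical helper; run it before autosign.conf is removed.
pending_hosts() {
    wl=$(mktemp) && sg=$(mktemp) || return 1
    sort "$1" > "$wl"
    signed_hosts "$2" > "$sg"
    comm -23 "$wl" "$sg"             # lines only in the whitelist
    rm -f "$wl" "$sg"
}
```

On palladium this might be run as sudo sh -c '. ./helpers.sh; pending_hosts /etc/puppet/autosign.conf /var/lib/puppet/server/ssl/ca/signed', where helpers.sh is a placeholder name for a file containing the functions above. An empty result would suggest every whitelisted host has a new cert.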

Caveats

  1. On puppetmasters, we set ssldir twice: in [main] we set it to /var/lib/puppet/ssl/, in [master] we override it to /var/lib/puppet/server/ssl/. That's ok, it's just complicated. However puppet config print gets confused by this, so sudo puppet config print ssldir outputs /var/lib/puppet/ssl/ instead of /var/lib/puppet/server/ssl/. Newer versions of puppet support a --section master argument, but that's not available in 3.4.3.[7] Instead we can use sudo puppet master --configprint ssldir.
  2. /var/lib/puppet/client/ssl/ is referenced only in modules/puppet/manifests/self/config.pp, and not found anywhere in production. It's only used for self-hosted puppetmasters in labs, and per the comments in the manifest source this directory is specified as ssldir only to avoid conflicts with previously generated puppet certificates from the normal puppet setup.
  3. Prior to this procedure, files specified in the [master] section don't actually exist:
    • gage@palladium:~$ grep .pem /etc/puppet/puppet.conf | cut -d= -f2 | xargs sudo ls -ld
    • ls: cannot access /var/lib/puppet/server/ssl/certs/palladium.eqiad.wmnet.pem: No such file or directory
    • ls: cannot access /var/lib/puppet/server/ssl/private_keys/palladium.eqiad.wmnet.pem: No such file or directory
    • Instead it loads /var/lib/puppet/server/ssl/{certs,private_keys}/puppet.pem based on ssldir = /var/lib/puppet/server/ssl/ and certname = puppet
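
The ssldir override described in caveat 1 can be modelled with a few lines of shell. This is a deliberately simplified sketch, not Puppet's actual config parser: the function names ini_get and effective_setting are hypothetical, and the only rule modelled is that a value in a named section wins over the same key in [main].

```shell
#!/bin/sh
# ini_get FILE SECTION KEY
# Print the value of KEY inside [SECTION] of FILE, or nothing if unset.
ini_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[.*\]/ {
            current = $0
            gsub(/[^A-Za-z0-9_]/, "", current)   # "[master]" -> "master"
            next
        }
        current == section {
            n = index($0, "=")
            if (n == 0) next
            k = substr($0, 1, n - 1)
            v = substr($0, n + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) value = v
        }
        END { if (value != "") print value }
    ' "$1"
}

# effective_setting FILE SECTION KEY
# Value from [SECTION] if present, otherwise fall back to [main].
# Simplified model of the override behaviour described above.
effective_setting() {
    v=$(ini_get "$1" "$2" "$3")
    [ -n "$v" ] || v=$(ini_get "$1" main "$3")
    echo "$v"
}
```

With a puppet.conf laid out as described in caveat 1, effective_setting /etc/puppet/puppet.conf master ssldir would print the [master] value, while asking for a section that does not set ssldir falls back to [main].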

Q&A

  1. Shall we regenerate the CA keypair or simply regenerate the cert?
    • Puppet 3.1+ uses RSA 4096 + SHA256 for 'puppet cert generate', whereas older versions and our current keypair use RSA 1024 + SHA1. Regenerating the cert from the existing key would only address the validity dates; it would not fix the key size. Therefore we should generate a new keypair.
  2. Are there issues with supporting this new keypair on our remaining Lucid nodes?
    • Our only remaining node <12.04 is Sodium. Its version of OpenSSL, 0.9.8k-7ubuntu8.23, supports sha256 (echo test | openssl dgst -sha256), therefore no problems are anticipated.
  3. What happens to in-flight puppet agent catalog runs when the agent is disabled?
    • Nothing. The run completes as normal, and the request to disable the agent completes without errors and affects only subsequent agent runs. Just what we want.
  4. How shall we handle the fact that Bacula's Director forgets its queue of scheduled jobs when it is restarted?
  5. Will PyBal be affected by this cert change?
    • No. The URLs in /etc/pybal/pybal.conf are HTTP, not HTTPS.
  6. Shall the CA cert be the self-signed one generated by puppet, generated from a WMF internal CA, or chained from one of our commercial certs?
    • To use anything but a self-signed cert, we would have to use puppet's external CA support, which means: "Puppet cannot automatically distribute certificates in these configurations — you must have your own complete system for issuing and distributing certificates."[8] So we will continue using the standard self-signed CA cert.
  7. How do we determine what hash function is used on the certs?
    • We can't, it's hardcoded to SHA256.
  8. Are the puppet CA cert or key stored in our private git repo?
    • No, but they probably should be. TODO
  9. Why does palladium:/var/lib/puppet/server/ssl/certs/ contain the certs for 36 random hosts from March 2 2015?
    • I don't know, seems like user error of some sort. /var/lib/puppet/server/ssl/private_keys/acamar.wikimedia.org.pem shouldn't be there either. (!) But it doesn't matter because this procedure removes that mess.
  10. How will Labs be affected by this change?
  11. Should the CA host be the puppetmaster (palladium) or another host? Is there a canonical host we use for signing other keys?
  12. How will IPsec transports behave during this procedure?
    • Expected: they will fail during the time when one agent has run and there is a resulting CA cert mismatch. Therefore we will depool nodes using IPsec before replacing certs.
  13. Will existing Bacula backups be restorable after the procedure is complete?
    • Yes, because we will back up host certs and original CA.
  14. This procedure is made more complex because Puppet's SSL certs are reused for other applications which require SSL: Bacula and Strongswan. Is there a puppet module which will make key management less painful so that we can maintain independent keys for these services?
  15. How does LDAP use certs?
    • /etc/ldap/ldap.conf references /etc/ssl/certs/ca-certificates.crt, which is independent of Puppet's certs.
  16. How does labs use certs?
    • The labs puppetmaster node, labcontrol1001.wikimedia.org, is a normal client of the prod puppetmaster, with agent certs in /var/lib/puppet/ssl/. The puppetmaster certs in /var/lib/puppet/server/ssl/ are independent and will not be affected by this change.
  17. Do the website certs in /etc/ssl/ have any relationship with the puppet certs?
    • No.
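
Regarding questions 1 and 7, the key size and signature hash of any cert can be read from the text dump that openssl produces. The filter below is a minimal sketch: the function name cert_strength is hypothetical, and it assumes the usual "Signature Algorithm:" and "Public-Key: (N bit)" lines of openssl x509 -text output (older OpenSSL releases print "Public Key" without the hyphen, which is also handled).

```shell
#!/bin/sh
# cert_strength
# Read `openssl x509 -noout -text` output on stdin and print the signature
# algorithm and the public key size, one per line, without duplicates.
# Hypothetical helper; depends on openssl's human-readable output format.
cert_strength() {
    sed -n \
        -e 's/^ *Signature Algorithm: *//p' \
        -e 's/^.*Public[- ]Key: *(\([0-9]*\) bit).*$/\1 bit/p' | sort -u
}
```

For example: sudo openssl x509 -in /var/lib/puppet/server/ssl/certs/ca.pem -noout -text | cert_strength. Running it against the old and new CA certs should show the change in key size and hash described in question 1.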

References

Citations