Fundraising/techops/procedures/services-datacenter cutover procedure

Note there is a related task to improve this: T266810

  1. Stop campaigns
    1. https://collab.wikimedia.org/wiki/Fundraising/Engineering/Shutting_the_pipeline_down
  2. Promote codfw frpm server
    1. adjust origin for your local copies of frack git repositories
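      For example, for a local checkout of a frack repository (the URL below is a placeholder; substitute the actual codfw frpm hostname and repository path):
      git remote set-url origin ssh://<codfw frpm fqdn>/<repository path>
      git remote -v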
    2. update $puppet_ca and $ssl_ca_notification_host in puppet/hieradata/common.yaml
    3. deploy the puppet changes
  3. Globally disable services that write to frqueue and frdb
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml to include "civi.*|frdata.*|frpig.*|payments.*"
    2. deploy the puppet changes
    3. do a puppet-agent run on civi*, payments*, frpig*
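      For example, on each affected host (assuming agents are run manually with the standard puppet CLI):
      sudo puppet agent -t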
    4. make sure any running jobs on civi servers terminate
  4. Switch production/external DNS service hostnames to point to codfw servers
    1. In main operations dns repo, edit dns/templates/wikimedia.org
      frbast 5M IN CNAME frbast-codfw
      frmon 1H IN CNAME frmon-codfw
      civicrm 1H IN A 208.80.152.232
      payments 5M IN CNAME payments-codfw
      payments-listener 5M IN CNAME payments-listener-codfw
    2. edit dns/templates/152.80.208.in-addr.arpa zone
      232 1H IN PTR civicrm.wikimedia.org.
    3. deploy the DNS changes
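    Once the changes are deployed, verify that the authoritative nameservers return the new records before moving on, e.g. (ns0.wikimedia.org is assumed to be one of the production authoritative nameservers):
      dig +short payments.wikimedia.org @ns0.wikimedia.org
      dig +short civicrm.wikimedia.org @ns0.wikimedia.org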
  5. Promote codfw auth server to primary KDC
    1. update $kerberos_kdc_primary in puppet/hieradata/common.yaml
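      This is a single hiera value; the hostname below is a placeholder:
      kerberos_kdc_primary: '<codfw auth server fqdn>'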
    2. deploy the puppet changes
  6. Refactor redis replication
    1. Swap redis origin server to one of the codfw redis servers according to this plan: Redis Frqueue
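      The exact hosts and order come from the Redis Frqueue plan; the promotion itself is typically along these lines (flags, port, and any auth options are assumptions based on a default Redis setup):
      redis-cli -h <new codfw redis origin> replicaof no one
      redis-cli -h <codfw redis replica> replicaof <new codfw redis origin> 6379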
  7. Promote a codfw payments server to paymentsdb mariadb origin (we usually use the lowest numbered host)
    1. stop replication and enable writes on the new paymentsdb origin server
      MariaDB [(none)]> stop replica; reset replica all;
      MariaDB [(none)]> set global read_only = off;
    2. switch replication on each of the other codfw payments servers
      MariaDB [(none)]> stop replica; change master to MASTER_HOST = '<paymentsdb origin server fqdn>'; start replica;
    3. update $payments_wiki_db_writer in puppet/hieradata/site/codfw.yaml
    4. deploy the puppet change
  8. Promote a codfw database server to fundraisingdb mariadb origin
    1. stop replication client on each frdb server
      MariaDB [(none)]> stop replica;
    2. choose a database server to promote to origin; if the servers aren't at the same master log position, you'll need the one that is ahead
      MariaDB [(none)]> show replica status\G
    3. disable replication client and enable writes on new origin server
      MariaDB [(none)]> reset replica all;
      MariaDB [(none)]> set global read_only = off;
    4. adjust replication client on each replica accordingly
      MariaDB [(none)]> change master to MASTER_HOST = '<fundraisingdb origin server fqdn>'; start replica;
    5. test replication
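      For example, on each replica confirm that both replication threads are running and the reported lag drops to zero (and optionally write a throwaway row on the origin and verify it replicates):
      MariaDB [(none)]> show replica status\G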
    6. update $fundraising_db_origin in puppet/hieradata/site/common.yaml
    7. deploy the puppet changes
    8. Switch the fundraisingdb DNS hostnames to point to the codfw servers
      1. (in prod dns repository) dns/templates/wikimedia.org
        fundraisingdb-write 5M IN CNAME frdb2001.frack.codfw.wmnet.
        fundraisingdb-read 5M IN CNAME frdb2002.frack.codfw.wmnet.
      2. deploy the DNS changes
  9. Promote the codfw frdb-analytics server to frdb-analytics mariadb origin
    1. stop replication client and enable writes on codfw frdb-analytics server
      MariaDB [(none)]> stop replica;
      MariaDB [(none)]> set global read_only = off;
    2. update $analytics_db_writer in puppet/hieradata/site/common.yaml
    3. deploy the puppet changes
  10. Restore Grafana content on the codfw frmon server
    Grafana stores its dashboards in a SQLite database at /var/lib/grafana/grafana.db, and the codfw frmon server has a recent copy in its snapshot tree.
    1. stop grafana server
      systemctl stop grafana-server.service
    2. refresh grafana database
      rsync -var --delete /srv/snapshots/{eqiad frmon}/{snapshot date}/var/lib/grafana/ /var/lib/grafana/
    3. start grafana server
      systemctl start grafana-server.service
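    To confirm the restore worked, Grafana's health endpoint can be checked locally (this assumes the default local port 3000; adjust if Grafana sits behind a proxy):
      curl -s http://localhost:3000/api/health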
  11. Evaluate the situation.
    Once the puppet and DNS changes propagate, we are able to re-enable payments and payments-listener and put campaigns back online, but whether that makes sense depends on the situation. With eqiad down, data redundancy is reduced to fewer than half the usual machines, all in one geographic location. We need to be confident in our ability to get the backend back online quickly, so we don't generate a queue backlog that is impractical to consume.
  12. [maybe] Take payments and payments-listener out of maintenance mode
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml
    2. deploy puppet changes
  13. Adjust incoming mail routing of fundraising domains in the production puppet repo: puppet/modules/role/templates/exim/exim4.conf.mx.erb
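    Once the change is deployed to the MX hosts, exim's address-test mode can confirm that mail for the fundraising domains now routes to codfw (the address below is only an example):
      sudo exim4 -bt test@donate.wikimedia.org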
  14. Adjust central logging configuration to exclude missing eqiad servers
    1. update $syslog_servers in puppet/hieradata/site/*.yaml
    2. update $siem_log_servers in puppet/hieradata/site/common.yaml
    3. deploy the puppet changes
  15. Take the codfw civicrm server out of maintenance mode
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml
    2. deploy the puppet change
  16. Migrate civicrm process-control jobs to codfw civicrm server (to be handled by FR-Tech)
    1. make sure puppet has process-control enabled
    2. enable jobs individually in localsettings, remembering to adjust CIVICRM_SMTP_HOST where it is used
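      To find the spots that reference the SMTP host (the localsettings path is deployment-specific):
      grep -rn CIVICRM_SMTP_HOST <localsettings directory>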
    3. deploy the process-control changes
    4. repeat until all jobs are running
  17. Promote codfw puppet server to puppet-ca
    FIX: needs testing, see Puppet_CA_replacement
  18. Adjust backups
    1. Remove eqiad hosts from puppet/modules/fundraising/templates/archive_sync.erb
    2. Evaluate puppet/modules/fundraising/templates/archive_purge to make sure it doesn't purge any data we will need
  19. Adjust $nagios_masters in puppet/hieradata/site/common.yaml as necessary
  20. Restore FundraisingCA and fundraising cert stores from backups to the new primary frpm server
    1. fetch a backup from the eqiad frpm server from the file archive on frlog or frbackup
    2. decrypt using one of the fr_tech_ops gpg keys
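      Assuming the backup is an encrypted tarball (file names here are placeholders):
      gpg --decrypt <backup file>.tar.gz.gpg > <backup file>.tar.gz
      tar -xzf <backup file>.tar.gz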
    3. restore /etc/ssl/fundraisingCA and /etc/ssl/frack_certs_backup
    4. update $ssl_ca_notification_host in puppet/hieradata/site/common.yaml
    5. deploy the puppet changes
  21. Restore backups to codfw analytics application server
    FIX: sync /home, /var/lib/git, /srv so we don't have to restore them
  22. Obtain a new frdev database and application server, restore from backups