Fundraising/techops/procedures/services-datacenter cutover procedure

Note there is a related task to improve this: T266810

  1. Stop campaigns
    1. https://collab.wikimedia.org/wiki/Fundraising/Engineering/Shutting_the_pipeline_down
  2. Promote codfw frpm server
    1. adjust origin for your local copies of frack git repositories
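      For example, for a local checkout of a frack repository (the URL below is a placeholder; substitute the actual codfw frpm hostname and repository path):
      git remote set-url origin ssh://<codfw frpm fqdn>/<repository path>
      git remote -v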
    2. update $puppet_ca and $ssl_ca_notification_host in puppet/hieradata/common.yaml
    3. deploy the puppet changes
  3. Globally disable services that write to frqueue and frdb
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml to include "civi.*|frdata.*|frpig.*|payments.*"
    2. deploy the puppet changes
    3. do a puppet-agent run on civi*, payments*, frpig*
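      For example, on each affected host (assuming agents are run manually with the standard puppet CLI):
      sudo puppet agent -t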
    4. make sure any running jobs on civi servers terminate
  4. Switch production/external DNS service hostnames to point to codfw servers
    1. In main operations dns repo, edit dns/templates/wikimedia.org
      frbast 5M IN CNAME frbast-codfw
      frmon 1H IN CNAME frmon-codfw
      civicrm 1H IN A 208.80.152.232
      payments 5M IN CNAME payments-codfw
      payments-listener 5M IN CNAME payments-listener-codfw
    2. edit dns/templates/152.80.208.in-addr.arpa zone
      232 1H IN PTR civicrm.wikimedia.org.
    3. deploy the DNS changes
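    Once the changes are deployed, verify that the authoritative nameservers return the new records before moving on, e.g. (ns0.wikimedia.org is assumed to be one of the production authoritative nameservers):
      dig +short payments.wikimedia.org @ns0.wikimedia.org
      dig +short civicrm.wikimedia.org @ns0.wikimedia.org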
  5. Promote codfw auth server to primary KDC
    1. update $kerberos_kdc_primary in puppet/hieradata/common.yaml
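      This is a single hiera value; the hostname below is a placeholder:
      kerberos_kdc_primary: '<codfw auth server fqdn>'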
    2. deploy the puppet changes
  6. Refactor redis replication
    1. Swap redis origin server to one of the codfw redis servers according to this plan: Redis Frqueue
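      The exact hosts and order come from the Redis Frqueue plan; the promotion itself is typically along these lines (flags, port, and any auth options are assumptions based on a default Redis setup):
      redis-cli -h <new codfw redis origin> replicaof no one
      redis-cli -h <codfw redis replica> replicaof <new codfw redis origin> 6379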
  7. Promote a codfw payments server to paymentsdb mariadb origin (we usually use the lowest numbered host)
    1. stop replication and enable writes on the new paymentsdb origin server
      MariaDB [(none)]> stop replica; reset replica all;
      MariaDB [(none)]> set global read_only = off;
    2. switch replication on each of the other codfw payments servers
      MariaDB [(none)]> stop replica; change master to MASTER_HOST = '<paymentsdb origin server fqdn>'; start replica;
    3. update $payments_wiki_db_writer in puppet/hieradata/site/codfw.yaml
    4. deploy the puppet change
  8. Promote a codfw database server to fundraisingdb mariadb origin
    1. stop replication client on each frdb server
      MariaDB [(none)]> stop replica;
    2. choose a database server to promote to origin; if the servers aren't at the same master log position, you'll need the one that is ahead
      MariaDB [(none)]> show replica status\G
    3. disable replication client and enable writes on new origin server
      MariaDB [(none)]> reset replica all;
      MariaDB [(none)]> set global read_only = off;
    4. adjust replication client on each replica accordingly
      MariaDB [(none)]> change master to MASTER_HOST = '<fundraisingdb origin server fqdn>'; start replica;
    5. test replication
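      For example, on each replica confirm that both replication threads are running and the reported lag drops to zero (and optionally write a throwaway row on the origin and verify it replicates):
      MariaDB [(none)]> show replica status\G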
    6. update $fundraising_db_origin in puppet/hieradata/site/common.yaml
    7. deploy the puppet changes
    8. Switch the fundraisingdb DNS hostnames to point to the codfw servers
      1. (in prod dns repository) dns/templates/wikimedia.org
        fundraisingdb-write 5M IN CNAME frdb2001.frack.codfw.wmnet.
        fundraisingdb-read 5M IN CNAME frdb2002.frack.codfw.wmnet.
      2. deploy the DNS changes
  9. Promote the codfw frdb-analytics server to frdb-analytics mariadb origin
    1. stop replication client and enable writes on codfw frdb-analytics server
      MariaDB [(none)]> stop replica;
      MariaDB [(none)]> set global read_only = off;
    2. update $analytics_db_writer in puppet/hieradata/site/common.yaml
    3. deploy the puppet changes
  10. Restore Grafana content on the codfw frmon server
    Grafana stores its dashboards in a SQLite database at /var/lib/grafana/grafana.db, and the codfw frmon server has a recent copy in its snapshot tree.
    1. stop grafana server
      systemctl stop grafana-server.service
    2. refresh grafana database
      rsync -var --delete /srv/snapshots/{eqiad frmon}/{snapshot date}/var/lib/grafana/ /var/lib/grafana/
    3. start grafana server
      systemctl start grafana-server.service
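    To confirm the restore worked, Grafana's health endpoint can be checked locally (this assumes the default local port 3000; adjust if Grafana sits behind a proxy):
      curl -s http://localhost:3000/api/health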
  11. Evaluate the situation.
    Once the puppet and DNS changes propagate, we are able to re-enable payments and payments-listener and put campaigns back online, but whether that makes sense depends on the situation. With eqiad down, data redundancy is reduced to fewer than half the usual machines, all in one geographic location. We need to be confident in our ability to get the backend back online quickly, so we don't generate a queue backlog that is impractical to consume.
  12. [maybe] Take payments and payments-listener out of maintenance mode
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml
    2. deploy puppet changes
  13. Adjust incoming mail routing of fundraising domains in the production puppet repo: puppet/modules/role/templates/exim/exim4.conf.mx.erb
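    Once the change is deployed to the MX hosts, exim's address-test mode can confirm that mail for the fundraising domains now routes to codfw (the address below is only an example):
      sudo exim4 -bt test@donate.wikimedia.org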
  14. Adjust central logging configuration to exclude missing eqiad servers
    1. update $syslog_servers in puppet/hieradata/site/*.yaml
    2. update $siem_log_servers in puppet/hieradata/site/common.yaml
    3. deploy the puppet changes
  15. Take the codfw civicrm server out of maintenance mode
    1. update $down_for_maintenance in puppet/hieradata/site/common.yaml
    2. deploy the puppet change
  16. Migrate civicrm process-control jobs to codfw civicrm server (to be handled by FR-Tech)
    1. make sure puppet has process-control enabled
    2. enable jobs individually in localsettings, remembering to adjust CIVICRM_SMTP_HOST where it is used
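      To find the spots that reference the SMTP host (the localsettings path is deployment-specific):
      grep -rn CIVICRM_SMTP_HOST <localsettings directory>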
    3. deploy the process-control changes
    4. repeat until all jobs are running
  17. Promote codfw puppet server to puppet-ca
    FIX: needs testing, see Puppet_CA_replacement
  18. Adjust backups
    1. Remove eqiad hosts from puppet/modules/fundraising/templates/archive_sync.erb
    2. Evaluate puppet/modules/fundraising/templates/archive_purge to make sure it doesn't purge any data we will need
  19. Adjust $nagios_masters in puppet/hieradata/site/common.yaml as necessary
  20. Restore FundraisingCA and fundraising cert stores from backups to the new primary frpm server
    1. fetch a backup from the eqiad frpm server from the file archive on frlog or frbackup
    2. decrypt using one of the fr_tech_ops gpg keys
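      Assuming the backup is an encrypted tarball (file names here are placeholders):
      gpg --decrypt <backup file>.tar.gz.gpg > <backup file>.tar.gz
      tar -xzf <backup file>.tar.gz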
    3. restore /etc/ssl/fundraisingCA and /etc/ssl/frack_certs_backup
    4. update $ssl_ca_notification_host in puppet/hieradata/site/common.yaml
    5. deploy the puppet changes
  21. Restore backups to codfw analytics application server
    FIX: sync /home, /var/lib/git, /srv so we don't have to restore them
  22. Obtain a new frdev database and application server, restore from backups