Fundraising/techops/procedures/services-datacenter cutover procedure
Note there is a related task to improve this: T266810
- Stop campaigns
- Promote codfw frpm server
- adjust origin for your local copies of frack git repositories
- update $puppet_ca and $ssl_ca_notification_host in puppet/hieradata/common.yaml
- deploy the puppet changes
- Globally disable services that write to frqueue and frdb
- update $down_for_maintenance in puppet/hieradata/site/common.yaml to include "civi.*|frdata.*|frpig.*|payments."
- deploy the puppet changes
- do a puppet-agent run on civi*, payments*, frpig*
- make sure any running jobs on civi servers terminate
- Switch production/external DNS service hostnames to point to codfw servers
- In main operations dns repo, edit dns/templates/wikimedia.org
- frbast 5M IN CNAME frbast-codfw
- frmon 1H IN CNAME frmon-codfw
- civicrm 1H IN A 208.80.152.232
- payments 5M IN CNAME payments-codfw
- payments-listener 5M IN CNAME payments-listener-codfw
- edit dns/templates/152.80.208.in-addr.arpa zone
- 232 1H IN PTR civicrm.wikimedia.org.
- deploy the DNS changes
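The TTL shorthand in the records above (5M, 1H) gives a rough idea of how long resolvers may keep serving cached answers after the DNS deploy. A minimal sketch, assuming the usual zone-file unit suffixes and assuming the records being replaced carried the same TTLs as the new ones (the procedure does not state this); `ttl_seconds`, `parse_record` and `longest_ttl` are hypothetical helper names:

```python
import re

# Seconds per unit suffix in zone-file TTL shorthand (e.g. 5M, 1H).
_TTL_UNITS = {"S": 1, "M": 60, "H": 3600, "D": 86400, "W": 604800}


def ttl_seconds(ttl: str) -> int:
    """Convert a TTL such as '5M', '1H' or '300' to seconds."""
    match = re.fullmatch(r"(\d+)([SMHDW]?)", ttl.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised TTL: {ttl!r}")
    value, unit = match.groups()
    # A bare number has no suffix and is already in seconds.
    return int(value) * _TTL_UNITS.get(unit, 1)


def parse_record(line: str) -> dict:
    """Split a record like 'payments 5M IN CNAME payments-codfw' into fields."""
    name, ttl, rr_class, rr_type, target = line.split(None, 4)
    return {
        "name": name,
        "ttl": ttl_seconds(ttl),
        "class": rr_class,
        "type": rr_type,
        "target": target,
    }


# The records listed in this step, copied as written.
RECORDS = [
    "frbast 5M IN CNAME frbast-codfw",
    "frmon 1H IN CNAME frmon-codfw",
    "civicrm 1H IN A 208.80.152.232",
    "payments 5M IN CNAME payments-codfw",
    "payments-listener 5M IN CNAME payments-listener-codfw",
]


def longest_ttl(records) -> int:
    """Return the largest TTL, in seconds, among the given record lines."""
    return max(parse_record(record)["ttl"] for record in records)


if __name__ == "__main__":
    print(f"longest TTL: {longest_ttl(RECORDS)} seconds")
```

The longest TTL is only an upper bound on resolver caching under the stated assumption; it says nothing about how long the DNS deploy itself takes.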
- Promote codfw auth server to primary KDC
- update $kerberos_kdc_primary in puppet/hieradata/common.yaml
- deploy the puppet changes
- Refactor redis replication
- Swap redis origin server to one of the codfw redis servers according to this plan: Redis Frqueue
- Promote a codfw payments server to paymentsdb mariadb origin (we usually use the lowest numbered host)
- stop replication and enable writes on the new paymentsdb origin server
- MariaDB [(none)]> stop replica; reset replica all;
- MariaDB [(none)]> set global read_only = off;
- switch replication on each of the other codfw payments servers
- stop replica; change master to MASTER_HOST = '<paymentsdb origin server fqdn>'; start replica;
- update $payments_wiki_db_writer in puppet/hieradata/site/codfw.yaml
- deploy the puppet change
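When several replicas need repointing, it can help to print the exact statements per host before typing them. A minimal sketch that only builds the statement text from the paymentsdb step above; it does not connect to any database. The function names and the `.example` hostnames are hypothetical:

```python
def repoint_statements(origin_fqdn: str) -> list[str]:
    """Return the statements to run on one replica, in order."""
    # Reject values that would break the quoted SQL string literal.
    if not origin_fqdn or "'" in origin_fqdn:
        raise ValueError("origin FQDN must be non-empty and contain no quotes")
    return [
        "STOP REPLICA;",
        f"CHANGE MASTER TO MASTER_HOST = '{origin_fqdn}';",
        "START REPLICA;",
    ]


def cutover_plan(origin_fqdn: str, hosts: list[str]) -> dict[str, list[str]]:
    """Map every host except the new origin to its repointing statements."""
    return {
        host: repoint_statements(origin_fqdn)
        for host in hosts
        if host != origin_fqdn
    }


if __name__ == "__main__":
    # Placeholder hostnames for illustration only.
    plan = cutover_plan(
        "payments-a.example",
        ["payments-a.example", "payments-b.example", "payments-c.example"],
    )
    for host, statements in plan.items():
        print(host)
        for statement in statements:
            print("  " + statement)
```

The new origin is skipped on purpose, since replication is stopped and reset there rather than repointed.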
- Promote a codfw database server to fundraisingdb mariadb origin
- stop replication client on each frdb server
- MariaDB [(none)]> stop replica;
- choose a database to promote to origin; if the servers aren't at the same master log position, you'll need the one that is ahead
- MariaDB [(none)]> show replica status\G
- disable replication client and enable writes on new origin server
- MariaDB [(none)]> reset replica all;
- MariaDB [(none)]> set global read_only = off;
- adjust replication client on each replica accordingly
- MariaDB [(none)]> change master to MASTER_HOST = '<fundraisingdb origin server fqdn>'; start replica;
- test replication
- update $fundraising_db_origin in puppet/hieradata/site/common.yaml
- deploy the puppet changes
- Switch production/external DNS service hostnames to point to codfw servers
- (in prod dns repository) dns/templates/wikimedia.org
- fundraisingdb-write 5M IN CNAME frdb2001.frack.codfw.wmnet.
- fundraisingdb-read 5M IN CNAME frdb2002.frack.codfw.wmnet.
- deploy the DNS changes
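Choosing the frdb server that is ahead means comparing the executed position that each one reports in `show replica status\G`. A minimal sketch, assuming the output has been captured as text for each host, that all replicas were replicating from the same origin, and that binary log file names end in a numeric index; it reads the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields. The helper names are hypothetical:

```python
def parse_status(text: str) -> dict:
    """Parse the 'Key: value' lines of a \\G-formatted status dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # The row separator line contains no colon and is skipped here.
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def executed_position(fields: dict) -> tuple:
    """Return (binlog file index, byte offset) of the last executed event."""
    log_file = fields["Relay_Master_Log_File"]
    # Assumes a name such as 'mysql-bin.000042'; the suffix is the file index.
    index = int(log_file.rsplit(".", 1)[1])
    return index, int(fields["Exec_Master_Log_Pos"])


def most_advanced(statuses: dict) -> str:
    """Given {host: status text}, return the host that has executed the most."""
    return max(
        statuses,
        key=lambda host: executed_position(parse_status(statuses[host])),
    )
```

Tuples compare element by element, so a higher log file index wins before the byte offset is considered. If the replicas had different origins, or GTID-based replication is in use, this comparison would not be meaningful as written.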
- Promote the codfw frdb-analytics server to frdb-analytics mariadb origin
- stop replication client and enable writes on codfw frdb-analytics server
- MariaDB [(none)]> stop replica;
- MariaDB [(none)]> set global read_only = off;
- update $analytics_db_writer in puppet/hieradata/site/common.yaml
- deploy the puppet changes
- Restore Grafana content on the codfw frmon server
- Grafana stores its dashboards in a sqlite database at /var/lib/grafana/grafana.db, and the codfw frmon server has a recent copy in its snapshot tree.
- stop grafana server
- systemctl stop grafana-server.service
- refresh grafana database
- rsync -var --delete /srv/snapshots/{eqiad frmon}/{snapshot date}/var/lib/grafana/ /var/lib/grafana/
- start grafana server
- systemctl start grafana-server.service
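Because the dashboards live in a single sqlite file, the restored copy can be sanity-checked after the rsync and before Grafana is started again. This check is not part of the original procedure; it is a minimal sketch using SQLite's built-in `PRAGMA integrity_check`, and `grafana_db_ok` is a hypothetical helper name:

```python
import sqlite3


def grafana_db_ok(path: str) -> bool:
    """Return True if SQLite reports the database file as intact."""
    # Open read-only so the check cannot modify the restored file.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields exactly one row containing 'ok'.
    return rows == [("ok",)]


if __name__ == "__main__":
    print(grafana_db_ok("/var/lib/grafana/grafana.db"))
```

A missing file raises `sqlite3.OperationalError` rather than returning False, which makes a failed rsync easy to tell apart from a corrupt database.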
- Evaluate the situation.
- Once puppet and DNS changes propagate, we are able to reenable payments and payments-listener and put campaigns back online, but whether or not that makes sense depends on the situation. With eqiad down, data redundancy is reduced to less than half the machines all at one geographic location. We need to be confident in our ability to get the backend back online quickly, so we don't generate a queue backlog that is impractical to consume.
- [maybe] Take payments and payments-listener out of maintenance mode
- update $down_for_maintenance in puppet/hieradata/site/common.yaml
- deploy puppet changes
- Adjust incoming mail routing of fundraising domains in [production]puppet/modules/role/templates/exim/exim4.conf.mx.erb
- Adjust central logging configuration to exclude missing eqiad servers
- update $syslog_servers in puppet/hieradata/site/*.yaml
- update $siem_log_servers in puppet/hieradata/site/common.yaml
- deploy the puppet changes
- Take the codfw civicrm server out of maintenance mode
- update $down_for_maintenance in puppet/hieradata/site/common.yaml
- deploy the puppet change
- Migrate civicrm process-control jobs to codfw civicrm server (to be handled by FR-Tech)
- make sure puppet has process-control enabled
- enable jobs individually in localsettings, remembering to adjust CIVICRM_SMTP_HOST where it is used
- deploy the process-control changes
- repeat until all jobs are running
- Promote codfw puppet server to puppet-ca
- FIX: needs testing, see Puppet_CA_replacement
- Adjust backups
- Remove eqiad hosts from puppet/modules/fundraising/templates/archive_sync.erb
- Evaluate puppet/modules/fundraising/templates/archive_purge to make sure it doesn't purge any data we will need
- Adjust $nagios_masters in puppet/hieradata/site/common.yaml as necessary
- Restore FundraisingCA and fundraising cert stores from backups to the new primary frpm server
- fetch a backup from the eqiad frpm server from the file archive on frlog or frbackup
- decrypt using one of the fr_tech_ops gpg keys
- restore /etc/ssl/fundraisingCA and /etc/ssl/frack_certs_backup
- update $ssl_ca_notification_host in puppet/hieradata/site/common.yaml
- deploy the puppet changes
- Restore backups to codfw analytics application server
- FIX: sync /home, /var/lib/git, /srv so we don't have to restore it
- Obtain a new frdev database and application server, restore from backups