Switch Datacenter


Introduction

A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.

Schedule for 2017 switch

See phab:T138810 for tasks to be undertaken during the switch

  • Elasticsearch: automatically follows the MediaWiki switch
  • Services: Tuesday, April 18th 2017 14:30 UTC
  • Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
  • Traffic: Tuesday, April 18th 2017 19:00 UTC
  • MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
  • Deployment server: Wednesday, April 19th 2017 16:00 UTC

Switching back

  • Traffic: Pre-switchback in two phases: Monday, May 1 and Tuesday, May 2 (to avoid cold-cache issues on Wednesday)
  • MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
  • Elasticsearch: automatically follows the MediaWiki switch
  • Services: Thursday, May 4th 2017 14:30 UTC
  • Swift: Thursday, May 4th 2017 15:30 UTC
  • Deployment server: Thursday, May 4th 2017 16:00 UTC

Per-service switchover instructions

MediaWiki

We divide the process into logical phases that must be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks must be executed sequentially. The phase number is referred to in the names of the tasks in operations/switchdc [1].
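The execution model described above (sequential phases, parallel top-level tasks, sequential subtasks) can be sketched as follows. This is a minimal illustration of the ordering rules, not the actual switchdc code:

```python
# Sketch of the switchover execution model: phases run strictly in order;
# within a phase, top-level tasks may run in parallel; the subtasks of one
# task run one after another.
from concurrent.futures import ThreadPoolExecutor

def run_task(subtasks):
    # Subtasks of a single top-level task execute sequentially.
    results = []
    for subtask in subtasks:
        results.append(subtask())
    return results

def run_phase(tasks):
    # Top-level tasks of one phase may execute in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_task, tasks))

def run_switchover(phases):
    # Phases themselves are strictly sequential.
    return [run_phase(tasks) for tasks in phases]
```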

Days in advance preparation

  1. Warm up databases; see MariaDB/buffer pool dump.
  2. Prepare puppet patches:
  3. Prepare the mediawiki-config patch or patches (eqiad->codfw; codfw->eqiad)

Stage 0 - preparation

  1. Disable puppet on all MediaWiki jobqueues/videoscalers and maintenance hosts and cache::text in both eqiad and codfw. switchdc t00_disable_puppet.py sample output
  2. Merge the mediawiki-config switchover changes but don't sync. This is not covered by the switchdc script.
  3. Stop swiftrepl on ms-fe1005. This is not covered by the switchdc script.
  4. Reduce the TTL on appservers-rw, api-rw, imagescaler-rw to 10 seconds: switchdc t00_reduce_ttl.py sample output
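The TTL dance in step 4 (lowered before the switchover, restored in phase 9) can be pictured with the hypothetical sketch below. The record names come from this page; the record structure and functions are invented for illustration:

```python
# Hypothetical model of the TTL reduction/restoration around the read-only
# window. Lower TTLs mean clients re-resolve quickly when records flip.
LOW_TTL = 10      # seconds, during the switchover window
NORMAL_TTL = 300  # seconds, restored in phase 9

def reduce_ttl(records):
    """Return a copy of the discovery records with TTLs lowered."""
    return {name: dict(rec, ttl=LOW_TTL) for name, rec in records.items()}

def restore_ttl(records):
    """Return a copy of the discovery records with normal TTLs."""
    return {name: dict(rec, ttl=NORMAL_TTL) for name, rec in records.items()}
```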

Phase 1 - stop maintenance

  1. Stop jobqueues in the active site and kill all the cronjobs on the maintenance host in the active site: switchdc t01_stop_maintenance.py sample output

Phase 2 - read-only mode

  1. Go to read-only mode by syncing wmf-config/db-$old-site.php: switchdc t02_start_mediawiki_readonly.py sample output
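Conceptually, syncing wmf-config/db-$old-site.php flips every shard to read-only with a user-visible reason. The sketch below models that effect only; the dict layout is invented (the real file is PHP configuration):

```python
# Illustrative model of the read-only flip performed by syncing the db
# config: every shard gets a read-only reason shown to editors.
SHARDS = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 'x1', 'es2', 'es3']

READ_ONLY_REASON = 'Maintenance: switching primary datacenter.'

def set_read_only(db_config, reason=READ_ONLY_REASON):
    """Return a copy of the per-shard config with all shards read-only."""
    return {shard: dict(cfg, readOnly=reason) for shard, cfg in db_config.items()}
```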

Phase 3 - lock down database masters

  1. Put old-site core DB masters (shards: s1-s7, x1, es2-es3) in read-only mode: switchdc t03_coredb_masters_readonly.py sample output

Phase 4 - wipe and warm up caches in the new site

  1. All the following tasks are performed by switchdc t04_cache_wipe.py sample output
    1. Wait for the new site's databases to catch up on replication
    2. Wipe the new site's memcached to prevent stale values, but only once the new site's read-only masters/slaves have caught up
    3. Restart all HHVM servers in the new site to clear the APC cache
    4. Warm up memcached and APC by running mediawiki-cache-warmup on the new site's clusters, specifically:
      • The global warmup against the appservers cluster
      • The apc-warmup against all hosts in the appservers cluster.
  2. Resync redises in the destination datacenter using switchdc t04_resync_redis.py
  3. Merge and puppet-merge the traffic change for text caches (eqiad->codfw; codfw->eqiad). This is not covered by the switchdc script.
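The two warmup modes in the task above differ in how URLs are fanned out. A rough sketch, with placeholder hostnames and URLs (the real tool is mediawiki-cache-warmup):

```python
# Sketch of the two warmup strategies: "global" populates shared caches
# (memcached), so each URL needs to be fetched only once somewhere in the
# cluster; "apc" populates the per-host APC cache, so every URL must be
# fetched on every host.
def build_warmup_requests(hosts, urls, mode):
    """Return (host, url) pairs to fetch for the given warmup mode."""
    if mode == 'global':
        # Spread URLs round-robin over the cluster.
        return [(hosts[i % len(hosts)], url) for i, url in enumerate(urls)]
    if mode == 'apc':
        # Every host must see every URL.
        return [(host, url) for host in hosts for url in urls]
    raise ValueError('unknown warmup mode: %s' % mode)
```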

Phase 5 - switch active datacenter configuration

  1. Send the traffic layer to active-active: switchdc t05_switch_traffic.py sample output
    • enable and run puppet on cache::text in $new_site. This starts the active-active traffic phase (traffic will go to both MW clusters)
    • ensure that the change was applied on all hosts in $new_site
    • Run puppet on the text caches in $old_site. This ends the active-active phase.
  2. Merge the switch of $mw_primary at this point. This change can actually be puppet-merged together with the varnish one. This is not covered by the switchdc script. (Puppet is only involved in managing traffic, db alerts, and the jobrunners).
  3. Switch the discovery: switchdc t05_switch_datacenter sample output
    • Flip appservers-rw, api-rw, imagescaler-rw to pooled=true in the new site. This will not actually change the DNS records, but the on-disk redis config will change.
    • Deploy wmf-config/CommonSettings.php changes to switch the datacenter in MediaWiki
    • Flip appservers-rw, api-rw, imagescaler-rw to pooled=false in the old site. After this, DNS will be changed and internal applications will start hitting the new DC
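The pool-then-depool ordering in step 3 guarantees that at least one datacenter is pooled for each record at every moment. A sketch with the record names from this page and an invented in-memory state standing in for conftool/discovery:

```python
# Sketch of the discovery flip: pool the new site first (DNS unchanged,
# both sites pooled), then depool the old site (DNS moves to the new site).
RECORDS = ['appservers-rw', 'api-rw', 'imagescaler-rw']

def switch_discovery(state, old_site, new_site):
    # 1. Pool the new site. DNS does not change yet: the old site is
    #    still pooled, so we are briefly active-active.
    for record in RECORDS:
        state[(record, new_site)] = True
    # 2. Only then depool the old site; internal applications start
    #    hitting the new DC as DNS changes.
    for record in RECORDS:
        state[(record, old_site)] = False
    return state
```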

Phase 6 - Redis replicas

  1. Switch the live redis configuration. Verify redises are indeed replicating correctly: switchdc t06_redis sample output

Phase 7 - Set new site's databases to read-write

  1. Set new-site's core DB masters (shards: s1-s7, x1, es2-es3) in read-write mode: switchdc t07_coredb_masters_readwrite.py sample output
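On each of the listed core masters, this phase amounts to clearing the MariaDB read_only flag. Sketched below as the per-shard SQL it would issue (the real script uses its own tooling; this is only an illustration):

```python
# Generate the statement run against each new-site core master to make it
# writable. The shard list matches the one given in phases 3 and 7.
SHARDS = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 'x1', 'es2', 'es3']

def read_write_statements(shards=SHARDS):
    """One SET GLOBAL read_only statement per core master."""
    return ['SET GLOBAL read_only = 0; -- master of ' + s for s in shards]
```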

Phase 8 - Set MediaWiki to read-write

  1. Deploy mediawiki-config wmf-config/db-$new-site.php with all shards set to read-write: switchdc t08_stop_mediawiki_readonly.py sample output

Phase 9 - post read-only

  1. Start maintenance in the new DC: switchdc t09_start_maintenance.py sample output
    1. Start the jobqueue in the new site by running puppet there (mw_primary controls it)
    2. Run puppet on the maintenance hosts (mw_primary controls it)
  2. Update tendril for new database masters: switchdc t09_tendril.py sample output
  3. Restart parsoid: switchdc t09_restart_parsoid.py sample output
  4. Set the TTL for the DNS records to 300 seconds again: switchdc t09_restore_ttl.py sample output
  5. Update DNS records for new database masters (deploying eqiad->codfw; codfw->eqiad). This is not covered by the switchdc script.
  6. Start swiftrepl on codfw. This is not covered by the switchdc script.
  7. [Optional] Run the script to fix broken wikidata entities on the maintenance host of the active datacenter: sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force This is not covered by the switchdc script.

Phase 10 - verification and troubleshooting

  1. Make sure reading & editing works! :)
  2. Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, RCStream and the IRC feeds)
  3. Make sure email works (exim4 -bp on mx1001/mx2001, test an email)
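One quick way to confirm the wikis are writable again is the MediaWiki API's siteinfo query, whose "general" section carries a readonly flag while a wiki is read-only (and omits it otherwise). A sketch of such a check, using a trimmed sample response rather than live output:

```python
# Probe helper: parse a meta=siteinfo API response and report whether the
# wiki advertises itself as read-only. The sample JSON below is invented.
import json

def is_read_only(siteinfo_json):
    """True if the siteinfo response carries the readonly flag."""
    general = json.loads(siteinfo_json)['query']['general']
    return 'readonly' in general

sample = '{"query": {"general": {"sitename": "Wikipedia"}}}'
```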

Media storage/Swift

Switchover

  • Set temporary active/active for Swift
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347859/
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'R:class = role::cache::upload and ( *.eqiad.wmnet or *.codfw.wmnet )' 'run-puppet-agent'
  • The above must complete correctly and fully (applying the change)
  • Set Swift to active/passive in codfw only:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347860/
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'R:class = role::cache::upload and ( *.eqiad.wmnet or *.codfw.wmnet )' 'run-puppet-agent'

Switching back

Repeat the steps above in reverse order, with suitable revert commits.

ElasticSearch

CirrusSearch talks by default to the local datacenter ($wmfDatacenter). If MediaWiki switches datacenter, elasticsearch will automatically follow.

CirrusSearch can always be switched manually to a specific datacenter. Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php.

To ensure coherence in case of lost updates, reindex the pages modified during the switch by following Recovering from an Elasticsearch outage / interruption in updates.

Traffic

General information on generic procedures

https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing

Switchover

Inter-Cache Routing:

  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347613/
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'cp3*.esams.wmnet' 'run-puppet-agent -q'

GeoDNS (User-facing) Routing:

  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347616
  2. <any authdns node>: authdns-update

Switchback

Same procedures as above, with reversions of the commits specified. The switchback will happen in two stages over two days: first reverting the inter-cache routing, then the user routing, to minimize cold-cache issues.

Traffic - Services

For reference, the public-facing services involved which are confirmed active/active or failover-capable (other than MW and Swift, handled elsewhere):

  • cache_text:
    • restbase (active/passive for now)
    • cxserver (active/active)
    • citoid (active/active)
  • cache_maps:
    • kartotherian (active/passive for now)
  • cache_misc:
    • noc (active/active)
    • pybal_config (active/active)
    • wdqs (active/active)
    • ores (active/active)
    • eventstreams (active/active)

Switchover

  • Set temporary active/active for active/passive services above:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347852/
  2. <any frontend puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin '( cp1*.eqiad.wmnet or cp2*.codfw.wmnet ) and R:class ~ "(?i)role::cache::(maps|text)"' 'run-puppet-agent'
  • The above must complete correctly and fully (applying the change)
  • Set all active/active (including temps above) to active/passive in codfw only:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347853/
  2. <any frontend puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin '( cp1*.eqiad.wmnet or cp2*.codfw.wmnet ) and R:class ~ "(?i)role::cache::(misc|maps|text)"' 'run-puppet-agent'

Switchback

Reverse the above with reverted commits.

Services

All services are active-active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to a single site is the same for every one of them:

  1. reduce the TTL of the dns discovery records to 10 seconds
  2. depool the datacenter we're moving away from in confctl / discovery
  3. restore the original TTL
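The three steps above can be sketched as follows, with an invented in-memory record standing in for confctl / DNS discovery:

```python
# Sketch of the per-service discovery failover: lower TTL, depool the old
# site, restore TTL. State layout is invented for illustration.
def fail_over(discovery, service, from_dc, to_dc):
    rec = discovery[service]
    rec['ttl'] = 10                  # 1. reduce TTL so clients re-resolve fast
    rec['pooled'][from_dc] = False   # 2. depool the site we're moving away from
    # The destination must already be pooled, or the service goes dark.
    assert rec['pooled'][to_dc], 'destination datacenter must be pooled'
    rec['ttl'] = 300                 # 3. restore the original TTL
    return discovery
```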

Restbase is a bit of a special case and needs an additional step if we're just switching active traffic over rather than simulating a complete failover:

  1. pool restbase-async everywhere, then depool restbase-async in the newly active dc, so that async traffic is separated from real-user traffic as much as possible.

Other miscellaneous

Schedule of past switches

Schedule for 2016 switch

  • Deployment server: Wednesday, January 20th 2016
  • Traffic: Thursday, March 10th 2016
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
  • Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
  • Services: Monday, April 18th 2016, 10:00 UTC
  • MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back

  • MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done