User:Giuseppe Lavagetto/Switch Dc 2017

Per-service switchover instructions

We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while the subtasks of a task are to be executed sequentially. The phase number is referred to in the names of the tasks in operations/switchdc [1].
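The execution model described above can be sketched as follows. This is only an illustration, not the switchdc implementation: `run_phase` and `run_switchover` are hypothetical names, and tasks are modelled as plain Python callables.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phase(tasks):
    """Run one phase: top-level tasks in parallel, subtasks of a task in order.

    `tasks` maps a task name to a list of zero-argument callables (its subtasks).
    """
    def run_task(subtasks):
        # Subtasks of a single task run one after another.
        return [subtask() for subtask in subtasks]

    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(run_task, subs) for name, subs in tasks.items()}
        # result() re-raises any exception, so a failed task aborts the phase.
        return {name: future.result() for name, future in futures.items()}


def run_switchover(phases):
    """Run the phases strictly in sequence; `phases` is a list of (name, tasks)."""
    results = []
    for phase_name, tasks in phases:
        results.append((phase_name, run_phase(tasks)))
    return results
```

A phase only starts once every task of the previous phase has returned, which mirrors the "phases sequentially, tasks in parallel" rule.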

Phase 0 - preparation

(days in advance) Warm up databases; see MariaDB/buffer_pool_dump.
(days in advance) Prepare puppet patches:
- Switch mw_primary [2]
- Add direct route to $new_site for all mw-related cache::app_directors [3]
- Comment-out direct route to $old_site in all mw-related cache::app_directors [4]
Disable puppet on all jobqueues/videoscalers and maintenance hosts, and the varnishes
Merge the mediawiki-config switchover changes but don't sync them. This is not covered by the switchdc script.
Reduce the TTL on appservers-rw, api-rw, imagescaler-rw to 10 seconds

Phase 1 - stop maintenance

Stop jobqueues in the active site
Kill all the cronjobs on the maintenance host in the active site

Phase 2 - read-only mode

Go to read-only mode by syncing wmf-config/db-$old-site.php

Phase 3 - lock down database masters

Put old-site core DB masters (shards: s1-s7, x1, es2-es3) in read-only mode.
Wait for the new site's databases to catch up replication
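The wait step can be pictured as a polling loop with a timeout. The sketch below is an assumption about how such a wait could look, not what switchdc does: `wait_for_catchup` is a hypothetical helper, and `get_lag_seconds` is a caller-supplied function (for example one that reads the replication lag of a new-site master). A production check would more likely compare replication positions than rely on a lag value alone.

```python
import time


def wait_for_catchup(get_lag_seconds, timeout=300.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the reported replication lag is zero or the timeout expires.

    Returns True if the replica caught up, False on timeout.
    A lag of None (replication not running) is treated as "not caught up".
    """
    deadline = clock() + timeout
    while True:
        lag = get_lag_seconds()
        if lag is not None and lag <= 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Because the old-site masters are already read-only at this point, the lag can only shrink, so the loop terminates as soon as replication is healthy.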

Phase 4.1 - Wipe caches

Wipe new site's memcached to prevent stale values — only once the new site's read-only master/slaves are caught up.
Restart all HHVM servers in the new site to clear the APC cache

Phase 4.2 - Warmup caches in the new site

This phase is executed by the t04_cache_wipe task of switchdc together with phase 4.1: there is no speed gain from running the two phases as separate tasks, and they are logically related.

Warm up memcached and APC by running mediawiki-cache-warmup on the new site clusters, specifically:
- The global warmup against the appservers cluster
- The apc-warmup against all hosts in the appservers and api clusters at least.

Phase 5 - switch active datacenter configuration

Merge the switch of $mw_primary at this point and add the direct route to $new_site for all mw-related cache::app_directors. Both changes can be puppet-merged together. This is not covered by the switchdc script. (Puppet is only involved in managing traffic, db alerts, and the jobrunners.)
Switch the discovery
- Flip appservers-rw, api-rw, imagescaler-rw to pooled=true in the new site. This will not actually change the DNS records, but the on-disk redis config will change.
- Deploy wmf-config/ConfigSettings.php changes to switch the datacenter in MediaWiki
- Flip appservers-rw, api-rw, imagescaler-rw to pooled=false in the old site. After this, DNS will be changed and internal applications will start hitting the new DC
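The ordering of the three discovery sub-steps can be illustrated with a toy model. Everything here is an assumption made for illustration: the `pooled` dictionary, `resolve` and `switch_discovery` are hypothetical, and the resolution rule simply encodes what the text states (records keep pointing to the old site until it is depooled).

```python
SERVICES = ("appservers-rw", "api-rw", "imagescaler-rw")


def resolve(pooled, service, old_site, new_site):
    """Toy rule from the text: a record moves only once the old site is depooled."""
    if pooled[(service, old_site)]:
        return old_site
    if pooled[(service, new_site)]:
        return new_site
    raise RuntimeError(f"{service} has no pooled site")


def switch_discovery(pooled, old_site, new_site, deploy_mediawiki_config):
    """Apply the three sub-steps in the documented order."""
    # 1. Pool the new site; records do not move yet.
    for service in SERVICES:
        pooled[(service, new_site)] = True
    # 2. Deploy the MediaWiki configuration change while both sites are pooled.
    deploy_mediawiki_config()
    # 3. Depool the old site; records now point to the new site.
    for service in SERVICES:
        pooled[(service, old_site)] = False
    return pooled
```

In this model every service has at least one pooled site at each step, so `resolve` never fails during the switch.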

Phase 6 - apply configuration

Switch the live redis configuration. This can be either scripted, or all redises can be restarted (first in the new site, then in the old one). Verify redises are indeed replicating correctly.
Run puppet on the text caches in $new_site and $old_site. This starts the PII leak [TODO: check with traffic for the whole procedure]

Phase 7 - Set new site's databases to read-write

Set new-site's core DB masters (shards: s1-s7, x1, es2-es3) in read-write mode.

Phase 8 - Set MediaWiki to read-write

Deploy mediawiki-config wmf-config/db-$new-site.php with all shards set to read-write

Phase 9 - post read-only

Start the jobqueue in the new site by running puppet there (mw_primary controls it)
Run puppet on the maintenance hosts (mw_primary controls it)
Update DNS records for new database masters
Update tendril for new database masters
Set the TTL for the DNS records to 300 seconds again.
Varnish final reconfiguration:
- Merge the second traffic puppet patch, which comments out the direct route to $old_site in all mw-related cache::app_directors. This is not covered by the switchdc script.
- Run puppet on all the cache nodes in $old_site; this ends the PII leak.
[Optional] Run the script to fix broken wikidata entities on the maintenance host of the active datacenter (this is not covered by the switchdc script): sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force

Phase 10 - verification and troubleshooting

Make sure reading & editing works! :)
Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, RCStream and the IRC feeds)
Make sure email works (exim4 -bp on mx1001/mx2001, test an email)

Media storage/Swift

Ahead of the switchover, originals and thumbs

MediaWiki: Write synchronously to both sites with ~~https://gerrit.wikimedia.org/r/#/c/282888/~~ https://gerrit.wikimedia.org/r/284652
Cache->app: Change varnish backends for swift and swift_thumbs to point to new site with ~~https://gerrit.wikimedia.org/r/#/c/282890/~~ https://gerrit.wikimedia.org/r/284651
1. Force a puppet run on cache_upload in both sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
Inter-Cache: Switch new site from active site to 'direct' in cache::route_table for upload ~~https://gerrit.wikimedia.org/r/#/c/282891/~~ https://gerrit.wikimedia.org/r/284650
1. Force a puppet run on cache_upload in new site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
Users: De-pool active site in GeoDNS ~~https://gerrit.wikimedia.org/r/#/c/283416/~~ https://gerrit.wikimedia.org/r/#/c/284694/ + authdns-update
Inter-Cache: Switch all caching sites currently pointing at the active site to the new site in cache::route_table for upload ~~https://gerrit.wikimedia.org/r/#/c/283418/~~ https://gerrit.wikimedia.org/r/284649
1. Force a puppet run on cache_upload in caching sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
Inter-Cache: Switch active site from 'direct' to new site in cache::route_table for upload ~~https://gerrit.wikimedia.org/r/#/c/282892/~~ https://gerrit.wikimedia.org/r/284648
1. Force a puppet run on cache_upload in active site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'
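The forced puppet runs above all follow one pattern: a salt compound target built from a cluster grain and one or more site grains. A small helper can assemble the command line; `puppet_run_command` is a hypothetical name and is not part of any existing tooling, it only prints the same strings used on this page.

```python
import shlex


def puppet_run_command(cluster, sites, timeout=10, batch=17):
    """Build the salt command that forces a puppet run on one cache cluster."""
    site_terms = " or ".join(f"G@site:{site}" for site in sites)
    if len(sites) > 1:
        # Several sites need grouping so that "and" applies to the whole set.
        site_terms = f"( {site_terms} )"
    target = f"G@cluster:{cluster} and {site_terms}"
    argv = ["salt", "-v", "-t", str(timeout), "-b", str(batch),
            "-C", target, "cmd.run", "puppet agent --test"]
    # shlex.quote adds single quotes only around arguments that need them.
    return " ".join(shlex.quote(arg) for arg in argv)


print(puppet_run_command("cache_upload", ["eqiad", "codfw"]))
```

Generating the command from the cluster and site names avoids hand-editing the compound target between steps.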

Switching back

Repeat the steps above in reverse order, with suitable revert commits.

ElasticSearch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php. The usual default value is "local", which means that if MediaWiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.

To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.

Traffic

GeoDNS user routing

Traffic-layer only, no interdependencies elsewhere
Granularity is per-cache-cluster (misc, maps, text, upload)
Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#GeoDNS

Inter-Cache routing

Traffic-layer only, no interdependencies elsewhere
Granularity is per-cache-cluster (misc, maps, text, upload)
Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Inter-Cache_Routing

Cache->App routing

Normally will have inter-dependencies with application-level work
Granularity is per-application-service (how they're defined at the back end of varnish)
Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Cache-to-Application_Routing

Specifics for Switchover Test Week

Once all the applayer services we plan to switch have been switched successfully, we'll move user and inter-cache traffic away from eqiad:

The Upload cluster will be following similar instructions on the 14th during the Swift switch.
Maps and Misc clusters are not participating (low traffic, special issues, validated by the other moves)
This leaves just the text cluster to operate on below:

Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283430
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:codfw and G@cluster:cache_text' cmd.run 'puppet agent --test'
Users: De-pool eqiad in GeoDNS for the text cluster.
- https://gerrit.wikimedia.org/r/283433
- authdns-update on any one of the authdns servers (radon, baham, eeden)
Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283431
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:esams and G@cluster:cache_text' cmd.run 'puppet agent --test'
Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283432
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:eqiad and G@cluster:cache_text' cmd.run 'puppet agent --test'
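The three inter-cache steps above can be replayed on a toy copy of the route table to check that no intermediate state contains a routing loop. This is a simplified model under stated assumptions: only the three sites named in the steps are included, the starting state is inferred from the "from" values in the steps, and `resolve_route` and `apply_steps` are hypothetical helpers. The GeoDNS step is left out because it does not touch the route table.

```python
START = {"eqiad": "direct", "codfw": "eqiad", "esams": "eqiad"}
STEPS = [("codfw", "direct"), ("esams", "codfw"), ("eqiad", "codfw")]


def resolve_route(route_table, site):
    """Follow route_table hops from `site` until 'direct'; fail on a loop."""
    path = [site]
    while route_table[path[-1]] != "direct":
        next_hop = route_table[path[-1]]
        if next_hop in path:
            raise ValueError("routing loop: " + " -> ".join(path + [next_hop]))
        path.append(next_hop)
    return path


def apply_steps(route_table, steps):
    """Apply (site, new_target) changes one by one, validating every state."""
    table = dict(route_table)
    states = []
    for site, target in steps:
        table[site] = target
        for origin in table:
            resolve_route(table, origin)  # raises if this state has a loop
        states.append(dict(table))
    return states


for state in apply_steps(START, STEPS):
    print(state)
```

In this model, applying the last step first (eqiad to 'codfw' while codfw still points at 'eqiad') raises a routing-loop error, which is one way to see why codfw is switched to 'direct' before anything is pointed at it.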

Before reverting applayer services to eqiad, we'll undo the above steps in reverse order:

Inter-Cache: Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284687
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:eqiad and G@cluster:cache_text' cmd.run 'puppet agent --test'
Inter-Cache: Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284688
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:esams and G@cluster:cache_text' cmd.run 'puppet agent --test'
Users: Re-pool eqiad in GeoDNS.
- https://gerrit.wikimedia.org/r/284692
- authdns-update on any one of the authdns servers (radon, baham, eeden)
Inter-Cache: Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284689
- Force a puppet run on affected caches:
- salt -v -t 10 -b 17 -C 'G@site:codfw and G@cluster:cache_text' cmd.run 'puppet agent --test'
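The revert list is the forward list read backwards with each change inverted. A minimal sketch, assuming each step is recorded as a hypothetical `(kind, site, old_value, new_value)` tuple:

```python
FORWARD = [
    ("route", "codfw", "eqiad", "direct"),
    ("geodns", "eqiad", "pooled", "depooled"),
    ("route", "esams", "eqiad", "codfw"),
    ("route", "eqiad", "direct", "codfw"),
]


def revert_plan(steps):
    """Reverse the step order and swap old/new values of every step."""
    return [(kind, site, new, old) for (kind, site, old, new) in reversed(steps)]


for step in revert_plan(FORWARD):
    print(step)
```

The output lists eqiad back to 'direct' first and codfw back to 'eqiad' last, in the same order as the revert steps above.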

Services

RESTBase and Parsoid already active in codfw, using eqiad MW API.
Shift traffic to codfw:
- Public traffic: Update Varnish backend config.
- Update RESTBase and Flow configs in mediawiki-config to use codfw.
During MW switch-over:
- Update RESTBase and Parsoid to use MW API in codfw, either using puppet / Parsoid deploy, or DNS. See https://phabricator.wikimedia.org/T125069.

Tracker / checklist

Other miscellaneous

Deployment server
EventLogging
IRC/RCstream/EventStreams