Datacenter switchovers are a standard response to certain types of disasters. Technology companies regularly practice them to make sure that everything will work properly when the process is needed. Switching between datacenters also makes it easier to do some maintenance work on non-active servers (e.g. database upgrades/changes): while we're serving traffic from datacenter A, we can do all that work at datacenter B.
A Wikimedia datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component. SRE Service Operations maintains the process and software necessary to run the switchover.
The impact of the switchover (as of 2023-06-12) is expected to be 2-3 minutes of read-only for MediaWiki (this includes features built as MediaWiki extensions) and related software/infrastructure. This section documents, in broad strokes, that impact. None of the below applies to services/features/infrastructures not participating directly in the Switchover. Those are out of scope and continue to work as normal. However, if such components rely on MediaWiki indirectly (e.g. via some data pipeline) they might experience some minor impact, e.g. some delay in receiving events. This is expected.
MediaWiki goes read-only for 2 to 3 minutes during every Switchover. The process, in broad strokes, requires setting MediaWiki itself read-only and then the databases it uses read-only (non-MediaWiki databases, as of 2023-06-12, are not part of this read-only scheme). A deliberate amount of time passes between the two events, to allow the last in-flight edits to land safely. All read-only functionality (e.g. browsing articles) will continue to work as usual.
During the time frame in which the MediaWiki databases are set read-only, any kind of write request that reaches them (UPDATE, DELETE, INSERT in SQL terms) will be denied. If a feature somehow ignores the global MediaWiki read-only configuration and tries to write something to a MediaWiki database, it's not going to work for those 2 to 3 minutes. Across a year, assuming we do this twice per year, this is on the order of 0.001% of the time, which we believe is perfectly acceptable.
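As a rough sanity check of that figure: two switchovers per year at about 3 minutes of read-only each is roughly 6 minutes out of the ~525,600 minutes in a year, i.e. about 0.001% of the time.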
A complete run of the cookbook, from disabling puppet to re-enabling it, takes around 35 minutes if timed so that the read-only part falls at the start of the announced window. In an emergency it can be done faster, since there is no need to wait for a set time.
Around 2 to 3 minutes of this run are with MediaWiki read-only.
Excluding restbase, the complete run of the services cookbook takes approximately:
- 20 minutes for failing over completely from one datacenter to the other (leaving the origin datacenter depooled)
- 15 minutes to repool active/active services in the secondary datacenter
- 40 minutes to do an actual switchover (ending with the primary datacenter changed, and the secondary pooled for active/active services). This is because the cookbook doesn't currently support an actual switchover, only pooling and depooling datacenters, meaning we have to do two passes, one to fail over, and one to repool active/active services in the secondary datacenter.
Weeks in advance preparation
- 10 weeks before: Coordinate dates and communication plan with involved groups: Switch_Datacenter/Coordination
- 3 weeks before: Run a "live test" of the cookbooks by "switching" from the passive DC to the active DC. This live-test applies only to the sre.switchdc.mediawiki cookbook; use a dry-run for the other cookbooks. The --live-test flag will skip actions that could harm the active DC, or do them on the passive DC instead. This will exercise most of the code paths used in the switchover and help identify issues. This process will !log to SAL so you should coordinate with others, but otherwise should be non-disruptive. Due to changes since the last switchover you can expect code changes to become necessary, so take the time and assistance needed into account.
The database read-write step (06-set-db-readwrite) of the cookbook will fail if circular replication is not already enabled everywhere. It can be skipped if the live-test is run before circular replication is enabled, but it must be retested during Switch_Datacenter#Days_in_advance_preparation to ensure it is correctly set up.
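For illustration, a live-test run can be started much like the real invocation shown further down this page, with the --live-test flag added (the task ID is a placeholder; the datacenter arguments follow the same from-DC to-DC order as a real run, here assuming eqiad is the currently active site):
sudo cookbook sre.switchdc.mediawiki --live-test --ro-reason 'DC switchover live test (TXXXXXX)' codfw eqiad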
Overall switchover flow
In a controlled switchover we first deactivate services in the primary datacenter and then deactivate caching in that datacenter. The next step is to switch MediaWiki itself. About a week later we activate caching in the datacenter again, as we believe that a week of testing without caching there is sufficient.
Historically, until March 2023, the schedule looked like this:
- Monday 14:00 UTC Services
- Monday 15:00 UTC Caching (traffic)
- Tuesday 14:00 UTC MediaWiki
Starting March 2023, we only have 2 windows:
- Tuesday 14:00 UTC Services + Caching (traffic)
- Wednesday 14:00 UTC MediaWiki
As of September 2023, we will be running each datacenter as primary for half of the year.
Datacenter Switchovers will take place during the work week of the Solar Equinox.
We assume the Northward Solar Equinox happens on March 21st and the Southward Solar Equinox on September 21st; this intentionally does not match the exact astronomical event.
The read-only part of the Switchover, aka the MediaWiki Switchover, will always happen on the Wednesday of the above-mentioned week. During read-only, which lasts around 2 to 3 minutes, no wiki will accept edits and editors will see a warning message asking them to try again later. Read-only starts at 14:00 UTC. Readers should experience no changes for the entirety of the event.
The various non-read-only parts of the Switchover will always take place on the Tuesday before the read-only part. There is no set time of day for these, contrary to the above, as they are non-disruptive and carry much lower risk.
For the next 7 calendar days after the read-only phase of the Switchover, traffic will be flowing solely to one of the 2 data centers, effectively rendering the other datacenter inactive.
On the Wednesday following the read-only phase of the Switchover, i.e. exactly 7 days later, traffic will start flowing to both data centers again in the normal Multi-DC way. This period can be extended for secondary datacenter maintenance.
The concept of a Switchback, namely when we route all wiki edit traffic back to our Virginia data center (eqiad), will cease to exist. The 2 data centers will be considered coequal, alternating roles roughly every 6 months.
Per-service switchover instructions
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel to each other, while subtasks are to be executed sequentially. The phase number is referred to in the names of the tasks in the operations/cookbooks repository, in the cookbooks/sre/switchdc/mediawiki/ path.
Days in advance preparation
- OPTIONAL: SKIP IN AN EMERGENCY: Make sure databases are in a good state. Normally this requires no action, as the passive datacenter databases are always prepared to receive traffic, so there are no actionables. Some things that DBAs should normally check to ensure the most optimal state possible (sanity checks):
- There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool those servers that are not ready.
- Replication is flowing from eqiad -> codfw and from codfw -> eqiad (replication is usually stopped in the passive -> active direction to facilitate maintenance)
- All database servers have their buffer pools filled up. This is taken care of automatically by the automatic buffer pool warmup functionality. As a sanity check, some sample load could be sent to the MediaWiki application servers to check that requests complete as quickly as in the active datacenter (see the sketch after this list).
- These were the things we prepared/checked for the 2018 switch:
- Check appserver weights on servers in the passive DC, make sure that newer hardware is weighted higher (usually 30) and older hardware is less (usually 25)
- Run the sre.switchdc.mediawiki step 06-set-db-readwrite in live-test mode back to back once circular replication is enabled. This is important to confirm that circular replication is correctly set up.
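For the buffer pool sanity check mentioned above, a rough sketch (the host name and URL are illustrative placeholders, not a prescribed target) is to time a few requests directly against an application server in the passive datacenter and compare with the active one:
# Hypothetical spot check: time a few requests against a passive-DC appserver
# (replace APPSERVER with a real application server in the passive site)
for i in 1 2 3; do
  curl -sk -o /dev/null -w '%{time_total}s\n' -H 'Host: en.wikipedia.org' 'https://APPSERVER.codfw.wmnet/wiki/Special:Random'
done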
|Start the following steps about half an hour to an hour before the scheduled switchover time, in a tmux or a screen.|
The best way to run this multi-step cookbook is to start it in interactive mode from the cookbook root:
sudo cookbook sre.switchdc.mediawiki --ro-reason 'DC switchover (TXXXXXX)' codfw eqiad
and proceed through the steps
Phase 0 - preparation
- Add a scheduled maintenance on StatusPage (Maintenances -> Schedule Maintenance). This is not covered by the switchdc script.
- Add a scap lock on the deployment server: scap lock --all "Datacenter Switchover - T12345". Do this in another tmux window, as it will stay there for you to unlock at the end of the procedure. This is not covered by the switchdc script.
- Disable puppet on maintenance hosts in both eqiad and codfw: 00-disable-puppet.py
- Reduce the TTL on appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php to 10 seconds: 00-reduce-ttl.py. Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1; the cookbook should force you to wait. (A quick verification sketch follows this list.)
- Optional: Warm up APC by running the mediawiki-cache-warmup on the new site's clusters. The warmup queries will repeat automatically until the response times stabilize: 00-warmup-caches.py
- The global "urls-cluster" warmup against the appservers cluster
- The "urls-server" warmup against all hosts in the appservers cluster.
- The "urls-server" warmup against all hosts in the api-appservers cluster.
- Set downtime for the read-only checks on the MariaDB masters changed in Phase 3, so they don't page: 00-downtime-db-readonly-checks.py
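A quick way to confirm that the lowered TTL has taken effect is to inspect a dig answer from a production host (a sketch; the record name assumes the usual <service>.discovery.wmnet naming, and the same check applies to the other records):
# The second column of the answer is the remaining TTL in seconds; it should be at most 10
dig +noall +answer appservers-rw.discovery.wmnet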
|Stop for GO/NOGO|
Phase 1 - stop maintenance
- Stop maintenance jobs in both datacenters and kill all the periodic jobs (systemd timers) on maintenance hosts in both datacenters: 01-stop-maintenance.py
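To double-check that the periodic jobs really are gone on a maintenance host, you can list the remaining timers (a sketch; the mediawiki_job_* timer naming is an assumption):
# On a maintenance host: no MediaWiki job timers should remain active
sudo systemctl list-timers 'mediawiki_job_*'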
|Stop for final GO/NOGO before read-only.|
The following steps until Phase 7 need to be executed in quick succession to minimize read-only time
Phase 2 - read-only mode
- Go to read-only mode by changing the ReadOnly conftool value: 02-set-readonly.py
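The cookbook handles this step itself; conceptually it amounts to flipping the ReadOnly object in conftool, roughly along the following lines (the object selector and value semantics here are assumptions, not copied from the cookbook):
# Hypothetical manual equivalent: mark the old primary's MediaWiki config read-only
sudo confctl --object-type mwconfig select 'name=ReadOnly,scope=eqiad' set/val=true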
Phase 3 - lock down database masters
- Put old-site core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and wait for the new site's databases to catch up replication: 03-set-db-readonly.py
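In essence this step is standard MariaDB: the old-site masters are flipped to read-only and the cookbook waits until the new-site counterparts have caught up on replication. A hand-rolled sketch of the equivalent checks (the cookbook uses its own tooling, not these exact commands):
# On an old-site master: make it read-only
sudo mysql -e "SET GLOBAL read_only = 1"
# On the corresponding new-site master: confirm replication has caught up
sudo mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master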
Phase 4 - switch active datacenter configuration
- Switch the discovery records and MediaWiki active datacenter: 04-switch-mediawiki.py
- Set appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php to pooled=true in the new site. Since both sites are now pooled in etcd, this will not actually change the DNS records for the active datacenter.
- Switch WMFMasterDatacenter from the old site to the new.
- Set appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php to pooled=false in the old site. After this, DNS will be changed for the old DC and internal applications (except MediaWiki) will start hitting the new DC.
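For reference, the discovery (de)pooling performed by 04-switch-mediawiki.py corresponds to confctl operations on the discovery objects, roughly like the following sketch (assuming a switchover from eqiad to codfw; the cookbook iterates over all the records listed above and handles DNS caches itself):
# Pool the new site for one of the MediaWiki discovery records...
sudo confctl --object-type discovery select 'dnsdisc=appservers-rw,name=codfw' set/pooled=true
# ...and then depool the old site
sudo confctl --object-type discovery select 'dnsdisc=appservers-rw,name=eqiad' set/pooled=false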
Phase 5 - DEPRECATED - Invert Redis replication for MediaWiki sessions
This information is outdated. Redis is not used for MW sessions anymore. Cookbook has been removed, go directly to Phase 6 (last update: 2023)
Phase 6 - Set new site's databases to read-write
- Set new-site's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode: 06-set-db-readwrite.py
Phase 7 - Set MediaWiki to read-write
- Go to read-write mode by changing the ReadOnly conftool value: 07-set-readwrite.py
|You are now out of read-only mode|
Phase 8 - Restore rest of MediaWiki
- Restart Envoy on the jobrunners that are now inactive, to trigger changeprop to re-resolve the DNS name and connect to the new DC: 08-restart-envoy-on-jobrunners.py
- A steady rate of 500s is expected until this step is completed, because changeprop will still be sending edits to jobrunners in the old DC, where the database master will reject them.
- Start maintenance in the new DC: 08-start-maintenance.py
- End the planned maintenance in StatusPage
Phase 9 - Post read-only
- Set the TTL for the DNS records to 300 seconds again: 09-restore-ttl.py
- Update DNS records for the new database masters, deploying eqiad -> codfw and codfw -> eqiad. This is not covered by the switchdc script. Please use the following to log to SAL:
!log Phase 9.5 Update DNS records for new database masters
- Run Puppet on the database masters in both DCs, to update expected read-only state: 09-run-puppet-on-db-masters.py. This will remove the downtimes set in Phase 0.
- Make sure the CentralNotice banner informing users of read-only is removed. Keep in mind that there is some minor HTTP caching involved (~5 min).
- Cancel the scap lock. You will need to go back to the terminal where you added the lock and press Enter. This is not covered by the switchdc script.
- Re-order noc.wm.o's debug.json to have primary servers listed first (see T289745) and backport it using scap. This will test scap2 deployment. This is not covered by the switchdc script.
- Update maintenance server DNS records eqiad -> codfw. This is not covered by the switchdc script.
- Reorder the data centers in the default stanza for geomaps. Make sure the new primary DC is set first. This is not covered by the switchdc script. This can happen days after the switchover.
- Note: The default only affects a small portion of traffic, so this is mostly about logical consistency (when we have no idea what to do, we prefer the primary DC).
Phase 10 - verification and troubleshooting
This is not covered by the switchdc script
- Make sure reading & editing works! :)
- Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, and the IRC feeds)
curl -s -H 'Accept: application/json' https://stream.wikimedia.org/v2/stream/recentchange | jq .
- Make sure email works: run sudo -i; sudo exim4 -bp | exiqsumm | tail -n 5 on mx1001/mx2001; the result should fluctuate between 0m and a few minutes. Also test sending an email.
- Put Listen to Wikipedia in the background during the switchover. Silence indicates read-only; when it starts making sounds again, edits are back up.
- App servers
- ATS cluster view (text)
- ATS backends<->Origin servers overview (appservers, api, restbase)
- Logstash: mediawiki-errors
General context on how to switchover
CirrusSearch talks by default to the local datacenter ($wmgDatacenter). No special actions are required when disabling a datacenter.
Manually switching CirrusSearch to a specific datacenter can always be done, e.g. pointing CirrusSearch to codfw by editing the relevant mediawiki-config setting.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
It is relatively straightforward for us to depool an entire datacenter at the traffic level, and this is regularly done during maintenance or outages. For that reason, we tend to only keep the datacenter depooled for about a week, which allows us to test a full traffic cycle (in theory).
General information on generic procedures
GeoDNS (User-facing) Routing:
- gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
- <any authdns node>: authdns-update
- Log to SAL using the following:
!log Traffic: depool eqiad from user traffic
(authdns-update from any authdns node will update all nameservers.)
Same procedure as above, but reverting the commit specified in GeoDNS.
All services are active-active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to one site only is the same for each of them:
- reduce the TTL of the DNS discovery records to 10 seconds
- depool the datacenter we're moving away from in confctl / discovery
- restore the original TTL
All of the above is done using the sre.discovery.datacenter cookbook in the case of a global switchover:
# Switch all services to codfw
$ sudo cookbook sre.discovery.datacenter depool eqiad --all --reason "Datacenter Switchover" --task-id T12345
This will depool all active/active services, and prompt you to move or skip active/passive services.
# Repool eqiad
$ sudo cookbook sre.discovery.datacenter pool eqiad --all --reason "Datacenter Switchback" --task-id T12345
This will repool all active/active services, and prompt you to move or skip active/passive services.
If you need to exclude services, using the old sre.switchdc.services cookbook is still necessary until exclusion is implemented:
# Switch all services to codfw, excluding parsoid and cxserver
$ sudo cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
If you are switching only one service, using the old sre.switchdc.services cookbook is still necessary:
# Switch the service "parsoid" to codfw-only $ sudo cookbook sre.switchdc.services --services parsoid -- eqiad codfw
apt.wikimedia.org needs a puppet change
Restbase-async is a bit of a special case, being pooled active/passive with the active in the secondary datacenter. As such, it needs an additional step if we're just switching active traffic over and not simulating a complete failover:
pool restbase-async everywhere
sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_from restbase-async
sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_to restbase-async
depool restbase-async in the newly active dc, so that async traffic is separated from real-users traffic as much as possible.
sudo cookbook sre.discovery.service-route --reason T123456 depool --wipe-cache $dc_to restbase-async
When simulating a complete failover, keep restbase pooled in $dc_to for as long as possible to test capacity, then switch it to $dc_from by using the above procedure.
As it is async, we trade the added latency from running it in the secondary datacenter for the lightened load on the primary datacenter's appservers.
These services require manual changes to be switched over and have not yet been included in service::catalog
- The DNS discovery name planet.discovery.wmnet needs to be switched from one backend to another as in example change gerrit:891369. No other change is needed.
- noc.wikimedia.org: This is no longer applicable as of September 2023; noc.wikimedia.org is now active/active in mw-on-k8s.
Main document: MariaDB/Switch Datacenter
Predictable, Recurring Switchovers
A few months after the Switchback of 2023, and following a feedback-gathering process, a proposal to move to a predictable set of switchover dates, while also increasing the Switchover duration to 6 months, was adopted and turned into a process. The document can be found in the link below:
See Switch Datacenter/Switchover Dates for a pre-calculated list up to 2050
- Services + Traffic: Tuesday, September 19th, 2023 14:00 UTC
- MediaWiki: Wednesday, September 20th, 2023 14:00 UTC
- Services: Tuesday, February 28th, 2023 14:00 UTC
- Traffic: Tuesday, February 28th, 2023 15:00 UTC
- MediaWiki: Wednesday, March 1st, 2023 14:00 UTC
- Read only: 1 minute 59 seconds
- Services: Tuesday, April 25th, 2023 14:00 UTC
- Traffic: March 14 2023
- MediaWiki: Wednesday, April 26th 2023 14:00 UTC
- Read only: 3 minutes 1 second
- Services: Monday, June 28th, 2021 14:00 UTC
- Traffic: Monday, June 28th, 2021 15:00 UTC
- MediaWiki: Tuesday, June 29th, 2021 14:00 UTC
- June 2021 Data Center Switchover on the Wikimedia Tech blog
- Services and Traffic on wikitech-l
- MediaWiki on wikitech-l
- Incident documentation/2021-06-29 trwikivoyage primary db
- Read only duration: 1 minute 57 seconds
- Services: Monday, Sept 13th 14:00 UTC
- Traffic: Monday, Sept 13th 15:00 UTC
- MediaWiki: Tuesday, Sept 14th 14:00 UTC
- Datacenter switchover recap on wikitech-l
- Read only duration: 2 minutes 42 seconds
- Services: Monday, August 31st, 2020 14:00 UTC
- Traffic: Monday, August 31st, 2020 15:00 UTC
- MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
- Incident documentation/2020-09-01 data-center-switchover
- Read only duration: 2 minutes 49 seconds
- Traffic: Thursday, September 17th, 2020 17:00 UTC
- MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
- Services: Wednesday, October 28th, 2020 14:00 UTC
- Services: Tuesday, September 11th 2018 14:30 UTC
- Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
- Traffic: Tuesday, September 11th 2018 19:00 UTC
- MediaWiki: Wednesday, September 12th 2018: 14:00 UTC
- Datacenter Switchover recap
- Read only duration: 7 minutes 34 seconds
- Traffic: Wednesday, October 10th 2018 09:00 UTC
- MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
- Services: Thursday, October 11th 2018 14:30 UTC
- Media storage/Swift: Thursday, October 11th 2018 15:00 UTC
- Datacenter Switchback recap
- Read only duration: 4 minutes 41 seconds
- Elasticsearch: elasticsearch is automatically following mediawiki switch
- Services: Tuesday, April 18th 2017 14:30 UTC
- Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
- Traffic: Tuesday, April 18th 2017 19:00 UTC
- MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
- Deployment server: Wednesday, April 19th 2017 16:00 UTC
- Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
- MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
- Elasticsearch: elasticsearch is automatically following mediawiki switch
- Services: Thursday, May 4th 2017 14:30 UTC
- Swift: Thursday, May 4th 2017 15:30 UTC
- Deployment server: Thursday, May 4th 2017 16:00 UTC
- Incident documentation/2017-05-03 missing index
- Incident documentation/2017-05-03 x1 outage
- Read only duration: 13 minutes
- Deployment server: Wednesday, January 20th 2016
- Traffic: Thursday, March 10th 2016
- MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
- Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
- Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
- Services: Monday, April 18th 2016, 10:00 UTC
- MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- Wikimedia failover test on Wikimedia Blog
- MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done
Aggregated list of interesting dashboards