Switch Datacenter/Recurring, Equinox-based, Data Center Switchovers

From Wikitech
This proposal has been adopted by the WMF and is now a process

Site Reliability Engineering will, starting September 2023, run a data center Switchover every 6 months, in the week of the solar Equinox, namely the work weeks containing March 21st and September 21st. If you are interested in learning more about Switchovers and why we perform them or already know what they are and want to learn more about how this proposal would impact your workflows or the Wikimedia Movement, please read on. Your comments will be very much appreciated. Make sure to also check out the FAQ at the end of the doc!

Introduction

Purpose of this proposal

The purpose of this document is to establish a process that sets standard, predictable, recurring points in time for when a data center Switchover will happen. This is in contrast to what we previously had, where a team negotiated the dates of the next Switchover with others, based on everyone’s availability. We also provide both an overview and a detailed account of the proposed process, explain the background and reasons for this proposal and outline the anticipated benefits.

What is not the purpose of this proposal

What is not in the scope of this proposal is to discuss the scope of each Switchover, i.e. whether component X should be included. That decision remains strictly in the hands of the teams owning that component.

Overview

The following is an overview of what is being proposed. For information as to what data center Switchovers are, please refer to Background. For more details about the proposed process, please refer to Details.

  • Data center Switchovers will take place on the work week of the Solar Equinox. For the purposes of this proposal, we assume the Northward Solar Equinox happens on March 21st and the Southward Solar Equinox on September 21st. This doesn't exactly match the astronomical event, and that's on purpose. Please see the detailed description below for more information.
  • The read-only part of the Switchover, aka MediaWiki Switchover, will always happen on the Wednesday of the above-mentioned week. During read-only, which lasts around 2 to 3 minutes, no Wiki will accept edits and editors will see a warning message asking them to try again later. Read-only starts at 14:00 UTC. Readers should experience no changes for the entirety of the event.
  • Various non-read-only parts of the Switchover will always take place on the Tuesday before the read-only part of the Switchover. There is no set time of the day for this one, contrary to the one above, as it is non-disruptive and with much lower risk.
  • For the next 7 calendar days after the read-only phase of the Switchover, traffic will be flowing solely to one of the 2 data centers, effectively rendering the other data center inactive.
  • On the Wednesday following the read-only phase of the Switchover, that is right after exactly 7 days, traffic will start flowing, in the normal Multi-DC way, to both data centers.
  • The concept of a Switchback, namely when we route all Wiki edits traffic back to our Virginia data center (eqiad), will cease to exist. The 2 data centers will be considered coequal, alternating roles every, roughly, 6 months.

A rough diagram of the key points in the process is presented below:

High level state diagram of Switchover steps

Benefits

Below we document a set of benefits from the establishment of this process to a variety of teams across the Foundation.

  • The Switchover date becomes predictable: Every entity, whether they are a WMF team, a chapter/affiliate or just a volunteer that is somehow affected, whether that is by being actually responsible for actions before, during or after the Switchover, using the Switchover or some consequence of it to facilitate parts of their work or just merely interested in it, will know about when it is going to happen and plan accordingly.
  • The duration a Wikimedia data center receives read/write Wiki traffic becomes predictable: Every entity, whether they are a WMF team, a chapter/affiliate or just a volunteer that is somehow affected, whether that is by being actually responsible for actions before, during or after the Switchover, using the Switchover or some consequence of it to facilitate parts of their work or just merely interested in it will know for how long Wiki edits traffic is going to flow to a data center and plan accordingly.

From the aforementioned, broad in nature, we can identify several specific advantages that we list below. Note that the list is not exhaustive.

  • Communications to the movement become predictable, easy to schedule well in advance and repeating in nature, allowing the creation of templated communications in a programmatic way. Furthermore, updates to the movement can always point to the exact same page. This page, translated into multiple languages, explains the reasons for this operation. It is linked to any announcement made (Tech News, CentralNotice banner, etc.). It can be bookmarked and referenced by anyone, stating the exact time of the next Switchover. The update of the date and time on that page could happen via automated means.
  • Teams on which a successful Switchover depends, e.g. Data Persistence, Service Operations, Community Relations, Search, can proactively, and with a 6-month advance notice, schedule the work that is a prerequisite for the Switchover.
  • Teams benefiting from the Switchover, e.g. Data Engineering, Data Persistence, Infrastructure Foundations, Machine Learning, Service Operations, can utilize the 6-month period during which one of the data centers doesn't receive Wiki edits traffic to schedule disruptive maintenance work in it.
  • We can cultivate a culture of living with a data center Switchover, making it a routine process that the entire movement is used to, and introducing it to newcomers as a process that happens alongside everything else that the movement has established. One interesting thing to point out is that if we start this process when proposed (September 2023), the Southward equinox will coincide with the move to Dallas, Texas (codfw). This is geographically further south than Virginia (eqiad), something that might help, at least U.S. residents, to remember the direction of the move.

Drawbacks

No new process comes without some drawbacks. We have identified the following:

  • Since we will be doing more Switchovers per year than what we have been doing up to now, we expect an initial higher cost as teams and the movement adjust. This is expected to be mostly around coordination as well as the required technical actions. For the former, the process will help as coordinating actions are going to settle into well-known and agreed-upon practices. For the latter, we expect to invest even more in automation to lower the overall cost.
  • The removal of the Switchback step will mean that some assumptions made in the past, namely that we would never be more than a couple of months in the Texas (codfw) data center, will no longer apply. This spans everything from the state of our data centers to the latency some users have grown accustomed to expect. We expect this to fade away as the new culture of doing this twice per year evolves and the entire movement adapts.
  • Readers, due to our global CDN’s caching capacity, aren’t expected to be impacted substantially, if at all, by this. Editors, however, depending on the type of action they undertake and whether we are in Multi-DC mode or not, are expected to experience some performance differences. During the 7-day period in which Multi-DC is inactive, those that are situated geographically closer to the active data center will see somewhat better performance, whereas editors geographically further away will see worse. When Multi-DC is active, only a small subset of their actions (the action of submitting an edit specifically) will experience worse performance. The difference, which we expect to be in the mid-tens of milliseconds, is low enough to probably not be perceptible by the vast majority. There’s some work (edge content composition) scheduled that would address some parts of this, due to the architectural changes it will incur.

Background

Data center Switchovers are a standard response to certain types of disasters. Technology companies regularly practice them to make sure that everything will work properly when the process is needed. Switching between data centers also makes it easier to do some maintenance work on non-active servers (e.g. database upgrades/changes), so while we're serving traffic from data center A, we can do all that work at data center B. A Wikimedia data center Switchover, from our Virginia data center (eqiad) to our Texas data center (codfw) or vice-versa, consists of switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This process, in the Wikimedia environment, is called, unsurprisingly, a data center Switchover.

Past switchovers

We have been doing successful Switchovers since 2016. The long detailed history is documented at Upcoming and Past Switches but the summary is that we have been managing to make them safer, faster and more efficient on every run, while also increasing their scope. Communities’ awareness of the Switchovers increased with time due to the collaboration with the Community Relation Specialists team, and they are now part of the maintenance routine, even if infrequently.

Current challenges

The process has always required preparation in advance. While the work needed for every Switchover now is way less than in the past, there have always been changes in the infrastructure that require at least one human running preparatory work. A big part of it, usually done at least 10 weeks before, has been to coordinate with multiple stakeholders across the organization, decide on dates and communicate the plan. Our proposal does away with this step, making clear to every stakeholder when this will happen.

Scheduling a Switchover requires figuring out the best point in time to run it. It usually involves an arbitrary number of constraints, most revolving around scheduling and planning:

  • Scheduling it during the Fundraising Season is a no-go
  • Scheduling it right before or right after the end of the calendar year is risky, as multiple key people might be on vacation
  • It's best to avoid scheduling it in July/August, as the risk of having multiple key people on vacation is high
  • Easter is the canonical movable feast, which can complicate scheduling

The process needs to be run in a recurring fashion, to ensure that our tooling and procedure are always up to date and ready for an emergency. For that reason, we have been doing it once a year, effectively doing it twice in the span of a few weeks (Switchover and then Switchback). We've long said that we should make it more predictable and recurring, but just around the time we were planning to make that happen, COVID-19 struck. We believe we are finally in a position again to work on this.

The process was originally devised while the SRE team was but a single team, but since the SRE team split, the Switchover has found a natural home in the Service Operations SRE team. Unfortunately, that means that running the process is heavily dependent on the capacity of said team. However, other teams' work is dependent on (or at least greatly facilitated by) the Switchover. This causes bottlenecks and the need for cross-team coordination and negotiation on the exact dates that suit everyone involved.

Details

This section expands on the definitions given in the Overview in order to tackle the inevitable edge cases that the astute reader has probably noticed.

The date

Given the above challenges/constraints, figuring out the exact date of the Switchover can be a difficult process. In order to avoid all this, we want to standardize around a set date.

Past experience has shown that we tend to schedule Switchovers on dates that are within a few weeks of the 3rd week of March or September anyway (yes, exceptions do exist, we know). This has been happening mostly for scheduling and planning reasons and stems from the need to have ample time before and after the Switchover.

We felt that picking something memorable that doesn't particularly change across cultures, countries, hemispheres, jurisdictions, etc. would allow more people to remember it and relate to it, making it more fitting to our Global Movement. Human-made things tend to vary a lot and have different connotations (including bad ones) across cultures, so we settled on something that has been close to a constant for humankind since time immemorial. That's an astronomical event that has been quite predictable by humankind for millennia: the Solar Equinox.

If you click on that link and read the first paragraph, you will have already read that "This occurs twice each year, around 20 March and 23 September". For the rest of this document, we ask you not to delve further into the details and simply say "On the week of the 21st of March/September".

If you have ever delved into anything remotely astronomy-related, you are probably shouting by now "But the Solar Equinox doesn't even happen on the same date every year". There is a FAQ entry for you at the bottom of this document, please read on.

The week

Let's now look into the week. At first glance, this appears easy to define. In fact, we apparently already did so in the section above, where we said "On the week of the 21st of March/September". This section deals with the edge cases around that definition.

We tried to figure out which definition of the week to use, that is, when a week starts, and which exact day of the week is the best to run the Switchover on. Suffice it to say that a week starts on a Monday in large parts of the world, on a Sunday in other parts, on a Saturday in some others and, apparently, per English Wikipedia, there is one country where the week starts on a Friday. Given the above, we abandoned the idea of the start of a week.

We then wondered about using the concept of the ISO Week for this. Given the complex mathematical formula in the middle of the page linked above, the author of this document severely doubts this would make things easier for anyone.

In the end, the answer was given by the choice of the day. We used to run the read-only part of the Switchover on Tuesdays and the other parts on the surrounding days. Things have changed since then and we now utilize only 2 days of the week. Furthermore, Mondays tend to be an observed holiday, or an extra day off after the weekend, in some parts of the world. Thus we went for Wednesday for the read-only part, and Tuesday for the other parts.

So, to cut this short:

  • If the 21st is a Monday/Tuesday/Wednesday/Thursday/Friday, the Switchover happens on the Wednesday closest to the 21st. If Wednesday is the 21st, that's the date.
  • If the 21st is on a Saturday/Sunday, the Switchover happens on the Wednesday right after.

This way, the range of dates the read-only part of the Switchover can happen on, assuming UTC, is: 19th to 25th. This maximizes the chance it's close enough to the Equinox and will happen on a date starting with a "2" 6 out of 7 times, making it somewhat easier to remember.
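
The rule above is mechanical enough to be expressed in a few lines of code. The following is a minimal sketch in Python, not part of the official Switchover tooling. The function name `read_only_date` is hypothetical, and the sketch assumes the March/September 21st anchor described above, with all dates taken in UTC.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() numbering: Monday is 0, Sunday is 6


def read_only_date(year: int, month: int) -> date:
    """Return the Wednesday of the read-only phase for March (3) or September (9)."""
    if month not in (3, 9):
        raise ValueError("Switchovers are anchored on March 21st or September 21st")
    anchor = date(year, month, 21)
    weekday = anchor.weekday()
    if weekday <= 4:
        # Monday to Friday: the Wednesday of the same work week.
        offset = WEDNESDAY - weekday
    else:
        # Saturday or Sunday: the Wednesday right after.
        offset = WEDNESDAY - weekday + 7
    return anchor + timedelta(days=offset)


if __name__ == "__main__":
    for year in (2023, 2024, 2025):
        for month in (3, 9):
            print(read_only_date(year, month).isoformat())
```

Under these assumptions, September 21st 2023 (a Thursday) maps to Wednesday, September 20th 2023, and September 21st 2024 (a Saturday) maps to Wednesday, September 25th 2024. Looping over a range of years in the same way would yield a pre-calculated table of dates.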

The time

The read-only part of the Switchover, aka MediaWiki Switchover, will always happen on the Wednesday of the above-mentioned week. For the duration of that part, which is expected to be around 2 to 3 minutes, no Wiki will be able to accept edits and editors will see a warning message telling them to try again later. Readers, who account for the vast majority of our traffic, will continue to be able to read articles as usual. This part will start at 14:00 UTC. Don't expect to-the-second accuracy, but we will do our best to be on schedule.

Various non-read-only parts of the Switchover will happen on the Tuesday before the read-only part of the Switchover. There is no set time of the day for this one, contrary to the one above, as it is non-disruptive and carries much lower risk. That being said, we will be optimistically targeting 14:00 UTC as well.

Auxiliary components of the infrastructure (e.g. Gitlab, Planet, People) might be moved after the Switchover. Some of these components are newcomers to the Switchover process and, for now, don't need to adhere to the schedule detailed above. They will issue their own announcements as deemed necessary. This more relaxed stance is expected to be kept for all new components opting to participate in the Switchover, allowing them to evolve their processes at their own pace. Long term, we expect them to be part of the usual Switchover process.

The duration

This is the section where most of the changes from the current status quo are described.

Single data center duration

Up to now, we've been following a rough plan of staying for an entire week with the data center we moved away from completely drained of all traffic. The reasoning behind this is to make sure that the data center receiving all the traffic can survive an entire week's traffic patterns. A secondary purpose is to provide some teams, predominantly (if not exclusively) in SRE, with a clearly defined and agreed-upon time frame to perform otherwise risky maintenance work in the fully drained data center. This part remains the same. For the 7 days following a Switchover, we will be in a single data center. On the Wednesday following the Switchover, we will be reverting to Multi-DC, our normal mode of operation.
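
The key points in time around a single Switchover can be sketched as follows. This is an illustrative Python sketch only: `switchover_timeline` and the dictionary keys are hypothetical names, not part of any existing tooling, and the input is assumed to be the Wednesday of the read-only phase.

```python
from datetime import date, datetime, time, timedelta, timezone


def switchover_timeline(read_only_day: date) -> dict:
    """Derive the key points of one Switchover from its read-only Wednesday."""
    if read_only_day.weekday() != 2:  # Monday is 0, so Wednesday is 2
        raise ValueError("the read-only phase always happens on a Wednesday")
    read_only_start = datetime.combine(read_only_day, time(14, 0), tzinfo=timezone.utc)
    return {
        # Non-read-only parts: the Tuesday before, with no fixed time of day.
        "non_read_only_day": read_only_day - timedelta(days=1),
        # Read-only phase: starts at 14:00 UTC and lasts around 2 to 3 minutes.
        "read_only_start": read_only_start,
        # Return to Multi-DC: the following Wednesday, 7 days later.
        "multi_dc_day": read_only_day + timedelta(days=7),
    }


if __name__ == "__main__":
    for name, value in switchover_timeline(date(2023, 9, 20)).items():
        print(name, value.isoformat())
```

Between `read_only_start` and `multi_dc_day`, traffic flows to a single data center. After that, both data centers serve traffic again.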

Total data center switchover duration

What has varied greatly in the past was the amount of time that we would stay in what we always called "the secondary data center". Per our Past Switchovers we have been doing anything between 2 and 11 weeks. We started off with a small duration, progressively increasing the time frame as things evolved. We also anticipated and scheduled well in advance what we called internally the "Switchback". There have been a variety of contributing factors for this, ranging from cultural ones to technical ones.

In the cultural category, we can probably point to a wish that the person running point on the Switchover would also run point on the "Switchback". This stemmed mostly from the increased communication needs of the process back then (it was an entire project), and thus the interpersonal relationships that person would build during it were deemed important. Also, to many of us, our data center in Virginia (eqiad) felt (and possibly still feels) like home.

In the technical category, we can probably point out that we did not want Virginia (eqiad) to become too "cold" to handle our usual traffic patterns and that some workloads suffered from extra latency when traffic was flowing exclusively to Texas (codfw).

The landscape depicted above has changed and things have evolved. The process is straightforward and automated enough by now that we already had a Switchover and a Switchback run by different people in 2021. SRE finds that the need to heavily coordinate with other teams before performing one has diminished substantially (few teams outside SRE are stakeholders by now). Multi-DC has alleviated the technical issues.

We have found ourselves again and again in a discussion that we should be doing this more often, ideally every few weeks without having to coordinate or give advance notice to anyone. This would anyway be the case in a real emergency. We are still quite far away from this dream, but we are in a position now that we can do this often enough that it becomes a routine.

Without further ado:

  • We will be letting go of the idea of the "Switchback".
  • We will only be doing "Switchovers" between coequal data centers.
  • We will be performing a Switchover roughly every 6 months.

Amendments to the plan

No plan survives contact with reality intact. If SRE discovers a need to alter the plan in some way, they will inform relevant stakeholders about the changes.

Implementation plan

Timeline for implementing the new process

All dates are in 2023, unless differently noted:

  • Week of May 15th – Review by SRE Service Operations
  • Week of May 22nd – Review by all of SRE
  • Weeks of June 6th and June 12th – Review by all of Product + Technology
  • Week of June 19th – Address and incorporate input
  • Week of June 26th – Announce to the org
  • Week of August 28th – Announce to the movement
  • Next 4 weeks – Implementation of an automatically (via a bot) updated Wiki page informing everyone when the next Switchover will take place
  • Late August – pick a point person for the September 2023 Switchover, possibly for a couple of the subsequent ones as well.
  • Week of September 21st – First Switchover, from Virginia (eqiad) to Texas (codfw) with the new process in place
  • Week of March 21st 2024 – Second Switchover, from Texas (codfw) to Virginia (eqiad) with the new process in place
  • October 2025 – After 4 Switchovers, evaluation of the new process.

Roles and responsibilities of stakeholders

The following teams are the current stakeholders of the process:

  • SRE Service Operations: Process owners, implement the automation required, coordinate with all other teams, and are ultimately responsible for a smooth Switchover.
  • SRE Data persistence: Ensure that the data store layer for MediaWiki, namely our Relational Databases, is ready for the Switchover.
  • SRE Traffic: Ensure that the Content Delivery Network (CDN) layer is properly configured when pooling and de-pooling data centers.
  • Community Relations: Handles the communications to the movement, announcing the Switchover (and this change) in advance.
  • SRE Collaboration: They own a set of infrastructure components that are being onboarded on the Switchover process, via a “delegated cookbook” approach, increasing the scope of the Switchover.

There is also a non-exhaustive list of teams that are impacted by this and might need to perform actions to react to it. This list will probably evolve over time:

  • SRE: The Switchover has the potential to end up in an all-hands-on-deck situation. This is well known and accepted already. Furthermore, all SREs MUST adhere to the Switchover scheduling and pause day-to-day operations that could interfere with the Switchover, e.g. Puppet merges, (re-)imaging servers, deploying, etc. The duration of that pause is expected to be less than 1 hour and the premise is to reduce risk and increase the signal-to-noise ratio of the alerting systems, making the work of the person running the Switchover easier.
  • SRE-at-large+people with root access: This is a larger subset that includes all of the SRE team plus SREs in teams that have embedded SREs. The same rules apply as above. The exception is that this set of people is not on-call and, even though they are welcome to help in an emergency, it's not part of their duties.
  • Release Engineering: During the read-only part of the Switchover, the MediaWiki train, owned by this team, MUST NOT run. It can still run on the same day, subject to their (and the SRE team's) discretion. The Deployment Calendar, also owned by this team, should contain informative entries about the Switchover, for the benefit of all deployers.
  • SRE Infrastructure Foundations: They might be needed to review parts of code (e.g. cookbooks) that automate the Switchover.
  • Various teams owning eqiad-only services: Some services, e.g. Wikitech, parts of Toolforge, etc. will be experiencing somewhat higher latencies for calls to MediaWiki R/W endpoints for half the year.
  • Deployers+Maintenance script/job runners: Deployment and maintenance hosts are also being switched over during this process. People/Teams relying on them will be affected by this, as they might end up trying to use the wrong host and get informed by the platform. Experience has shown that it is more of an annoyance than an actual problem. Nonetheless, we think it’s worthwhile to note it. While the nature of this is going to evolve in the future, it will probably exist in some form for a long time.

Communication plan for introducing the new process to the movement

  • The timeline is already detailed above. After feedback is incorporated and the announcement to the org is done, Service Operations will cooperate with Community Relations to announce the change to the wider movement. Community Relations has already been announcing the past Switchovers, and we expect to use the same channels. The exact wording will be worked out by the two teams.

Evaluation

Metrics to evaluate the success of the new process

We are already tracking Switchover performance. Data is at Past Switchovers. We will continue tracking that performance and documenting it in the above page in the same manner. We do not expect to see any significant difference.

What we do expect to change is the need to announce the Switchover widely, the need to coordinate with teams and the number of incoming requests to Service Operations asking when a Switchover will happen. We have always kept track of these, in a semi-structured way, via Phabricator. We will continue doing so and evaluate after 4 iterations whether this has led to fewer requests.

We also expect the rate of resolving Switchover-related follow-up tasks to become more stable. Right now, it's erratic, as many get filed and resolved right after a Switchover and some are resolved in preparation for the next Switchover. We've been tracking those via Phabricator and we will continue to do so.

Conclusion

To reiterate the main benefits of this change:

  • The Switchover date becomes predictable
  • The duration a Wikimedia data center receives read/write Wiki traffic becomes predictable

We'd also reiterate that we'd like to start this in September 2023 and ask for every stakeholder's input.

FAQ

That week is “Daylight Confusion Time/Week”. Will that matter?

Daylight Confusion Time is an informal Wikimedia Movement term. It describes the discrepancy that stems from the fact that Daylight Saving Time is not observed at the same time in all countries/states that apply it. For instance, when Europe and North America apply it, they do it with a two-week offset. The situation varies even more across jurisdictions and even governmental decisions. It can be a confusing time for people scheduling events across different time zones, as their usual assumptions don’t apply. Since this process has only one strictly time-defined event, namely the read-only part, and that event is set in UTC (which doesn’t have DST), we don’t expect any problems from this. If anything, the smaller time zone difference between some countries might help with coverage. To dispel any further fears that this might cause issues, we'd like to note the following:

  1. We've already done multiple Switchovers spanning Daylight Confusion Time. As far as we know, we haven't experienced anything related to time zones.
    • 2023 – switchover contained the entirety of Daylight Confusion Time
    • 2020 – DST ended in the EU on Oct 25th, we switched back on Oct 27th, you can't do more DCT than that.
    • 2016 – The initial one, full of unknowns, was literally tested on March 15th, right in the middle of DCT.
  2. All our computer systems have always been following UTC (Coordinated Universal Time). UTC does not have Daylight Saving Time, so software glitches related to time zone changes don’t affect it. No system in our infra will implicitly get some other TZ. It requires explicit behavior to do so. Even more, no code that does not deal with time zones on purpose will end up dealing with more than one TZ, which is where this concern stems from anyway.
  3. As we grow across the globe and the movement becomes more and more global, time zone differences become more and more of a thing and thus, invariably, a part of our culture. Thus, the concern, as well as the term, should gradually apply less and less, and we don’t want to stitch such a term into our process.
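
To illustrate why a UTC-anchored start time sidesteps Daylight Confusion Time, the sketch below converts the same 14:00 UTC instant to two local time zones. It is only an illustration: the function name `local_start_times` is hypothetical, the two zones and the example dates are arbitrary picks, and it relies on Python's standard `zoneinfo` module, which needs time zone data to be available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_start_times(year: int, month: int, day: int) -> dict:
    """Express the 14:00 UTC read-only start in a few example local time zones."""
    start_utc = datetime(year, month, day, 14, 0, tzinfo=timezone.utc)
    zones = ("America/New_York", "Europe/Berlin")  # arbitrary examples
    return {
        zone: start_utc.astimezone(ZoneInfo(zone)).strftime("%H:%M %Z")
        for zone in zones
    }


if __name__ == "__main__":
    # 2024-03-20: DST is already in effect in the US, but not yet in the EU.
    print(local_start_times(2024, 3, 20))
    # 2023-09-20: DST is in effect in both.
    print(local_start_times(2023, 9, 20))
```

The UTC instant is identical in both cases. Only its local wall-clock rendering changes, and in the March example the gap between the two zones shrinks from the usual 6 hours to 5.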

What's the deal with the Solar Equinox?

The author (Alexandros) likes astronomical phenomena!

On a more serious note, past experience has shown that we tend to schedule Switchovers on dates that are within a few weeks of the Equinox anyway. This is mostly for scheduling and planning reasons and stems from the need to have ample time before and after the Switchover. Picking something memorable to relate the Switchover to felt like a plus. Furthermore, in mid-latitudes, the Solar Equinox marks the beginning of Spring/Autumn, which we hope is another way some people might remember this.

Your dates for the Solar Equinox are wrong!

We know! Let's just pretend for the sake of everyone involved, that these events happen on March 21st and September 21st.

For those who are actually interested in when these events happen, the answer is "when the Sun crosses the Earth's equator" which is shorthand for "Not the same date and time every year".

For the current decade, the Northward Equinox happens on March 20th, assuming the observer is on UTC. If the observer is not on UTC, it depends: it could be the 19th or the 21st. The Southward Equinox, assuming UTC, oscillates between September 22nd and September 23rd for the current decade. If the observer is not on UTC, we are talking about anything between the 20th and the 24th, depending on which time zone they are in. You can take a peek at the table in Equinox if you are interested. If you start wondering what will happen past this decade, please see Solstices and Equinoxes: 2001 to 2100.

Instead of all of the above, let's just say on the 21st and call it a day. Why 21? Because (20+22)/2 = 21. Also, your interest is duly noted and you probably have made a friend. Chat with Alexandros, they like you already.

Appendix

Switchover diagram Graphviz code

digraph switchover {
   size="7,8";
   // Timeline labels, drawn as plain text
   node [fontsize=24, shape = plaintext];
   "Prep work" -> "Tuesday";
   "Tuesday" -> "Wednesday 14:00 UTC";
   "Wednesday 14:00 UTC" -> "7 days later";
   // Switchover steps, drawn as boxes
   node [fontsize=20, shape = box];
   // Align each step with its point on the timeline
   { rank=same;  "Tuesday" "MW-adjacent Services" "CDN Traffic"; }
   { rank=same;  "Wednesday 14:00 UTC" "Read-only"; }
   { rank=same;  "7 days later" "Multi-DC"; }
   // Order in which the steps happen
   "MW-adjacent Services" -> "Read-only";
   "CDN Traffic"  -> "Read-only";
   "Read-only" -> "Multi-DC";
}

Pre-calculated Switchover dates up to 2050

See Switch Datacenter/Switchover Dates