VRT System/Failover

If there is sufficient time before the maintenance, add it to next week's Tech News.

VRTS has one active host (currently vrts1001) and one replica (vrts2001).

Introduction

Use this process only when other courses of action have failed. Please confer with other SREs before running it.

For some time now, VRTS has had only one primary server, in eqiad (vrts1001). Failing over the system is not straightforward, for two reasons. First, the database is very large, so backing it up and restoring it could take days. Second, the database shares a section with other, unrelated services, so failing it over could mean failing over those services as well, which is inconvenient for everyone involved. This guide is therefore only to be used when maintenance work in eqiad forces us to switch traffic to codfw (vrts2001), at least temporarily, until services are restored in eqiad.

Prerequisites

The host to fail over to should be a proper VRTS replica, meaning:

is running the puppet role(vrts)
has the same files as the primary in /opt. An rsync setup currently exists for this and can be run using sudo /usr/bin/rsync --rsh /usr/local/sbin/sync-vrts-ssl-wrapper -av --progress rsync://vrts1001.eqiad.wmnet/vrts /opt/
a fresh backup exists (check MariaDB/Backups#Dashboard)

This is mostly already done for the currently configured replica vrts2001.
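The rsync prerequisite above can be rehearsed before the real sync. The following is a minimal sketch, assuming the command is run on the replica; sync_vrts_opt is a hypothetical helper name, not an existing script. The rsync invocation itself is the one given on this page, and --dry-run is the standard rsync option that lists what would be transferred without changing anything.

```shell
# Hypothetical helper (not an existing script): wraps the rsync command
# from this page so that extra options can be passed through to rsync.
sync_vrts_opt() {
  # "$@" expands to any extra rsync options, for example --dry-run.
  sudo /usr/bin/rsync --rsh /usr/local/sbin/sync-vrts-ssl-wrapper \
    -av --progress "$@" \
    rsync://vrts1001.eqiad.wmnet/vrts /opt/
}

# Preview the changes without writing anything to /opt/:
#   sync_vrts_opt --dry-run
# Then run the real sync:
#   sync_vrts_opt
```

Doing the dry run first makes it easy to spot an unexpectedly large difference between the two hosts before anything under /opt/ is modified.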

Planned Failover

A planned failover means the old production instance is responding and working properly. The following steps are needed to fail over to a new host:

Log in to the VRTS dashboard with an admin account and schedule a new system maintenance for the time of the planned failover. This can be done from Admin -> System Maintenance. This matters because one of the critical goals during a failover is that no one writes to the database. Maintenance mode ensures that only admins can log in, which greatly reduces the number of people actively using the system, and admins can easily be told not to perform any critical tasks during the failover.
Prepare the DNS patch: in the DNS repo, open the wmnet template and change the record that ticket points to. It is in the "misc services without multiple backends" section.
Ensure the new host is listed as the active_host in the hieradata/role/common/vrts.yaml file. Since there are only two hosts, you can simply swap the values of active_host and passive_host.
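Before sending the hiera change for review, it can help to confirm which host each key points to. The following is a minimal sketch, assuming each key sits on a single line of the form "key: value", possibly with a namespace prefix; vrts_host_for is a hypothetical helper name, and the exact key layout in the file is an assumption.

```shell
# Hypothetical helper: print the value of a hiera key from a YAML file.
#   $1 = key name (active_host or passive_host)
#   $2 = path to the YAML file
# Assumes a simple one-line "key: value" layout; commented-out lines
# (those with a # before the key) are skipped.
vrts_host_for() {
  sed -n "s/^[^#]*$1:[[:space:]]*//p" "$2"
}

# Example, run from the root of a puppet repo checkout:
#   vrts_host_for active_host  hieradata/role/common/vrts.yaml
#   vrts_host_for passive_host hieradata/role/common/vrts.yaml
```

After the swap, the first command should print the new host and the second the old one.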

Unplanned Failover

An unplanned failover means the old production instance is not responding or has been lost.

This page is a part of the SRE Collaboration Services technical documentation