GitLab/Failover
GitLab has an active host and one or more replicas. The replicas are currently cold standbys, meaning they don't serve any production traffic and hold data that is up to 24h old. For maintenance or in case of emergency, it is possible to fail over from the active host to a replica. This page describes the process broadly.
Warning: This process is not automated and can cause data loss!
The process takes between 1h and 1h30, depending on backup size. GitLab is unavailable during that time.
Prerequisites
The host to fail over to should be a proper GitLab replica, meaning it:
- has a second IPv4 and IPv6 address configured as
profile::gitlab::service_ip_v4
and profile::gitlab::service_ip_v6
- is running the puppet
role(gitlab)
- has enough disk space
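This page does not say how much disk space counts as "enough". A minimal sketch of a check, assuming the restore needs roughly the size of the backup archive times a safety factor. The helper names and the default factor of 3 are assumptions for illustration, not documented values:

```shell
# Hypothetical helper: succeed if the available space covers the backup size
# times a safety factor.
# Arguments: available KiB, backup size KiB, factor (assumed default: 3).
has_enough_space() {
    [ "$1" -ge $(( $2 * ${3:-3} )) ]
}

# Print the available KiB on the filesystem that holds the given path.
# df -P guarantees one line per filesystem, so field 4 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Example, run on the replica (paths taken from the backup step on this page):
#   has_enough_space "$(avail_kb /srv/gitlab-backup)" \
#       "$(du -k /srv/gitlab-backup/latest/latest-data.tar | cut -f1)"
```

The functions only compare numbers, so the factor should be adjusted to whatever the restore actually needs on the target host.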
Planned Failover
A planned failover means the old production instance is responding and working properly, so a recent backup can be created. There is no data loss. The following steps are needed to fail over to a new host.
Before failover
- copy ssh host keys for
/etc/ssh-gitlab
daemon from old host to new host; this can be done from a cumin host using:
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 <OLD_HOST>.wikimedia.org:/etc/ssh-gitlab/ <NEW_HOST>.wikimedia.org:/etc/ssh-gitlab/
- make sure you have access to the new instance (should be in place already, because accounts and tokens are similar to the production instance):
- you can log in to the instance
- you have admin privileges and have created a personal access token
- apply gitlab-settings to new host (done for all replicas)
- lower TTL for gitlab.wikimedia.org (example change 802090). Possibly not needed anymore, because the restore takes more than 10 minutes.
- announce downtime some days ahead on engineering-all and #wikimedia-gitlab
During failover
- pause all GitLab Runners
- disable puppet on old host with
sudo disable-puppet "Failover in progress"
- stop write access by stopping nginx and ssh-gitlab on old host with
gitlab-ctl stop nginx
and systemctl stop ssh-gitlab
- create full backup on old host:
/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1" && ls -t "/srv/gitlab-backup"/*gitlab_backup.tar | head -n1 | xargs -i cp {} "/srv/gitlab-backup"/latest/latest-data.tar
- sync backup to new host:
/usr/bin/rsync -avp /srv/gitlab-backup/latest/ rsync://<NEW_HOST>.wikimedia.org/data-backup
- configure new host with
profile::gitlab::service_name: 'gitlab.wikimedia.org'
(example change 802150)
- configure new host in
profile::gitlab::active_host
(example change 802150)
- trigger restore on new host
/srv/gitlab-backup/gitlab-restore.sh
- overwrite home_page_url: on new host, run
echo "ApplicationSetting.last.update(home_page_url: 'https://gitlab.wikimedia.org/explore')" | /usr/bin/gitlab-rails console
- Point DNS entry for `gitlab.wikimedia.org` to new host (example change 802473) and run
authdns-update
- verify installation (login, push, pull, look at metrics)
- run puppet on new host
- enable puppet on old host with
sudo enable-puppet "Failover in progress"
- unpause all GitLab Runners
- announce end of downtime
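Parts of the "verify installation" step can be scripted. A rough sketch, assuming `dig` and `curl` are available on the machine running the check. The function names are hypothetical, and the address to expect is whatever the DNS change sets. Login, push, pull and metrics still need a manual look:

```shell
# Succeed if the expected address appears as a complete line on stdin.
# -x matches whole lines only, -F treats the address as a literal string.
address_listed() {
    grep -qxF -- "$1"
}

# Succeed if the A records for a name include the expected address.
# dig +short prints one record per line.
dns_points_to() {
    dig +short "$1" A | address_listed "$2"
}

# Print the HTTP status code for a URL; curl prints 000 if the request fails.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Example (replace <NEW_ADDRESS> with the address from the DNS change):
#   dns_points_to gitlab.wikimedia.org <NEW_ADDRESS>
#   http_status https://gitlab.wikimedia.org/explore
```

A status of 200 for the explore page is the expected result only if that page is publicly reachable, which is an assumption here. DNS caches may also keep returning the old address until the TTL expires.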
Unplanned Failover
An unplanned failover means the old production instance is not responding or is lost, and it is not possible to create a backup. There is up to 24 hours of data loss in GitLab.
Get the most recent data possible
Check the age of the backup in bacula and on the existing replicas. If the backup is reasonably recent, use it (make sure to check GitLab/Backup and Restore#Fetch backups from bacula). If that backup is too old, try to manually schedule a database dump and rsync the git repositories. However, this is not an automated step and needs more planning.
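The age check on a replica can be sketched as follows, assuming GNU coreutils (`stat -c`) and that the file's modification time reflects when the backup was written. The function names are hypothetical. The default threshold of 24 hours mirrors the maximum replica lag mentioned at the top of this page:

```shell
# Print the age of a file in whole hours, based on its modification time.
# Assumes GNU stat (as shipped with coreutils).
backup_age_hours() {
    mtime=$(stat -c %Y "$1") || return 1
    now=$(date +%s)
    echo $(( (now - mtime) / 3600 ))
}

# Succeed if the file is at most max_hours old (assumed default: 24).
backup_is_fresh() {
    age=$(backup_age_hours "$1") || return 1
    [ "$age" -le "${2:-24}" ]
}

# Example, run on a replica (path taken from the backup step on this page):
#   backup_age_hours /srv/gitlab-backup/latest/latest-data.tar
#   backup_is_fresh /srv/gitlab-backup/latest/latest-data.tar 24
```

This only covers the copy on the replica. The age of the backup in bacula has to be checked separately.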
During failover
The following steps assume that the old host is not available anymore and that a replica with the most recent ("latest") backup is used for the failover:
- configure new host with
profile::gitlab::service_name: 'gitlab.wikimedia.org'
(example change 802150)
- configure new host in
profile::gitlab::active_host
(example change 802150)
- if needed, trigger a restore on new host
/srv/gitlab-backup/gitlab-restore.sh
(not needed if new backup can't be created)
- overwrite home_page_url: on new host, run
echo "ApplicationSetting.last.update(home_page_url: 'https://gitlab.wikimedia.org/explore')" | /usr/bin/gitlab-rails console
- Point DNS entry for `gitlab.wikimedia.org` to new host (example change 802473) and run
authdns-update
- verify installation (login, push, pull, look at metrics)
- run puppet on new host