Toolforge Redis runs on three nodes (tools-redis-5/6/7). One of them is the master and the others are replicas. Keepalived ensures that a virtual IP address, which clients connect to, is always assigned to the master.

If the current master goes down, Redis Sentinel should notice that within five seconds and automatically fail over to a replica. It might take an additional 10 seconds for the floating IP to move to the new master.

If Sentinel does not fail over to a new node (use redis-cli info replication to check), look into /var/log/redis/redis-sentinel.log on any node that is still up. If the IP address does not move, check sudo systemctl status keepalived and verify that the /usr/local/bin/wmcs-check-redis-master script exits with code 0 on the master and 1 on the replicas.
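The two checks above can be scripted. The following is a minimal sketch, assuming redis-cli can reach the local instance on its default port without extra options (authentication, if configured, is not handled). The function names replication_role and describe_check_exit are hypothetical helpers and not part of the Toolforge tooling.

```shell
# Extract the "role" field from `redis-cli info replication` output on stdin.
# Redis reports "master" for the master and "slave" for a replica.
replication_role() {
    # INFO lines end in CRLF, so strip the carriage returns first.
    tr -d '\r' | sed -n 's/^role://p'
}

# Translate the exit code of wmcs-check-redis-master into the meaning
# described in the text: 0 on the master, 1 on the replicas.
describe_check_exit() {
    case "$1" in
        0) echo "master" ;;
        1) echo "replica" ;;
        *) echo "unexpected exit code $1" ;;
    esac
}

# Usage on a Redis host (not run here):
#   redis-cli info replication | replication_role
#   /usr/local/bin/wmcs-check-redis-master; describe_check_exit $?
```

If the two answers disagree on a node, for example Redis reports master but the check script exits with 1, that mismatch is a likely reason for the virtual IP not moving.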

Note that Sentinel requires a quorum to perform any action, which means that it will not function with two nodes down. Additionally, Redis has been configured to not accept any writes on the replicas, or on the master if no replicas are connected.
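The arithmetic behind the two-nodes-down limit can be sketched as follows. This assumes one Sentinel per Redis node (three in total) and that a failover needs the agreement of a majority of them; the quorum value actually configured for Toolforge is not stated here, and the function names are hypothetical.

```shell
# Smallest number of Sentinels that forms a majority of $1.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if $2 reachable Sentinels out of $1 total can still
# authorise a failover, fails otherwise.
can_fail_over() {
    [ "$2" -ge "$(majority "$1")" ]
}

# Usage:
#   can_fail_over 3 2 && echo "failover possible"
#   can_fail_over 3 1 || echo "no majority, Sentinel will not act"
```

With three Sentinels the majority is two, so losing one node is tolerated while losing two is not.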

Systemd unit

We are not using the default systemd unit redis-server.service that comes with the Debian package. Instead, we use a custom unit named redis-instance-tcp_6379.service that is deployed via Puppet.
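Because the unit name differs from the Debian default, status and log commands need to name it explicitly. A minimal sketch, where the UNIT variable and the redis_unit_state function are hypothetical conveniences:

```shell
UNIT="redis-instance-tcp_6379.service"

# Print the unit's state ("active", "failed", ...) as reported by systemd.
redis_unit_state() {
    systemctl is-active "$UNIT"
}

# Usage on a Redis host (not run here):
#   redis_unit_state
#   sudo journalctl -u "$UNIT" --since "1 hour ago"
```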

Manual failover

If you need to force a failover or perform other Sentinel actions, you can connect to it using redis-cli on port 26379:

taavi@toolsbeta-redis-1:~$ redis-cli -p 26379
127.0.0.1:26379>

Sentinel commands are listed in the Redis Sentinel documentation. Use toolforge as the "master name".

The most useful command is sentinel failover toolforge which forces a failover to any other available node. You can alternatively add the IP address of the node to fail over to.
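For non-interactive use, the same commands can be passed to redis-cli as arguments. The sketch below is a hedged example: the sentinel and current_master functions and the REDIS_CLI override variable are hypothetical, and it assumes the local Sentinel accepts unauthenticated connections on port 26379.

```shell
# Override REDIS_CLI to point at a different binary if needed.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Run a Sentinel command against the local Sentinel on port 26379.
sentinel() {
    "$REDIS_CLI" -p 26379 sentinel "$@"
}

# Print the address of the current master as "ip:port".
current_master() {
    # When its output is piped, redis-cli prints the reply as two plain
    # lines: the IP address, then the port. Join them with a colon.
    sentinel get-master-addr-by-name toolforge | tr -d '\r' | paste -s -d: -
}

# Usage on a Redis host (not run here):
#   current_master                # note the master before the failover
#   sentinel failover toolforge   # force a failover
#   current_master                # should now report a different node
```

Comparing the output of current_master before and after the failover confirms that Sentinel actually promoted another node.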

Puppet configuration

Puppet is configured to never update (replace => false) the config files /etc/redis/tcp_6379.conf and /etc/redis/sentinel-toolforge.conf, to prevent clashes with Redis Sentinel, which can also modify those files.

This means that if you change or add a config value in modules/profile/manifests/toolforge/redis_sentinel.pp, it will not end up in the actual config files unless you manually remove those files on all Redis servers and let Puppet recreate them.

# # Repeat on all Redis hosts, starting from the current primary
# rm /etc/redis/sentinel-toolforge.conf /etc/redis/tcp_6379.conf
# run-puppet-agent

The commands above will cause Redis to restart and likely cause a short Redis outage.

A possible improvement to the current setup is tracked in phab:T366365.

