Memcached for MediaWiki/Memcached server failure
Handle memcached server failure
Memcached servers are named mc1xxx in eqiad and mc2xxx in codfw. They run both memcached and redis.
On server failure Icinga alerts, but does not page.
When an mc1* server fails (or is taken down for maintenance), memcached accesses are automatically rerouted to the memcached gutter pool (servers mc-gp*). This is a feature/configuration of mcrouter, the proxy we use to access memcached. There are 18 mc servers and 3 mc-gp servers per data center. The mc-gp servers are more powerful (10 Gb network cards, 2x RAM), as they are designed to cope with multiple mc servers failing. Data loss is expected, as the cached content on the failed server is gone.
When an mc1* server fails (or is taken down for maintenance), redis accesses are rerouted/rehashed to another mc1* server. This is a feature/configuration of nutcracker, the proxy we use to access redis. Data loss is expected, as the keys hosted on the failed server are gone (see T213089 for more info).
However, for redis we have indications that our MediaWiki application does not cope well with this rehashing mechanism. T272319 contains an analysis showing that the rehashing causes the error rate to go up, and that manually reestablishing the original count of 18 servers causes the error rate to drop back to normal levels. This behavior is unexpected, but it persisted for weeks, from Jan 15 to Feb 25 2021. The cause seems to be connected to the way we use redis and to nutcracker's attempts to reestablish the connection to the failed server. See the comments in T272319#6860424.
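To make the rerouting/rehashing described above more concrete, here is a toy sketch of a hash ring in Python. It is only an illustration and rests on assumptions: it assumes a consistent-hashing style distribution, it is not nutcracker's actual algorithm or configuration, and the function names (build_ring, lookup, moved_keys) are made up for this example. The shard names follow the shard01 ... shard18 naming used in redis.yaml.

```python
import hashlib
from bisect import bisect


def _hash(value):
    # Map a string to a large integer position on the ring.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def build_ring(servers, points_per_server=100):
    # Each server gets several points on the ring to even out the load.
    ring = sorted(
        (_hash(f"{server}-{i}"), server)
        for server in servers
        for i in range(points_per_server)
    )
    positions = [pos for pos, _ in ring]
    owners = [owner for _, owner in ring]
    return positions, owners


def lookup(ring, key):
    # A key belongs to the first ring point after its hash (wrapping around).
    positions, owners = ring
    return owners[bisect(positions, _hash(key)) % len(owners)]


def moved_keys(servers, failed, keys):
    # Compare key ownership before and after the failed server is ejected.
    before = build_ring(servers)
    after = build_ring([s for s in servers if s != failed])
    moved = {}
    for key in keys:
        old, new = lookup(before, key), lookup(after, key)
        if old != new:
            moved[key] = (old, new)
    return moved


# Hypothetical pool of 18 shards and some sample keys.
servers = [f"shard{n:02d}" for n in range(1, 19)]
keys = [f"key{i}" for i in range(2000)]
moved = moved_keys(servers, "shard07", keys)
print(f"{len(moved)} of {len(keys)} keys changed server")
```

In this sketch only the keys that lived on the ejected server change owner. Their values are not copied anywhere, which matches the data loss described above: the surviving servers simply start with no data for those keys.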
Workaround: if a server is expected to be out for a longer period, redis should be reconfigured so that it again has a full set of 18 servers.
The file to change is hieradata/common/redis.yaml.
It contains a map of redis instances, called shards, to servers: 18 shards (shard01 ... shard18) mapped to 18 servers by IP address and port.
Find the shard that is on the failed server (search by IP) and substitute its IP address and port with a new working server/port combination. There could be more than one shard to change, for example if two servers have failed, combined with some bad luck.
Each server runs one redis shard on port 6379. The change in the YAML file will cause a second shard to be spun up on the substitute server. Port 6380 is the usual choice for accessing these failover shards.
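The substitution described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a tool that exists in the puppet repository: the dictionary layout (shard name mapped to "host" and "port") is a guess at how the data in redis.yaml might look once loaded, the reroute_shards function is hypothetical, and the IP addresses are documentation-only examples.

```python
def reroute_shards(shards, failed_host, new_host, first_failover_port=6380):
    """Return (updated_map, changed_shard_names) without modifying the input.

    Every shard located on failed_host is pointed at new_host, using the
    first free port starting at first_failover_port.
    """
    # Ports already taken on the substitute server (normally just 6379).
    used_ports = {t["port"] for t in shards.values() if t["host"] == new_host}
    updated, changed = {}, []
    port = first_failover_port
    for name in sorted(shards):
        target = shards[name]
        if target["host"] != failed_host:
            updated[name] = dict(target)
            continue
        # Skip ports that are already in use on the substitute server.
        while port in used_ports:
            port += 1
        updated[name] = {"host": new_host, "port": port}
        used_ports.add(port)
        changed.append(name)
    return updated, changed


# Hypothetical excerpt of the shard map (assumed layout, example IPs).
shards = {
    "shard01": {"host": "192.0.2.1", "port": 6379},
    "shard02": {"host": "192.0.2.2", "port": 6379},
    "shard03": {"host": "192.0.2.3", "port": 6379},
}
updated, changed = reroute_shards(shards, "192.0.2.2", "192.0.2.3")
for name in changed:
    print(name, "->", updated[name]["host"], updated[name]["port"])
```

Searching by host rather than by shard name mirrors the manual procedure (search by IP), and it catches the case where more than one shard points at the failed server.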
Replication of redis to mc2* servers
mc2* servers replicate redis from the mc1* servers, so that they are usable in case of a data center failover.
When a new shard is established on an existing server on port 6380, puppet updates the replica server as well.
So, for example, when mc1027 fails, mc2027 would be pointed to the new shard; see /etc/redis/replica/6379-state.conf.
Example tasks:
mc1027 down - T276415
Frequent "Nonce already used" errors in scripts and tools - T272319
Example reroute:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/668258/1/hieradata/common/redis.yaml