Etcd/Main cluster

The main etcd cluster is the Etcd cluster used as a state management system for the WMF production cluster. It is operated by SRE Service Ops under the etcd main cluster SLO.

Usage in production

More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below.

software | use | connection | interval | failure mode
pybal/LVS | retrieve LB pool server lists, weights, state | custom python/twisted, host only | watch | will keep working until restart
varnish/traffic | retrieve list of backend servers; retrieve VCL fragments (requestctl) | confd | watch | will keep working
gdnsd/auth dns | write admin state files for discovery.wmnet records | confd | watch | will keep working
scap/deployment | dsh lists | confd | 60 s | will keep working
MediaWiki | fetch some config variables | PHP connection, requests at intervals | 10 s | will keep working until restart
Icinga servers | update a local cache of the last modified index, used by other checks | cURL | 30 s | the checks will use stale data for comparison

In a failure, all systems will become unable to modify any configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.

Architecture

The main cluster is composed of two separate sub-clusters, "codfw.wmnet" and "eqiad.wmnet" (creatively named after the datacenters they are located in). They are not joined in a single RAFT consensus group but connected via replication, so that there is always a master cluster and a replica cluster.

Consistency

For reads that don't require sub-second consistency cluster-wide, reading from the replica cluster is acceptable. If replication breaks, this will page SREs, who will be able to correct the issue quickly enough (worst case, by pointing clients to the master datacenter). All writes should go to the master datacenter; we keep the replica cluster in read-only mode for remote clients to avoid issues.

Replication

Replication works using etcdmirror, a pretty bare-bones piece of software we wrote internally that replicates from one cluster to another, mangling key prefixes along the way. It is meant to provide, for etcd 2 clusters, the functionality that etcdctl make-mirror offers on etcd 3.

Etcdmirror runs on one machine of the replica cluster (see the profile::etcd::replication::active hiera key). It reads the etcd index to replicate from in /__replication/$destination_prefix (or, if $destination_prefix is the root of the replica cluster keyspace /, in /__replication/__ROOT__), issues a recursive watch request to the source cluster starting at the recorded index, and then recursively replicates every write that happens under $source_prefix in the source cluster.
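
For a quick look at where replication currently stands, you can read that pointer key over the v2 keys API on a replica-cluster member. A minimal sketch, assuming the whole keyspace is being replicated (so the pointer lives at /__replication/__ROOT__):
# On a replica-cluster member; port 2379 is the local, unauthenticated listener (see "Individual cluster configuration" below)
curl -s https://$(hostname -f):2379/v2/keys/__replication/__ROOT__
# The "value" field of the returned node is the last source-cluster index applied locally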

As of April 2024 (T358636), we're replicating nearly the entire keyspace (i.e., / to /), including /conftool (conftool state) and /spicerack (spicerack lock state). One notable exception is /spicerack/locks/etcd which contains short-lived python-etcd lock state that isn't meaningful outside of the source cluster, and is thus ignored by replication.

The logs produced by etcdmirror are pretty verbose, detailing each replication event and any errors encountered.

Recovering from replication failures

In the event that etcdmirror fails (indicated by the EtcdReplicationDown alert), it should be safe to try restarting the systemd unit if logs suggest a transient issue - e.g., connectivity to the source cluster.

However, etcdmirror is very strict when applying operations to the destination cluster and will fail as soon as any inconsistency is found (even just in the original value of a key) or if the lag is large enough that we're losing etcd events (i.e., when the latest event we've been able to replicate falls outside the 1000 event retention window at the source etcd cluster; see the note in this section of the etcd API docs).
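
To estimate how far behind replication is, you can compare that pointer with the source cluster's current index, which etcd v2 returns in the X-Etcd-Index header of any read. A sketch, with the first command run on a source-cluster member and the second on the replication host:
# On a SOURCE-cluster member: read any key and keep only the current-index header
curl -s -D - -o /dev/null https://$(hostname -f):2379/v2/keys/ | grep -i x-etcd-index
# On the replication host: the last source index applied locally
curl -s https://$(hostname -f):2379/v2/keys/__replication/__ROOT__
# A gap of more than ~1000 events means a plain restart cannot recover and a full reload is needed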

In such a case, you will need to do a full reload. To do that, launch etcdmirror with arguments identical to those used by the systemd unit, plus the --reload switch. There is a shell script in /usr/local/sbin on the replication host which does this for you (look for reload-etcdmirror). Once the reload is complete (look for "Starting replication at" in the logs), you can stop your manual invocation of etcdmirror and restart the systemd unit.

Beware: doing so will ERASE ALL DATA on the destination cluster, so do that with extreme caution.
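
A sketch of the full-reload sequence, with an illustrative unit name (check the actual unit and script names on the replication host before running anything):
# Note the exact ExecStart arguments of the replication unit (unit name is illustrative)
systemctl cat etcdmirror-conftool-eqiad-wmnet.service
sudo systemctl stop etcdmirror-conftool-eqiad-wmnet.service
# The wrapper script re-launches etcdmirror with the same arguments plus --reload
sudo /usr/local/sbin/reload-etcdmirror
# Once the logs show "Starting replication at", stop the manual invocation (Ctrl-C) and restart the unit
sudo systemctl start etcdmirror-conftool-eqiad-wmnet.service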

Individual cluster configuration

We decided to proxy external connections to etcd via an nginx proxy that handles TLS and HTTP authentication and should be fully compatible with etcd's own behaviour. The reasons are that etcd's builtin authentication imposes a severe performance hit, and that our TLS configuration for nginx is much better than what etcd itself offers. It also gives us the ability to switch the read-only status of a cluster on and off by flipping a switch in puppet; there is no known way to do this with the standard etcd mechanisms without actually removing users and/or roles, a slow process that is hard to automate/puppetize.

Instead, on every cluster member an etcd instance listens for client connections on https://$fqdn:2379 with no authentication, but is inaccessible to external connections (firewall rules), so local clients such as etcdmirror can write to it unauthenticated. At the same time, etcd advertises https://$fqdn:4001 as its client URL, which is where nginx listens for external connections and enforces authentication.

This can be surprising in the exceptional case where you need to modify keys directly in a cluster where nginx is enforcing read-only mode: even if you point etcdctl at the local listener with --endpoints https://$fqdn:2379, it will still use the advertised client URLs, and the writes will be rejected by nginx.

To work around this, you can use curl to issue the equivalent etcd v2 API calls against https://$fqdn:2379. Again, directly modifying keys is an exceptional operation, so consider getting your commands reviewed by a peer.
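
For example, a sketch of such calls (the key path and value below are purely illustrative, not a real conftool key):
# Read a key through the local, unauthenticated listener
curl -s https://$(hostname -f):2379/v2/keys/conftool/v1/some/illustrative/key
# Write it with a v2 API PUT; this bypasses the nginx read-only enforcement
curl -s -X PUT https://$(hostname -f):2379/v2/keys/conftool/v1/some/illustrative/key -d value='{"illustrative": true}'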

Operations

For the most part, you can refer to what is written in Etcd, but there are a few more operations regarding replication that are not covered there.

Master cluster switchover

From https://phabricator.wikimedia.org/T166552

Play-by-play:

These instructions assume the primary cluster is currently codfw and moving to eqiad. When moving in the opposite direction, swap the data centers accordingly in each step.

  1. Reduce the TTL for conftool SRV records to 10 seconds
  2. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
  3. Start read-only in the dc we're switching from  (https://gerrit.wikimedia.org/r/356138)
  4. sudo cumin A:conf-codfw 'run-puppet-agent' (begins read-only)
  5. Verify that etcd is read-only by attempting to depool a server with conftool (a conftool sketch follows this list); it should fail.
  6. To avoid being paged for etcdmirror replication delay, visit icinga and downtime the "Etcd replication lag #page" service.
  7. sudo cumin A:conf 'disable-puppet "etcd replication switchover"'
  8. Stop replication in the dc we're switching to (https://gerrit.wikimedia.org/r/#/c/356139)
  9. sudo cumin 'A:conf-eqiad' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
  10. Switch the conftool SRV record for read-write access to the dc we're switching to, updating the port if necessary (https://gerrit.wikimedia.org/r/#/c/356136/)
  11. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
  12. sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
  13. sudo cumin A:conf-codfw 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
  14. Set the dc we're switching to as read-write (https://gerrit.wikimedia.org/r/356341)
  15. sudo cumin A:conf-eqiad 'run-puppet-agent' (ends read-only)
  16. Verify that etcd is read-write again by depooling and repooling a server with conftool; this time it should succeed.
  17. Verify that etcdmirror is replicating correctly by tailing /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log in codfw; you should see updates corresponding to the depool and repool in the last step.
  18. Restore the TTL to 5 minutes (https://gerrit.wikimedia.org/r/#/c/356137/)
  19. On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
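
The conftool checks in steps 5 and 16 can look like the following sketch; the host and service selectors are illustrative, so pick a server you can safely depool:
# While the old master is read-only (step 5), this must fail
sudo confctl select 'name=mw1414.eqiad.wmnet,service=apache2' set/pooled=no
# After the new master is read-write (step 16), the same command should succeed; repool afterwards
sudo confctl select 'name=mw1414.eqiad.wmnet,service=apache2' set/pooled=yes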


Reimage cluster

Steps to reimage a conf cluster, host by host (in this case the conf2 cluster in codfw, using conf2004 as the example).

Be aware that this might not reflect current reality! LVS hosts might have changed, their role might have changed ... actually everything might have changed. So please use this as a starting template and DOUBLE CHECK EVERY STEP BEFORE YOU EVEN START!
  • Change SRV client record to point to the other cluster (gerrit)
  • Update authdns
ssh dns1004.wikimedia.org "sudo -i authdns-update"
  • Restart all confd instances and navtiming to pick up the new DNS records
# batched restart of confd 
sudo cumin -b 50 -s 20 'C:confd' 'systemctl restart confd' 
# and navtiming 
sudo cumin webperf2003.codfw.wmnet 'systemctl restart navtiming.service'
  • Make Pybal use the other cluster (gerrit)
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'run-puppet-agent' 
# check lvs config 
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'grep conf1 /etc/pybal/pybal.conf || true' 
 
# LOG TO SAL 
# restart pybal on secondaries 
sudo cumin 'lvs2014.codfw.wmnet,lvs5006.eqsin.wmnet,lvs4010.ulsfo.wmnet' 'systemctl restart pybal' 
 
# LOG TO SAL 
#restart pybal on primaries 
sudo cumin -b 1 -s 5 'lvs201[1-3].codfw.wmnet,lvs500[4-5].eqsin.wmnet,lvs400[8-9].ulsfo.wmnet' 'systemctl restart pybal'
  • Ensure nothing uses etcd on conf2*
    • Check /var/log/nginx/etc*access.log
    • Check ss -apn | grep 4001
  • Reimage
# reimage 
sudo cookbook sre.hosts.reimage --os bullseye -t T332010 conf2004
  • Delete and re-add the etcd member from the etcd cluster
# Get the member-id of conf2004 
etcdctl -C https://$(hostname -f):2379 member list 
# Using etcdctl does not work here because it will use tcp/4001 as the client port, which blocks writes to /v2/members
curl -X DELETE https://$(hostname -f):2379/v2/members/<MEMBER-ID> 
curl -X POST https://$(hostname -f):2379/v2/members -H "Content-Type: application/json" -d '{"peerURLs":["https://conf2004.codfw.wmnet:2380"]}'
  • Restart etcd on the reimaged host with ETCD_INITIAL_CLUSTER_STATE="existing"
systemctl stop etcd 
source /etc/default/etcd 
rm -rf ${ETCD_DATA_DIR}/* 
sed -i 's/ETCD_INITIAL_CLUSTER_STATE.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' /etc/default/etcd
systemctl start etcd 
run-puppet-agent
  • Ensure etcd and zookeeper are happy before going to the next one
echo ruok | nc localhost 2181; echo; echo stats | nc localhost 2181; echo; etcdctl -C https://$(hostname -f):2379 cluster-health

See also