Etcd/Main cluster
The main etcd cluster is the Etcd cluster used as a state management system for the WMF production cluster. It is operated by SRE Service Ops under the etcd main cluster SLO.
Usage in production
More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below.
software | use | connection | interval | failure mode |
---|---|---|---|---|
pybal/LVS | retrieve LB pools servers lists, weights, state | custom python/twisted, host only | watch | will keep working until restart |
varnish/traffic | retrieve list of backend servers; retrieve VCL fragments (requestctl) | confd (watch) | watch | will keep working |
gdnsd/auth dns | Write admin state files for discovery.wmnet records | confd | watch | will keep working |
scap/deployment | Dsh lists | confd | 60 s | will keep working |
MediaWiki | fetch some config variables | PHP connection, request at intervals | 10 s | will keep working until restart |
Icinga servers | Update a local cache of the last modified index to be used by other checks | cURL | 30 s | the checks will use stale data for comparison |
In a failure, all systems will become unable to modify any configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.
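The "will keep working" failure mode generally amounts to holding on to the last state that was fetched successfully. The sketch below illustrates that pattern only; it is not how confd, pybal or MediaWiki implement it, and refresh_cache and its arguments are hypothetical names.

```shell
# Hypothetical helper: run a fetch command and update a cache file only if
# the fetch succeeds. On failure the previous (stale) cache is left in place.
# Usage: refresh_cache CACHE_FILE COMMAND [ARGS...]
refresh_cache() (
    cache="$1"
    shift
    tmp="${cache}.tmp.$$"
    if "$@" > "$tmp" 2>/dev/null; then
        # Replace the cache in one step so readers never see partial data.
        mv "$tmp" "$cache"
    else
        # Fetch failed: discard the partial output, keep the old cache.
        rm -f "$tmp"
        false
    fi
)
```

A consumer would read only the cache file, so an etcd outage leaves it with stale but usable data until the next successful refresh.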
Architecture
The main cluster is composed of two separate sub-clusters: the "codfw.wmnet" and "eqiad.wmnet" ones (creatively named after the datacenters they're located in). They are not connected via RAFT consensus but via replication, so there is always a master cluster and a replica one.
Consistency
For reads that don't require sub-second consistency cluster-wide, reading from the replica cluster is acceptable. If replication breaks, this will page opsen, who will be able to correct the issue quickly enough (worst case scenario, by pointing clients to the master dc). All writes should go to the master datacenter; we ensure that the replica cluster is in read-only mode for remote clients to avoid issues.
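The read/write split described above can be summarised as a small routing rule. This is a sketch with hypothetical names (etcd_dc_for and its arguments); real clients pick their cluster through DNS SRV records and configuration rather than a helper like this.

```shell
# Hypothetical helper: print the datacenter whose cluster should serve an
# operation.
# Usage: etcd_dc_for OPERATION LOCAL_DC MASTER_DC
etcd_dc_for() (
    op="$1"
    local_dc="$2"
    master_dc="$3"
    case "$op" in
        # Writes must always go to the master cluster.
        write) echo "$master_dc" ;;
        # Reads that tolerate replication lag can use the local cluster.
        read)  echo "$local_dc" ;;
        *)     echo "unknown operation: $op" >&2; exit 2 ;;
    esac
)
```

For example, with eqiad as master, a client in codfw would get codfw for read and eqiad for write.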
Replication
Replication works using etcdmirror - a fairly rough piece of software we wrote internally that replicates from one cluster to another while rewriting key prefixes. It is meant to give etcd 2 clusters the functionality that etcdctl make-mirror provides on etcd 3.
Etcdmirror runs from one machine on the replica cluster (see the profile::etcd::replication::active hiera key). It reads the etcd index to replicate from in /__replication/$destination_prefix (or, if $destination_prefix is the root of the replica cluster keyspace /, in /__replication/__ROOT__), issues a recursive watch request to the source cluster starting at the recorded index, and then recursively replicates every write that happens under $source_prefix in the source cluster.
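The naming rule for the replication index key can be written down as a small helper. This is a sketch based only on the description above, not on the etcdmirror source; replication_index_key is a hypothetical name, and the handling of the leading slash is an assumption.

```shell
# Hypothetical helper: print the key under which the replication index for a
# given destination prefix is stored.
# Usage: replication_index_key DESTINATION_PREFIX
replication_index_key() (
    prefix="$1"
    if [ "$prefix" = "/" ]; then
        # The keyspace root gets a dedicated placeholder name.
        echo "/__replication/__ROOT__"
    else
        # Assumption: strip the leading slash to avoid a double slash.
        echo "/__replication/${prefix#/}"
    fi
)
```

On the replication host, the stored index could then be inspected with something like curl -s "https://$(hostname -f):2379/v2/keys$(replication_index_key /)".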
As of April 2024 (T358636), we're replicating nearly the entire keyspace (i.e., / to /), including /conftool (conftool state) and /spicerack (spicerack lock state). One notable exception is /spicerack/locks/etcd, which contains short-lived python-etcd lock state that isn't meaningful outside of the source cluster, and is thus ignored by replication.
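To make the exception concrete, here is a sketch of the resulting rule as a prefix match. How etcdmirror actually expresses the exclusion (prefix, regex or otherwise) is not stated here, so treat the matching logic and the is_replicated name as assumptions.

```shell
# Hypothetical helper: succeed if a key would be replicated, fail if it falls
# under the ignored lock prefix.
# Usage: is_replicated KEY
is_replicated() {
    case "$1" in
        # The lock directory itself and everything below it are skipped.
        /spicerack/locks/etcd|/spicerack/locks/etcd/*) return 1 ;;
        *) return 0 ;;
    esac
}
```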
The logs produced by etcdmirror are pretty verbose, detailing each replication event and any errors encountered should anything go wrong.
Recovering from replication failures
In the event that etcdmirror fails (indicated by the EtcdReplicationDown alert), it should be safe to try restarting the systemd unit if logs suggest a transient issue - e.g., connectivity to the source cluster.
However, etcdmirror is very strict when applying operations to the destination cluster and will fail as soon as any inconsistency is found (even just in the original value of a key) or if the lag is large enough that we're losing etcd events (i.e., when the latest event we've been able to replicate falls outside the 1000 event retention window at the source etcd cluster; see the note in this section of the etcd API docs).
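The retention-window condition can be approximated by comparing the last replicated index with the current index of the source cluster (etcd v2 reports it in the X-Etcd-Index response header). The helper below is a rough sketch under that assumption; needs_full_reload is a hypothetical name and the comparison is an approximation, since the authoritative signal is etcdmirror itself failing.

```shell
# Hypothetical helper: succeed if the replica has fallen further behind than
# the source cluster's event retention window.
# Usage: needs_full_reload LAST_REPLICATED_INDEX SOURCE_INDEX [WINDOW]
needs_full_reload() (
    last="$1"
    current="$2"
    # Default to the 1000-event window mentioned above.
    window="${3:-1000}"
    [ $((current - last)) -gt "$window" ]
)
```

For example, needs_full_reload 100 2000 succeeds (a reload is likely needed), while needs_full_reload 1500 2000 fails (a plain restart may be enough).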
In such a case, you will need to do a full reload. To do that, you need to launch etcdmirror with identical arguments to those used by the systemd unit, but adding the --reload switch. There is a shell script available in /usr/local/sbin on the replication host, which does this for you (look for reload-etdmirror). Once the reload is complete (look for "Starting replication at" in the logs), you can stop your manual invocation of etcdmirror and restart the systemd unit.
Individual cluster configuration
We decided to proxy external connections to etcd via an nginx proxy that handles TLS and HTTP authentication and should be fully compliant with etcd's own behaviour. The reason for this is that the builtin authentication gives a severe performance hit to etcd, and that our TLS configuration for nginx is much better than what etcd itself offers. It also gives us the ability to switch on/off the read-only status of a cluster by flipping a switch in puppet. I don't know of any way to do this with the standard etcd mechanism without actually removing users and/or roles, a slow process that is hard to automate/puppetize.
Instead, what happens is that on every cluster member we have an etcd instance listening for client connections on https://$fqdn:2379 with no authentication, but inaccessible to external connections (firewall rules). So local clients, such as etcdmirror, can write to it unauthenticated. At the same time, etcd advertises https://$fqdn:4001 as its client URL, which is also where nginx is listening for external connections and enforces authentication as well.
Note that even if you run etcdctl --endpoints https://$fqdn:2379, it will still use the advertised URLs and writes will be rejected by nginx. To work around this, you can use curl to issue the equivalent etcd v2 API calls against https://$fqdn:2379. Again, directly modifying keys is an exceptional operation, so consider getting your commands reviewed by a peer.
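As an illustration of the curl workaround, the sketch below wraps a v2 key write against the local port. The etcd_local_set name and the CURL override are hypothetical, and it assumes the host trusts the CA that signed the etcd certificate. Setting CURL=echo prints the arguments instead of sending the request, which is a convenient way to get the command reviewed first.

```shell
# Hypothetical helper: set KEY to VALUE through the local, unauthenticated
# etcd client port using the etcd v2 keys API.
# Usage: etcd_local_set HOST_FQDN KEY VALUE
etcd_local_set() {
    # CURL can be overridden (e.g. CURL=echo) to do a dry run.
    ${CURL:-curl} --silent --show-error --request PUT \
        "https://$1:2379/v2/keys$2" \
        --data-urlencode "value=$3"
}
```

This is meant to be run on the cluster member itself, since port 2379 is firewalled from external clients.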
Operations
For the most part, you can refer to what is written in Etcd, but there are a few more operations regarding replication that are not covered there.
Master cluster switchover
From https://phabricator.wikimedia.org/T166552
Play-by-play:
These instructions assume the primary cluster is currently codfw and moving to eqiad. When moving in the opposite direction, swap the data centers accordingly in each step.
- Reduce the TTL for conftool SRV records to 10 seconds
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
- Start read-only in the dc we're switching from (https://gerrit.wikimedia.org/r/356138)
- sudo cumin A:conf-codfw 'run-puppet-agent' (begins read-only)
- Verify that etcd is read-only by attempting to depool a server with conftool; it should fail.
- To avoid being paged for etcdmirror replication delay, visit icinga and downtime the "Etcd replication lag #page" service.
- sudo cumin A:conf 'disable-puppet "etcd replication switchover"'
- Stop replication in the dc we're switching to (https://gerrit.wikimedia.org/r/#/c/356139)
- sudo cumin 'A:conf-eqiad' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
- Switch the conftool SRV record for read-write access to the dc we're switching to, updating the port if necessary (https://gerrit.wikimedia.org/r/#/c/356136/)
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
- sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
- sudo cumin A:conf-codfw 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
- Set the dc we're switching to as read-write (https://gerrit.wikimedia.org/r/356341)
- sudo cumin A:conf-eqiad 'run-puppet-agent' (ends read-only)
- Verify that etcd is read-write again by depooling and repooling a server with conftool; this time it should succeed.
- Verify that etcdmirror is replicating correctly by tailing /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log in codfw; you should see updates corresponding to the depool and repool in the last step.
- Restore the TTL to 5 minutes https://gerrit.wikimedia.org/r/#/c/356137/
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
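After each authdns-update in the steps above, it can help to confirm what the SRV record resolves to. The sketch below turns the output of dig +short SRV (lines of the form "priority weight port target.") into host:port pairs; srv_to_endpoints is a hypothetical name, and the SRV record name is deliberately left as a placeholder.

```shell
# Hypothetical helper: read `dig +short SRV` output on stdin and print one
# host:port pair per record.
srv_to_endpoints() {
    # Fields are: priority weight port target (with a trailing dot).
    awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}
```

Usage would look like dig +short SRV "$SRV_NAME" | srv_to_endpoints, where SRV_NAME is the conftool SRV record being switched.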
Reimage cluster
Steps to reimage a conf cluster host by host (this example uses conf2004 in the codfw conf2 cluster).
- Change the SRV client record to point to the other cluster (gerrit)
- Update authdns
ssh dns1004.wikimedia.org "sudo -i authdns-update"
- Restart all confd instances and navtiming to pick up the new DNS records
# batched restart of confd
sudo cumin -b 50 -s 20 'C:confd' 'systemctl restart confd'
# and navtiming
sudo cumin webperf2003.codfw.wmnet 'systemctl restart navtiming.service'
- Make Pybal use the other cluster (gerrit)
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'run-puppet-agent'
# check lvs config
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'grep conf1 /etc/pybal/pybal.conf || true'
# LOG TO SAL
# restart pybal on secondaries
sudo cumin 'lvs2014.codfw.wmnet,lvs5006.eqsin.wmnet,lvs4010.ulsfo.wmnet' 'systemctl restart pybal'
# LOG TO SAL
# restart pybal on primaries
sudo cumin -b 1 -s 5 'lvs201[1-3].codfw.wmnet,lvs500[4-5].eqsin.wmnet,lvs400[8-9].ulsfo.wmnet' 'systemctl restart pybal'
- Ensure nothing uses etcd on conf2*
- Check /var/log/nginx/etc*access.log
- Check ss -apn | grep 4001
- Reimage
# reimage
sudo cookbook sre.hosts.reimage --os bullseye -t T332010 conf2004
- Delete and re-add the etcd member from the etcd cluster
# Get the member-id of conf2004
etcdctl -C https://$(hostname -f):2379 member list
# Using etcdctl does not work here: it uses tcp/4001 as the client port, where writes to /v2/members are blocked
curl -X DELETE https://$(hostname -f):2379/v2/members/<MEMBER-ID>
curl -X POST https://$(hostname -f):2379/v2/members -H "Content-Type: application/json" -d '{"peerURLs":["https://conf2004.codfw.wmnet:2380"]}'
- Restart etcd on the reimaged host with ETCD_INITIAL_CLUSTER_STATE="existing"
systemctl stop etcd
source /etc/default/etcd
# abort instead of deleting from / if ETCD_DATA_DIR is unset or empty
rm -rf "${ETCD_DATA_DIR:?}"/*
sed -i 's/ETCD_INITIAL_CLUSTER_STATE.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' /etc/default/etcd
systemctl start etcd
run-puppet-agent
- Ensure etcd and zookeeper are happy before going to the next one
echo ruok | nc localhost 2181; echo; echo stats | nc localhost 2181; echo; etcdctl -C https://$(hostname -f):2379 cluster-health
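For the member removal step above, the member ID can be pulled out of the member list output instead of being copied by hand. This sketch assumes the etcdctl v2 output format, where each line looks like "ID: name=... peerURLs=... clientURLs=... isLeader=...", and a single peer URL per member; member_id_for_peer is a hypothetical name.

```shell
# Hypothetical helper: read `etcdctl member list` (v2) output on stdin and
# print the ID of the member advertising the given peer URL.
# Usage: member_id_for_peer PEER_URL
member_id_for_peer() {
    awk -v peer="peerURLs=$1" '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == peer) {
                    # The first field is the member ID followed by a colon.
                    sub(/:$/, "", $1)
                    print $1
                }
            }
        }
    '
}
```

Usage would look like etcdctl -C https://$(hostname -f):2379 member list | member_id_for_peer https://conf2004.codfw.wmnet:2380, with the result substituted for <MEMBER-ID> in the DELETE call.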