Etcd/Main cluster
The main etcd cluster is the etcd cluster used as a state management system for the WMF production cluster. It is operated by SRE Service Ops under the etcd main cluster SLO.
Usage in production
More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below.
| software | use | connection | interval | failure mode |
|---|---|---|---|---|
| pybal/LVS | retrieve LB pools servers lists, weights, state | custom python/twisted, host only | watch | will keep working until restart |
| varnish/traffic | retrieve list of backend servers; retrieve VCL fragments (requestctl) | confd (watch) | watch | will keep working |
| gdnsd/auth dns | write admin state files for discovery.wmnet records | confd | watch | will keep working |
| scap/deployment | dsh lists | confd | 60 s | will keep working |
| MediaWiki | fetch some config variables | PHP connection, request at intervals | 10 s | will keep working until restart |
| Icinga servers | update a local cache of the last modified index to be used by other checks | cURL | 30 s | the checks will use stale data for comparison |
In case of an etcd failure, all of these systems become unable to modify the configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.
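As an illustration of what these consumers do under the hood, here is a minimal sketch of a plain read and a recursive watch against the etcd v2 HTTP API, of the kind confd or pybal issue (the hostname is one of the conf hosts, the key path is illustrative, authentication is omitted):
# one-off recursive read of the current state under /conftool
curl -s 'https://conf1001.eqiad.wmnet:2379/v2/keys/conftool?recursive=true'
# watch: blocks until the next change under /conftool, then returns that event
curl -s 'https://conf1001.eqiad.wmnet:2379/v2/keys/conftool?wait=true&recursive=true'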
Architecture
The main cluster is composed of two separate sub-clusters, "codfw.wmnet" and "eqiad.wmnet" (creatively named after the datacenters they're located in). These are not connected via Raft consensus but via replication, so that there is always a master cluster and a slave one.
Consistency
For reads that don't require sub-second consistency cluster-wide, reading from the slave cluster is acceptable. If replication breaks, this will page SRE, who will be able to correct the issue quickly enough (worst case, by pointing clients to the master DC). All writes should go to the master datacenter; to avoid issues, we ensure that the slave cluster is in read-only mode for remote clients.
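As a sketch of the difference (hostnames, key and credentials are placeholders; MASTER and SLAVE stand for whichever sub-cluster currently holds each role):
# quorum read against the master cluster: linearizable within that cluster
curl -s "https://${MASTER}:2379/v2/keys/conftool/some/key?quorum=true"
# plain read against the slave cluster: acceptable for consumers that tolerate replication lag
curl -s "https://${SLAVE}:2379/v2/keys/conftool/some/key"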
Replication
Replication works using etcdmirror, a fairly bare-bones piece of software we wrote internally that replicates from one cluster to another while rewriting key prefixes. It is meant to provide, for etcd 2 clusters, the functionality that etcdctl make-mirror offers on etcd 3.
Etcdmirror runs from one machine on the slave cluster; it reads the etcd index to replicate from in /__replication/$destination_prefix, issues a recursive watch request to the source cluster starting at the recorded index, and then recursively replicates every write that happens under $source_prefix in the source cluster. Since we're (at the moment) only interested in the /conftool directory, that's what we're replicating between the two clusters. Logs from the application are usually pretty telling about what is going wrong.
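In terms of the etcd v2 HTTP API, the loop looks roughly like the following sketch for the /conftool prefix (hostnames and NEXT_INDEX are placeholders, authentication omitted):
# 1. read the last replicated index stored on the destination (slave) cluster
curl -s "https://${SLAVE}:2379/v2/keys/__replication/conftool"
# 2. watch the source (master) cluster for the next write under /conftool after that index
curl -s "https://${MASTER}:2379/v2/keys/conftool?wait=true&recursive=true&waitIndex=${NEXT_INDEX}"
# 3. replay the returned write on the slave, update /__replication/conftool, repeat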
The replica daemon is very strict and will fail as soon as any inconsistency is found (even just in the original value of a key) or if the lag is large enough that we're losing etcd events. In such a case, you will need to do a full reload; to do that, launch etcdmirror with the --reload switch. Beware: doing so will ERASE ALL DATA on the destination cluster, so do that with extreme caution.
Individual cluster configuration
We decided to proxy external connections to etcd via an nginx proxy that handles TLS and HTTP authentication and should be fully compatible with etcd's own behaviour. The reasons for this are that etcd's builtin authentication imposes a severe performance hit, and that our TLS configuration for nginx is much better than what etcd itself offers.

It also gives us the ability to switch the read-only status of a cluster on and off by flipping a switch in puppet. I don't know of any way to do this with the standard etcd mechanisms without actually removing users and/or roles, a slow process that is hard to automate/puppetize.

So what happens is that on every host we have an etcd instance listening for client connections on http://127.0.0.1:2378 with no authentication, so local clients can write to it unauthenticated. It does, however, advertise https://$fqdn:2379 as its client URL; that is where nginx listens for external connections and enforces authentication.
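In practice, on a conf host the two endpoints behave like this (the user and password are placeholders for the credentials managed in puppet):
# local clients: plain HTTP straight to etcd, no authentication
curl -s http://127.0.0.1:2378/v2/keys/conftool
# external clients: TLS and HTTP basic auth terminated by nginx, which also
# refuses writes whenever the cluster is flagged read-only in puppet
curl -s --user 'someuser:somepassword' "https://$(hostname -f):2379/v2/keys/conftool"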
Operations
For the most part, you can refer to what is written in Etcd, but there are a few additional operations regarding replication that are not covered there.
Master cluster switchover
From https://phabricator.wikimedia.org/T166552
Play-by-play:
These instructions assume the primary cluster is currently codfw and moving to eqiad. When moving in the opposite direction, swap the data centers accordingly in each step.
- Reduce the TTL for conftool SRV records to 10 seconds
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
- Set the dc we're switching from to read-only (https://gerrit.wikimedia.org/r/356138)
- sudo cumin A:conf-codfw 'run-puppet-agent' (begins read-only)
- Verify that etcd is read-only by attempting to depool a server with conftool; it should fail.
- To avoid being paged for etcdmirror replication delay, visit icinga and downtime the "Etcd replication lag #page" service.
- sudo cumin A:conf 'disable-puppet "etcd replication switchover"'
- Stop replication in the dc we're switching to (https://gerrit.wikimedia.org/r/#/c/356139)
- sudo cumin 'A:conf-eqiad' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
- Switch the conftool SRV record for read-write access to the dc we're switching to, updating the port if necessary (https://gerrit.wikimedia.org/r/#/c/356136/)
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
- sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
- sudo cumin A:conf-codfw 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
- Set the dc we're switching to as read-write (https://gerrit.wikimedia.org/r/356341)
- sudo cumin A:conf-eqiad 'run-puppet-agent' (ends read-only)
- Verify that etcd is read-write again by depooling and repooling a server with conftool; this time it should succeed (see the verification sketch after this list).
- Verify that etcdmirror is replicating correctly by tailing /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log in codfw; you should see updates corresponding to the depool and repool in the last step.
- Restore the TTL to 5 minutes https://gerrit.wikimedia.org/r/#/c/356137/
- On authdns1001.eqiad.wmnet (or any authdns machine), sudo authdns-update
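A minimal sketch of the verification steps above, assuming conftool's confctl CLI and a placeholder hostname for the server to depool:
# should fail while the old master is still read-only, succeed once the new master is read-write
confctl select 'name=mw1234.eqiad.wmnet' set/pooled=no
confctl select 'name=mw1234.eqiad.wmnet' set/pooled=yes
# on the new slave side (codfw in this example), confirm etcdmirror is replaying those writes
tail -f /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log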
Reimage cluster
Steps to reimage a conf cluster host by host (in this case the conf2 cluster in codfw, using conf2004 as the example).
- Change the SRV client record to point to the other cluster (gerrit)
- Update authdns
ssh dns1004.wikimedia.org "sudo -i authdns-update"
- Restart all confd instances and navtiming to pick up the new DNS records
# batched restart of confd
sudo cumin -b 50 -s 20 'C:confd' 'systemctl restart confd'
# and navtiming
sudo cumin webperf2003.codfw.wmnet 'systemctl restart navtiming.service'
- Make Pybal use the other cluster (gerrit)
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'run-puppet-agent'
# check lvs config
sudo cumin 'P{O:lvs::balancer} and (A:codfw or A:eqsin or A:ulsfo)' 'grep conf1 /etc/pybal/pybal.conf || true'
# LOG TO SAL
# restart pybal on secondaries
sudo cumin 'lvs2014.codfw.wmnet,lvs5006.eqsin.wmnet,lvs4010.ulsfo.wmnet' 'systemctl restart pybal'
# LOG TO SAL
# restart pybal on primaries
sudo cumin -b 1 -s 5 'lvs201[1-3].codfw.wmnet,lvs500[4-5].eqsin.wmnet,lvs400[8-9].ulsfo.wmnet' 'systemctl restart pybal'
- Ensure nothing uses etcd on conf2*
- Check /var/log/nginx/etc*access.log for remaining client requests
- Check ss -apn | grep 4001 for established client connections
- Reimage
# reimage
sudo cookbook sre.hosts.reimage --os bullseye -t T332010 conf2004
- Delete and re-add the etcd member from the etcd cluster
# Get the member-id of conf2004
etcdctl -C https://$(hostname -f):2379 member list
# Using etcdctl does not work here because it would use tcp/4001 as the client port, which blocks writing to /v2/members
curl -X DELETE https://$(hostname -f):2379/v2/members/<MEMBER-ID>
curl -X POST https://$(hostname -f):2379/v2/members -H "Content-Type: application/json" -d '{"peerURLs":["https://conf2004.codfw.wmnet:2380"]}'
- Restart etcd on the reimaged host with ETCD_INITIAL_CLUSTER_STATE="existing"
systemctl stop etcd
source /etc/default/etcd
rm -rf ${ETCD_DATA_DIR}/*
sed -i 's/ETCD_INITIAL_CLUSTER_STATE.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' /etc/default/etcd
systemctl start etcd
run-puppet-agent
- Ensure etcd and zookeeper are happy before going to the next one
# zookeeper health
echo ruok | nc localhost 2181; echo
echo stats | nc localhost 2181; echo
# etcd cluster health
etcdctl -C https://$(hostname -f):2379 cluster-health