The main etcd cluster is the cluster used as a state management system for the WMF production cluster.
Usage in production
More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below:
{| class="wikitable"
! System !! Use !! Client !! Update interval !! Behaviour on etcd failure
|-
| pybal/LVS || retrieve LB pool server lists, weights, state || custom python/twisted, host only || watch || will keep working until restart
|-
| varnish/traffic || retrieve list of backend servers || confd (watch) || watch || will keep working
|-
| gdnsd/auth dns || write admin state files for discovery.wmnet records || confd || watch || will keep working
|-
| scap/deployment || dsh lists || confd || 60 s || will keep working
|-
| redis || replica configuration (changes NOT applied) || confd || 60 s || will keep working
|-
| parsoid || use of http or https to connect to the MW API || confd || 60 s || will keep working
|-
| MediaWiki || fetch some config variables || PHP connection, requests at intervals || 10 s || will keep working until restart
|-
| Icinga servers || update a local cache of the last modified index, used by other checks || cURL || 30 s || the checks will use stale data for comparison
|}
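The Icinga check in the last row only needs the cluster's last modified index, which the etcd v2 HTTP API reports in the `X-Etcd-Index` header of every response, so a cheap cURL request is enough to refresh the cache. A minimal sketch of the comparison logic (the function name and cache shape are hypothetical):

```python
def is_stale(cached_index, response_headers):
    """Compare a locally cached etcd index against the X-Etcd-Index
    header that every etcd v2 HTTP response carries."""
    current = int(response_headers["X-Etcd-Index"])
    # The index increases monotonically; a larger value means writes
    # have happened since the cache was last refreshed.
    return current > cached_index

# Example with a fake header set, as a real check would receive from curl:
print(is_stale(41, {"X-Etcd-Index": "42"}))  # True: the cache lags behind
```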
In case of a failure, all of these systems will become unable to modify the configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.
The main cluster is composed of two separate sub-clusters, "codfw.wmnet" and "eqiad.wmnet" (creatively named after the datacenters they are located in). They are not joined via Raft consensus but via replication, so there is always a master cluster and a slave one.
For reads that don't require sub-second consistency cluster-wide, reading from the slave cluster is acceptable. If replication breaks, opsens will be paged and should be able to correct the issue quickly enough (worst-case scenario, by pointing clients to the master DC). All writes must go to the master datacenter; we keep the slave cluster in read-only mode for remote clients to avoid issues.
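In the etcd v2 HTTP API the trade-off between a cheap, possibly stale read and a fully consistent one is expressed with the `quorum` query parameter. A minimal sketch of how a client could target either cluster (the helper name is hypothetical; the hostname is one of the conf hosts mentioned on this page):

```python
def read_url(host, key, quorum=False):
    """Build an etcd v2 read URL for `key` on `host`.

    quorum=False: the member answers from its local state, which is fine
    for the slave cluster when sub-second consistency is not required.
    quorum=True: the read goes through consensus, for the master cluster.
    """
    url = f"https://{host}:2379/v2/keys{key}"
    if quorum:
        url += "?quorum=true"
    return url

# Possibly-stale read against the slave cluster:
print(read_url("conf2002.codfw.wmnet", "/conftool/v1/pools"))
```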
Replication works using etcdmirror, a pretty raw piece of software we wrote internally that replicates from one cluster to another while mangling key prefixes. It is meant to offer, for etcd 2 clusters, the functionality that etcdctl make-mirror provides on etcd 3.
Etcdmirror runs on one machine in the slave cluster. It reads the etcd index to replicate from in
/__replication/$destination_prefix , issues a recursive watch request to the source cluster starting at the recorded index, and then replicates every write that happens under
$source_prefix in the source cluster. Since we're (at the moment) only interested in the
/conftool directory, that's what we're replicating between the two clusters. Logs from the application are usually pretty telling about what is going wrong.
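The heart of such a replicator is simple key-prefix mangling: every event seen by the watch under the source prefix is rewritten and re-applied under the destination prefix. A sketch of that rewrite, not etcdmirror's actual code (the function name is hypothetical; in production both prefixes happen to be /conftool):

```python
def mangle_key(key, source_prefix, destination_prefix):
    """Rewrite a key seen under source_prefix on the source cluster into
    the corresponding key under destination_prefix on the destination."""
    if not key.startswith(source_prefix):
        raise ValueError(f"{key} is outside the replicated prefix")
    return destination_prefix + key[len(source_prefix):]

# A write under /conftool on the master is replayed verbatim on the slave:
print(mangle_key("/conftool/v1/pools", "/conftool", "/conftool"))
```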
The replica daemon is very strict: it will fail as soon as any inconsistency is found (even just in the original value of a key), or if the lag is large enough that we are losing etcd events. In such a case you will need to do a full reload; to do that, launch etcdmirror with the
--reload switch. Beware: doing so will ERASE ALL DATA on the destination cluster, so use it with extreme caution.
Individual cluster configuration
We decided to proxy external connections to etcd via an nginx proxy that handles TLS and HTTP authentication, and should be fully compliant with etcd's own behaviour. The reason for this is twofold: etcd's built-in authentication imposes a severe performance penalty, and our TLS configuration for nginx is much better than what etcd itself offers. It also gives us the ability to switch the read-only status of a cluster on and off by flipping a switch in puppet; I don't know of any way to do this with the standard etcd mechanisms without actually removing users and/or roles, a slow process that is hard to automate/puppetize.

Concretely, on every host an etcd instance listens for client connections on http://127.0.0.1:2378 with no authentication, so local clients can write to it unauthenticated. It does, however, advertise https://$fqdn:2379 as its client URL; that is where nginx listens for external connections and enforces authentication.
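A heavily simplified sketch of what such a proxy could look like (this is NOT the actual puppetized configuration; the certificate paths, auth file and the commented-out read-only toggle are illustrative):

```nginx
server {
    # External, authenticated TLS endpoint: the advertised client URL.
    listen 2379 ssl;
    ssl_certificate     /etc/ssl/localcerts/etcd.crt;      # illustrative path
    ssl_certificate_key /etc/ssl/private/etcd.key;         # illustrative path

    auth_basic           "etcd";
    auth_basic_user_file /etc/nginx/etcd.htpasswd;         # illustrative path

    location / {
        # Read-only mode (slave cluster): puppet could flip a block like
        # this on to reject anything but reads from remote clients.
        # limit_except GET HEAD { deny all; }

        # The local etcd listens unauthenticated on loopback only.
        proxy_pass http://127.0.0.1:2378;
    }
}
```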
For the most part, you can refer to what is written in Etcd, but there are a few more operations regarding replication that are not covered there.
Master cluster switchover
- Merge https://gerrit.wikimedia.org/r/356138
- sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent' (begins read-only)
- sudo cumin 'R:class = role::configcluster' 'disable-puppet "etcd replication switchover"'
- Merge https://gerrit.wikimedia.org/r/#/c/356139
- sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
- Merge https://gerrit.wikimedia.org/r/#/c/356136/ and update dns
- sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
- sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
- Merge https://gerrit.wikimedia.org/r/356341
- sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent' (ends read-only)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/356137/
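The switch_replica.py step above records, on the codfw side, the index from which replication must resume, under the /__replication/$destination_prefix key described earlier. That script is not reproduced here, but the etcd v2 write it boils down to is conceptually just this (the helper name, port and payload shape are assumptions, not the script's actual code):

```python
def replication_index_write(replica_host, prefix, index):
    """Build the etcd v2 PUT that records the index from which
    replication resumes, under /__replication/$destination_prefix."""
    url = f"https://{replica_host}:2379/v2/keys/__replication/{prefix.strip('/')}"
    # etcd v2 expects the new value as form data under the "value" field.
    return url, {"value": str(index)}

# E.g. resume replicating /conftool on codfw from index 123456:
url, payload = replication_index_write("conf2002.codfw.wmnet", "/conftool", 123456)
print(url)
print(payload)
```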