Etcd/Main cluster

The main etcd cluster is used as a state management system for the WMF production infrastructure.

Usage in production

More and more systems depend on etcd for retrieving state information. All current uses are listed in the table below.

software | use | connection | interval | failure mode
pybal/LVS | retrieve LB pool server lists, weights and state | custom python/twisted, host only | watch | will keep working until restart
varnish/traffic | retrieve list of backend servers | confd | watch | will keep working
gdnsd/auth dns | write admin state files for discovery.wmnet records | confd | watch | will keep working
scap/deployment | dsh lists | confd | 60 s | will keep working
redis | replica configuration (changes NOT applied) | confd | 60 s | will keep working
parsoid | use of http or https to connect to the MW API | confd | 60 s | will keep working
MediaWiki | fetch some config variables | PHP connection, request at intervals | 10 s | will keep working until restart
Icinga servers | update a local cache of the last modified index, used by other checks | cURL | 30 s | the checks will use stale data for comparison

In a failure, all systems will become unable to modify any configuration they derive from etcd, but they will keep working. Only a subset of them will survive a service restart, though.
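Most of these consumers either keep a long-poll watch open against etcd or re-read their keys on a fixed interval, which is what the "interval" column above refers to. As a rough illustration of the difference (this is not the actual code of any of those clients), here is a minimal sketch against the etcd v2 HTTP API using the requests library; it talks to the local, unauthenticated endpoint described under "Individual cluster configuration" below, and the key path is a placeholder.

 import requests

 ETCD = "http://127.0.0.1:2378"   # local, unauthenticated endpoint (see below)
 KEY = "/v2/keys/conftool"        # placeholder key path

 # Polling, as the clients with a 10/30/60 s interval effectively do:
 # re-read the key on a timer and act on whatever comes back.
 def poll_once():
     resp = requests.get(f"{ETCD}{KEY}", params={"recursive": "true"})
     resp.raise_for_status()
     return resp.json()

 # Watching, as pybal and the confd-based clients do: block on a long-poll
 # request that only returns once something changes after wait_index, so
 # changes are picked up almost immediately.
 def watch_once(wait_index):
     resp = requests.get(
         f"{ETCD}{KEY}",
         params={"wait": "true", "recursive": "true", "waitIndex": wait_index},
         timeout=None,
     )
     resp.raise_for_status()
     return resp.json()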

Architecture

The main cluster is composed of two separate sub-clusters, "codfw.wmnet" and "eqiad.wmnet" (creatively named after the datacenters they are located in). They are not joined in a single RAFT consensus group; instead they are connected via replication, so that at any time one cluster is the master and the other the slave.

Consistency

For reads that don't require sub-second cluster-wide consistency, reading from the slave cluster is acceptable. If replication breaks, this will page opsens, who will be able to correct the issue quickly enough (worst case, by pointing clients to the master datacenter). All writes should go to the master datacenter; we keep the slave cluster in read-only mode for remote clients to avoid issues.

Replication

Replication works using etcdmirror, a fairly bare-bones piece of software we wrote internally that replicates from one cluster to another while rewriting key prefixes. It is meant to provide, for etcd 2 clusters, the functionality that etcdctl make-mirror offers on etcd 3.

Etcdmirror runs on one machine in the slave cluster. It reads the etcd index to replicate from out of /__replication/$destination_prefix, issues a recursive watch request to the source cluster starting at the recorded index, and then recursively replicates every write that happens under $source_prefix in the source cluster. Since we are (at the moment) only interested in the /conftool directory, that is what we replicate between the two clusters. The application's logs are usually quite telling about what is going wrong.
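As a rough illustration of that loop (this is not the actual etcdmirror code), here is a minimal sketch against the etcd v2 HTTP API; endpoints and prefixes are placeholders, error handling is omitted, and authentication against the nginx proxy described below is left out.

 import requests

 SOURCE = "https://conf1001.eqiad.wmnet:2379"   # master cluster (placeholder; needs auth)
 DEST = "http://127.0.0.1:2378"                 # local, unauthenticated etcd on the slave
 SRC_PREFIX = "/conftool"
 DST_PREFIX = "/conftool"

 def last_replicated_index():
     # The index replicated so far is stored under /__replication/<destination prefix>.
     r = requests.get(f"{DEST}/v2/keys/__replication{DST_PREFIX}")
     r.raise_for_status()
     return int(r.json()["node"]["value"])

 def replicate_forever():
     index = last_replicated_index()
     while True:
         # Long-poll for the next change under the source prefix.
         r = requests.get(
             f"{SOURCE}/v2/keys{SRC_PREFIX}",
             params={"wait": "true", "recursive": "true", "waitIndex": index + 1},
             timeout=None,
         )
         r.raise_for_status()
         event = r.json()
         node = event["node"]
         index = node["modifiedIndex"]
         dest_key = DST_PREFIX + node["key"][len(SRC_PREFIX):]
         if event["action"] in ("delete", "expire"):
             requests.delete(f"{DEST}/v2/keys{dest_key}")
         else:
             # (directory creation events are ignored here for brevity)
             requests.put(f"{DEST}/v2/keys{dest_key}", data={"value": node["value"]})
         # Persist the index so replication can resume from here after a restart.
         requests.put(
             f"{DEST}/v2/keys/__replication{DST_PREFIX}",
             data={"value": str(index)},
         )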

The replication daemon is very strict and will fail as soon as any inconsistency is found (even just in the original value of a key), or if the lag is large enough that we are losing etcd events. In such a case, you will need to do a full reload; to do that, launch etcdmirror with the --reload switch. Beware: doing so will ERASE ALL DATA on the destination cluster, so do it with extreme caution.
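One way to get a rough idea of how far behind the replica is (for example before deciding whether a full reload is needed) is to compare the index stored under /__replication/<prefix> on the slave with the current etcd index of the master, which the etcd v2 API reports in the X-Etcd-Index response header. A minimal sketch, with placeholder endpoints and no authentication handling:

 import requests

 MASTER = "https://conf1001.eqiad.wmnet:2379"   # placeholder; needs auth via the nginx proxy
 SLAVE = "http://127.0.0.1:2378"                # local, unauthenticated etcd on the slave

 # Index the mirror has replicated up to, as recorded by etcdmirror.
 replicated = int(
     requests.get(f"{SLAVE}/v2/keys/__replication/conftool").json()["node"]["value"]
 )

 # Current index on the master cluster, taken from the X-Etcd-Index header.
 current = int(requests.get(f"{MASTER}/v2/keys/conftool").headers["X-Etcd-Index"])

 print(f"replication lag: roughly {current - replicated} etcd events")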

Individual cluster configuration

We decided to proxy external connections to etcd through an nginx proxy that handles TLS and HTTP authentication and should be fully compatible with etcd's own behaviour. The reasons for this are that etcd's built-in authentication imposes a severe performance penalty on etcd itself, and that our TLS configuration for nginx is much better than what etcd offers. It also gives us the ability to switch the read-only status of a cluster on and off by flipping a switch in puppet; I don't know of any way to do this with the standard etcd mechanisms without actually removing users and/or roles, a slow process that is hard to automate/puppetize.

Concretely, on every host we have an etcd instance listening for client connections on http://127.0.0.1:2378 with no authentication, so local clients can write to it unauthenticated. It does, however, advertise https://$fqdn:2379 as its client URL; that is where nginx listens for external connections and enforces authentication.
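To make the two paths concrete, here is a minimal client-side sketch (assuming the requests library; hostnames and credentials are placeholders). The exact response to a rejected write depends on the nginx configuration, so treat it as an illustration rather than a reference.

 import requests

 # Local clients on a cluster member talk to etcd directly, unauthenticated.
 local = requests.get("http://127.0.0.1:2378/v2/keys/conftool", params={"recursive": "true"})

 # Remote clients go through nginx on port 2379: TLS plus HTTP authentication.
 remote = requests.get(
     "https://conf1001.eqiad.wmnet:2379/v2/keys/conftool",   # placeholder host
     params={"recursive": "true"},
     auth=("conftool", "********"),                          # placeholder credentials
 )

 # On the slave cluster, the proxy is read-only for remote clients,
 # so a write like this is expected to be rejected.
 resp = requests.put(
     "https://conf2001.codfw.wmnet:2379/v2/keys/conftool/some/key",   # placeholder host and key
     data={"value": "test"},
     auth=("conftool", "********"),
 )
 print(resp.status_code)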

Operations

For the most part, you can refer to what is written in Etcd, but there are a few additional operations regarding replication that are not covered there.

Master cluster switchover

From https://phabricator.wikimedia.org/T166552

Play-by-play:

  1. Merge https://gerrit.wikimedia.org/r/356138
  2. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent' (begins read-only)
  3. sudo cumin 'R:class = role::configcluster' 'disable-puppet "etcd replication switchover"'
  4. Merge https://gerrit.wikimedia.org/r/#/c/356139
  5. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
  6. Merge https://gerrit.wikimedia.org/r/#/c/356136/ and update dns
  7. sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw; see the sketch after this list)
  8. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
  9. Merge https://gerrit.wikimedia.org/r/356341
  10. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent' (ends read-only)
  11. Merge and deploy https://gerrit.wikimedia.org/r/#/c/356137/
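Step 7 above primes the new slave (codfw) with the index from which replication should resume; switch_replica.py itself is not reproduced here. Conceptually, it records the new master's current etcd index under the /__replication/<prefix> key that etcdmirror reads on startup. A minimal sketch of that idea, with placeholder endpoints and no authentication handling:

 import requests

 NEW_MASTER = "https://conf1001.eqiad.wmnet:2379"   # cluster we will replicate FROM (placeholder)
 NEW_SLAVE = "http://127.0.0.1:2378"                # local etcd on the new slave host in codfw

 # Current etcd index of the new master, taken from the X-Etcd-Index header.
 index = requests.get(f"{NEW_MASTER}/v2/keys/conftool").headers["X-Etcd-Index"]

 # Record it where etcdmirror looks for its starting point.
 requests.put(f"{NEW_SLAVE}/v2/keys/__replication/conftool", data={"value": index})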