etcd


etcd is a distributed key/value store.

Use at WMF

We currently have:

  1. One cluster in eqiad for general use, running with https and client AUTH, part of the Etcd/Main cluster
  2. One cluster in codfw for general use, running with https and client AUTH, part of the Etcd/Main cluster
  3. One cluster on ganeti for kubernetes in eqiad, running with https; access is firewall-controlled.

Example "hello world" query listing keys

razzi@cumin1001:~$ etcdctl -C https://conf1004.eqiad.wmnet:4001 ls /conftool/v1/pools/eqiad/
/conftool/v1/pools/eqiad/phabricator
/conftool/v1/pools/eqiad/aqs
...
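
The same listing can also be read from etcd's v2 HTTP keys API. The following is a minimal sketch, assuming the v2 API is enabled on the endpoint and that reads need no client authentication; the function names are illustrative and not part of any WMF tooling.

```python
import json
import urllib.request


def keys_url(endpoint, directory):
    """Build the v2 keys API URL for a directory."""
    return endpoint.rstrip("/") + "/v2/keys/" + directory.strip("/")


def child_keys(response):
    """Return the keys directly under a directory from a v2 GET response."""
    node = response.get("node", {})
    return [child["key"] for child in node.get("nodes", [])]


def list_keys(endpoint, directory):
    """Fetch a directory listing; needs network access to the cluster."""
    with urllib.request.urlopen(keys_url(endpoint, directory)) as resp:
        return child_keys(json.load(resp))


# Offline demonstration with a canned response shaped like the v2 API's.
sample = {
    "action": "get",
    "node": {
        "key": "/conftool/v1/pools/eqiad",
        "dir": True,
        "nodes": [
            {"key": "/conftool/v1/pools/eqiad/phabricator", "dir": True},
            {"key": "/conftool/v1/pools/eqiad/aqs", "dir": True},
        ],
    },
}
print(child_keys(sample))
```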

Operations

Note: There's no TLS for peer communications yet, so pay close attention to http vs https in the URLs and the port numbers used in various places.

Bootstrapping an etcd cluster

Before starting, there are a couple of things to keep in mind:

  • The suggested etcd version is 3.x; version 2.x is unsupported. Note that the version of etcd is not tied to the version of the protocol: etcd 3 speaks both version 2 and version 3 of the protocol.
  • The supported production method to bootstrap an etcd cluster is by using DNS SRV records for discovery.
  • Traffic between nodes of the etcd cluster is encrypted, so TLS Certificates will be needed for both of the above use cases.
    • These certificates can now be created automatically by the PKI puppet module. Configure the use_pki_certs parameter to be true when applying profile::etcd::v3 in order to use this functionality.
    • Prior to this method, Cergen was the tool used to create the keys and certificates. For examples, check the yaml configs in the puppet private repo.
  • Clients can be forced to use TLS client auth if needed, by adding the ::profile::etcd::v3::tlsproxy profile to the cluster role's config.
  • The etcd cluster needs to be configured so that nodes know about each other. Set the profile::etcd::v3::discovery hiera setting to dns:<SRV_RECORD_NAME>, which enables auto-discovery via DNS SRV records. For example, if you set dns:k8s3.%{::site}.wmnet, then something like the following needs to be added to the DNS repo:
    # eqiad
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd1004.eqiad.wmnet.
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd1005.eqiad.wmnet.
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd1006.eqiad.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd1004.eqiad.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd1005.eqiad.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd1006.eqiad.wmnet.
    
    # codfw
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd2004.codfw.wmnet.
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd2005.codfw.wmnet.
    templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M  IN SRV      0 1 2380 kubetcd2006.codfw.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd2004.codfw.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd2005.codfw.wmnet.
    templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M  IN SRV      0 1 2379 kubetcd2006.codfw.wmnet.
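
Record sets like the ones above are repetitive, so they are easy to generate. The following is a minimal sketch that prints the server (port 2380) and client (port 2379) SRV lines for a list of hosts. The function name and defaults are illustrative, and the `templates/wmnet:` file prefix shown above is left out.

```python
def srv_records(cluster, hosts, ttl="5M", server_port=2380, client_port=2379):
    """Return zone-file lines for the etcd server and client SRV records."""
    lines = []
    for service, port in (("_etcd-server-ssl", server_port),
                          ("_etcd-client-ssl", client_port)):
        for host in hosts:
            # SRV targets are fully qualified, so make sure of the final dot.
            target = host if host.endswith(".") else host + "."
            lines.append(
                f"{service}._tcp.{cluster} {ttl}  IN SRV      0 1 {port} {target}"
            )
    return lines


eqiad_hosts = [
    "kubetcd1004.eqiad.wmnet",
    "kubetcd1005.eqiad.wmnet",
    "kubetcd1006.eqiad.wmnet",
]
records = srv_records("k8s3", eqiad_hosts)
print("\n".join(records))
```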
    

Let's now follow the procedure to bootstrap a new cluster composed of various etcd100x.example.com nodes. You can do the following:

  1. Assign the profile profile::etcd::v3 to your servers' role; if using TLS client auth, also add profile::etcd::v3::tlsproxy.
  2. Define the following variables via hiera:
# Name of the cluster.
profile::etcd::v3::cluster_name: "<CLUSTER_NAME>"
# Set to true when first building the cluster; set to false when adding/removing members
profile::etcd::v3::cluster_bootstrap: true
# Set this to "dns:<SRV_RECORD_NAME>" to use dns discovery
profile::etcd::v3::discovery: "dns:pinkunicorn.%{site}.wmnet"
# Set to true if you want to use client cert auth. Recommended: false.
profile::etcd::v3::use_client_certs: false
profile::etcd::v3::do_backup: false
profile::etcd::v3::allow_from: "$DOMAIN_NETWORKS"
# For the TLS proxy, you need the following variables too:
# This cert is generated using puppet-ecdsacert, and includes
# all the hostnames for the etcd machines in the SANs
# Will need to be regenerated if we add servers to the cluster.
profile::etcd::v3::tlsproxy::cert_name: "etcd.%{::domain}"
profile::etcd::v3::tlsproxy::acls: { /: ["root"], /conftool: ["root", "conftool"], /eventlogging: []}
# This should come from the private hieradata
#profile::etcd::v3::tlsproxy::salt
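
This page describes two forms for the discovery setting: "dns:<SRV_RECORD_NAME>", and a static member list in the format of ETCD_INITIAL_CLUSTER. The following is a minimal sketch that tells the two apart, which can help when sanity-checking a hiera value before running puppet. How the Puppet profile itself interprets the value is not shown on this page, so treat the function as an illustration only.

```python
def parse_discovery(value):
    """Classify a discovery setting as DNS-based or a static member list."""
    if value.startswith("dns:"):
        return {"mode": "dns", "srv_name": value[len("dns:"):]}
    members = {}
    for item in value.split(","):
        name, _, url = item.strip().partition("=")
        if not name or not url:
            raise ValueError(f"not a name=url pair: {item!r}")
        members[name] = url
    return {"mode": "static", "members": members}


print(parse_discovery("dns:pinkunicorn.eqiad.wmnet"))
print(parse_discovery(
    "etcd1XXX=http://etcd1XXX.example.com:2380,"
    "etcd1YYY=http://etcd1YYY.example.com:2380"
))
```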

Now run puppet on one node, and it should bring up an etcd cluster. You can verify this with:

$ etcdctl -C https://$(hostname -f):2379 cluster-health
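
A single member can also be probed from a script. The following is a minimal sketch, assuming the member exposes etcd's /health endpoint on its client port and reports a JSON body such as {"health": "true"}. Unlike cluster-health, it checks only the member it is pointed at, and the function names are illustrative.

```python
import json
import urllib.request


def is_healthy(body):
    """Interpret the JSON body returned by a member's /health endpoint."""
    value = json.loads(body).get("health")
    # Accept both a JSON boolean and the string form "true".
    return value is True or str(value).lower() == "true"


def check_member(endpoint):
    """Query one member; needs network access to its client port."""
    url = endpoint.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return is_healthy(resp.read())


print(is_healthy('{"health": "true"}'))
```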

Now you can run puppet on the other nodes of the cluster and they should come up and be configured correctly.

Once verified, flip the profile::etcd::v3::cluster_bootstrap hiera variable from 'true' to 'false', and continue adding more nodes via the following procedure.

Adding a new member to the cluster

Say we want to add a new server called etcd1YYY.example.com to our cluster (the example is deliberately unrelated to any actual host). The steps are as follows:

  1. Add the member via the members API, running the etcdctl tool against one of the existing members, e.g. etcd1XXX.
    $ etcdctl --endpoints https://etcd1XXX.example.com:2379 member add etcd1YYY https://etcd1YYY.example.com:2380
    Added member named etcd1YYY with ID 5f62a924ac85910 to cluster
    
    ETCD_NAME="etcd1YYY"
    # Next line is broken down artificially for ease of reading
    ETCD_INITIAL_CLUSTER="etcd1XXX=http://etcd1XXX.example.com:2380,
                          etcd1YYY=http://etcd1YYY.example.com:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    
    Write down the output as it will be useful for our puppet changes.
  2. Assign the etcd role to the node in puppet.
  3. If not using discovery SRV records (which should be an edge case we don't have yet; consult with someone first), set the following variable for the whole cluster:
    profile::etcd::v3::discovery set to the value of ETCD_INITIAL_CLUSTER from the output of the earlier etcdctl command
  4. Run puppet on the host. It should join the cluster. Confirm this is the case with the other hosts in the cluster as well (the logs should stop complaining about not reaching the new member)
  5. Finally, add the new server to the SRV records that clients consume.
  6. Make sure to restart navtiming (on webperf hosts), as it is a long-running process and doesn't refresh etcd SRV records once it is started.
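
The variables printed by the member add command in step 1 can be captured programmatically instead of being copied by hand. The following is a minimal sketch, assuming the output has the shape shown in step 1 with each ETCD_* variable on a single line; the function names are illustrative.

```python
def parse_member_add_env(output):
    """Extract the ETCD_* variables printed by `etcdctl member add`."""
    env = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("ETCD_") or "=" not in line:
            continue  # skip the "Added member ..." line and blank lines
        name, _, raw = line.partition("=")
        env[name] = raw.strip('"')
    return env


def initial_cluster_members(env):
    """Split ETCD_INITIAL_CLUSTER into a name -> peer URL mapping."""
    pairs = env["ETCD_INITIAL_CLUSTER"].split(",")
    return dict(pair.split("=", 1) for pair in pairs)


output = """Added member named etcd1YYY with ID 5f62a924ac85910 to cluster

ETCD_NAME="etcd1YYY"
ETCD_INITIAL_CLUSTER="etcd1XXX=http://etcd1XXX.example.com:2380,etcd1YYY=http://etcd1YYY.example.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
"""
env = parse_member_add_env(output)
print(env["ETCD_INITIAL_CLUSTER_STATE"])
print(initial_cluster_members(env))
```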

Removing a member from the cluster

  1. Verify that the node you want to remove is not the current leader, since removing the leader could cause trouble:
    $ curl -k -L https://etcd1001:2379/v2/stats/leader
    {"message":"not current leader"}
    
  2. Remove the server from the clients' SRV record.
  3. Dynamically remove the server from the cluster:
    $ etcdctl -C https://conf1001.example.com:2379 member remove etcd1001 http://etcd1001.example.com:2380
    $ etcdctl -C https://conf1001.example.com:2379 cluster-health
    
  4. Remove the server from the cluster's SRV record if present, or from the hiera variable profile::etcd::v3::discovery if not using SRV records.
  5. Make sure to restart navtiming (on webperf hosts) as it is a long running process and doesn't refresh etcd SRV records once it is started.
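
The leader check in step 1 can be scripted as a guard before removal. The following is a minimal sketch that reads the body returned by /v2/stats/leader and treats the node as safe to remove only when it explicitly reports "not current leader", as in the example above. Treating every other answer as unsafe is a conservative assumption, and the function name is illustrative.

```python
import json


def safe_to_remove(body):
    """Return True only if the member explicitly says it is not the leader."""
    try:
        data = json.loads(body)
    except ValueError:
        return False  # unparseable answer: do not assume anything
    return isinstance(data, dict) and data.get("message") == "not current leader"


print(safe_to_remove('{"message":"not current leader"}'))
```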

Recover a cluster after a disaster

If Raft consensus is lost and there is no quorum anymore, the only way to recover the cluster is to restore the data from a backup. Backups are taken every night and stored in /srv/backups/etcd. The procedure to bring the cluster back is roughly as follows:

  • Stop all etcd instances that might be still running
  • Copy the backup to a new location and start etcd from there with the --force-new-cluster option, listening on the public client endpoints. It will start with peer URLs bound to localhost.
  • Change the peer url of this server to what you'd expect it to be in normal situations
  • Add your other servers to the cluster, as follows:
    • Verify the original etcd data are removed
    • Add the server to the cluster logically with etcdctl
    • Start etcd in order to join the cluster.

As usual with etcd, the devil lies in the details of the command-line options; but there is a python script that, given the current cluster configuration, can generate the correct commands you'll have to enter into a shell. It can be found in the paste at P3855.
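
Independently of that script, whose contents are not reproduced here, the outline above can be sketched as a generator of command strings. This is an illustration under stated assumptions: it uses the etcdctl v2 member subcommands, the peer URL scheme is a parameter because it depends on the cluster, the backup path and host names are hypothetical, and the node's usual name, listen and TLS options are deliberately omitted. Nothing is executed.

```python
def recovery_commands(first, others, backup_copy, client_url, peer_scheme="http"):
    """Return (host, command) pairs for the rough recovery sequence.

    <MEMBER_ID> is a placeholder to be read from the `member list` output.
    """
    steps = [
        # Start a one-member cluster from the copied backup. The node's usual
        # name, listen and TLS options are omitted and must be added.
        (first, f"etcd --data-dir {backup_copy} --force-new-cluster"),
        (first, f"etcdctl -C {client_url} member list"),
        # Replace the localhost peer URL with the real one.
        (first, f"etcdctl -C {client_url} member update <MEMBER_ID> "
                f"{peer_scheme}://{first}:2380"),
    ]
    for host in others:
        name = host.split(".")[0]
        # Register each remaining node before starting etcd on it with an
        # empty data directory.
        steps.append((first, f"etcdctl -C {client_url} member add {name} "
                             f"{peer_scheme}://{host}:2380"))
    return steps


steps = recovery_commands(
    "etcd1001.example.com",
    ["etcd1002.example.com", "etcd1003.example.com"],
    "/srv/etcd-recovery",  # hypothetical copy of the nightly backup
    "https://etcd1001.example.com:2379",
)
for host, command in steps:
    print(f"[{host}] {command}")
```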

Reimage nodes in a cluster

If you need to reimage nodes in a cluster, there are two strategies that you can follow:

  • Reimage one node at a time, while preserving the distributed log's data. This strategy works only if you remove and re-add the node via etcdctl after every reimage, since otherwise etcd will refuse to start on it (complaining that the Raft log is not up to date).
  • Reimage all nodes at once, hence not preserving the distributed log's data.

In both of the above cases the cluster needs to be configured in status "new" (and not "existing"), via profile::etcd::v3::cluster_bootstrap: true

Another idea could be to stop all the etcd daemons on all the nodes, and reimage one node at a time. This may work, but since we use ETCD_DISCOVERY_SRV, etcd is likely to contact the other nodes in the cluster while bootstrapping (for example, to do leader election), ending up in connection failures.
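
When choosing between these strategies it helps to keep the Raft majority rule in mind: a cluster of n members needs n // 2 + 1 of them available to keep quorum. The following short sketch shows the arithmetic; with three members, one node can be down for a reimage as long as the other two stay up.

```python
def quorum(members):
    """Smallest number of members that forms a Raft majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be unavailable while quorum is kept."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} down")
```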

See also