SLO/etcd main cluster


SLO Worksheet - etcd Main

Service

etcd is an open-source key-value store with a focus on reliability, used to store configuration and state data for distributed systems. At WMF we run a number of etcd clusters; this document addresses the two etcd Main clusters, one installed in each of the primary datacenters, eqiad and codfw. A number of applications, including MediaWiki, read and write configuration and state data on etcd.

Teams

etcd is owned by the Service Operations SRE team, which is responsible for all aspects, including operation, scalability, backups and software updates. Contact: sre-serviceops@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Service_Operations

Architectural

An etcd Main cluster consists of 3 nodes. Each etcd node can answer read requests, but write requests are handled by a single node, the “leader”. If the leader node becomes non-functional, the remaining nodes (the “followers”) elect a new leader, keeping the cluster functional. Leader election is managed via the Raft consensus algorithm. etcd itself replicates data between the nodes, and each node holds a complete copy of the data. High availability is provided by etcd out of the box, through a combination of the etcd client and server software.
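
As a rough illustration of the leader/follower split, here is a minimal Python sketch that polls each member’s v2 stats endpoint and reports which node currently holds the Raft leadership. The hostnames are hypothetical placeholders, not the real cluster nodes.

    import requests

    # Hypothetical member hostnames; substitute the real cluster nodes.
    NODES = ["etcd1001.example.wmnet", "etcd1002.example.wmnet", "etcd1003.example.wmnet"]

    def find_leader(nodes, port=2379):
        for node in nodes:
            try:
                stats = requests.get(f"http://{node}:{port}/v2/stats/self", timeout=2).json()
            except requests.RequestException:
                continue  # an unreachable member must not stop the survey
            # etcd v2 reports "StateLeader" for the elected leader and
            # "StateFollower" for everyone else.
            if stats.get("state") == "StateLeader":
                return node
        return None  # no leader reachable

    print(find_leader(NODES) or "no leader found")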

Hard Dependencies

Etcd is a foundational service and has no hard dependencies beyond hardware and networking. It is worth pointing out that server hardware and networking have their own failure rates, which are in the 99% availability range (see the hardware figures under References). Etcd as configured is able to tolerate certain types of failures within a local datacenter.

Soft Dependencies

Etcd is a foundational service and does not have any soft dependencies beyond hardware and networking.

Client-facing

Clients

| software | use | connection | interval | failure mode (etcd down) |
|----------|-----|------------|----------|--------------------------|
| pybal/LVS | retrieve LB pools: server lists, weights, state | custom python/twisted, host only | watch | will keep working until restart |
| varnish/traffic | retrieve list of backend servers | confd | 3 s | will keep working |
| gdnsd/auth dns | write admin state files for discovery.wmnet records | confd | 3 s | will keep working |
| scap/deployment | dsh lists | confd | 60 s | will keep working |
| redis | replica configuration (changes NOT applied) | confd | 60 s | will keep working |
| parsoid | use of http or https to connect to the MW API | confd | 60 s | will keep working |
| MediaWiki | fetch some config variables | php-curl; PHP requests at intervals, cached in APCu | 10 s | will keep working until restart |
| Icinga servers | update a local cache of the last modified index, used by other checks | cURL | 30 s | the checks will use stale data for comparison |
| conftool/dbctl/cumin | pool/depool hosts or datacenters and populate data after a puppet-merge | Python requests | N/A | will fail, but that’s ok |
| spicerack/cookbooks | acquire distributed locks with concurrency and TTL to limit concurrent actions | python-etcd | N/A | can be disabled via config file or per-run with the --no-locks CLI flag |

Confd: a lightweight configuration management daemon focused on keeping local configuration files up-to-date using data stored in etcd.

Host only: the client can only connect to a single host and has no failover capability.
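
Most of the clients above share the same pattern: re-read a key at a fixed interval and, when etcd is unreachable, keep serving the last value fetched (the “will keep working” failure mode). Below is a minimal Python sketch of that pattern; the endpoint and key are hypothetical.

    import time
    import requests

    ETCD = "http://etcd.example.wmnet:2379"  # hypothetical endpoint
    KEY = "/v2/keys/pools/appservers"        # hypothetical key

    def poll_forever(interval=60):
        last_known = None  # last successfully fetched value
        while True:
            try:
                resp = requests.get(ETCD + KEY, timeout=2)
                resp.raise_for_status()
                last_known = resp.json()["node"]["value"]
            except requests.RequestException:
                pass  # etcd down: keep using the cached value
            if last_known is not None:
                print("rendering local config from:", last_known)
            time.sleep(interval)  # the "interval" column in the table above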

Request Classes

Reads:

  • GET/QGET/HEAD requests. These requests only read from the datastore. They are the bulk of the traffic: over the course of 30 days we count 120-130 million GET requests and 350-400 million Quorum GET (QGET) requests. A sketch of the two read classes follows after these lists.

Writes:

  • PUT/DELETE requests. These are sent only by conftool, dbctl and cumin, and are very rare compared to reads. Over the course of 30 days, we count DELETEs in the ballpark of 50-100 and PUTs on the order of 3000-4000.
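
In the etcd v2 keys API the difference between the two read classes is a single query parameter: a plain GET may be answered from a member’s local state, while quorum=true (a QGET) is linearized through the leader. A minimal sketch, with a hypothetical endpoint and key:

    import requests

    ETCD = "http://etcd.example.wmnet:2379"  # hypothetical endpoint
    key = "/v2/keys/conftool/v1/mediawiki"   # hypothetical key

    plain = requests.get(ETCD + key)                              # GET
    quorum = requests.get(ETCD + key, params={"quorum": "true"})  # QGET
    print(plain.status_code, quorum.status_code)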

Service Level Indicators (SLI)

SRE looks at three SLIs for etcd: availability, acceptable latency rate and error rate.

We measure over a three-month SLO period that ends one month before the end of the fiscal quarter, for reporting reasons; i.e. one SLO period runs December through February, the next March through May, and so forth.

Availability is determined by monitoring all transactions and is calculated over the SLO period: all transactions minus those that failed (any error except 404) or were slow (over 32 ms), divided by all transactions.

Acceptable latency is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: all transactions minus those over 32 ms, divided by all transactions.

Error rate is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the number of failed transactions divided by all transactions.

SLI for GET requests in %:

   100% * (all GET transactions - GET transactions slower than 32 ms) / all GET transactions

SLI for Errors:

    100% * (transactions with an error code >= 500) / all transactions
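
A minimal Python sketch of the three calculations above, run against made-up sample transactions that each carry a latency and an HTTP status:

    SLOW_MS = 32  # latency threshold from the definitions above

    txns = [  # made-up sample data
        {"ms": 4, "status": 200},
        {"ms": 40, "status": 200},  # slow: counts against latency and availability
        {"ms": 3, "status": 404},   # 404s are excluded from the error count
        {"ms": 5, "status": 500},   # failed: counts against errors and availability
    ]

    total = len(txns)
    slow = sum(t["ms"] > SLOW_MS for t in txns)
    failed = sum(t["status"] >= 500 for t in txns)
    bad = sum(t["ms"] > SLOW_MS or t["status"] >= 500 for t in txns)

    availability = 100 * (total - bad) / total
    latency_sli = 100 * (total - slow) / total
    error_rate = 100 * failed / total
    print(f"availability={availability}% latency={latency_sli}% errors={error_rate}%")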

Reference: the four golden signals are latency, traffic, errors, and saturation.

Operational

Monitoring

Sample values:

  • Latency: GET p99 ±15 ms
  • Latency: QGET p99 ±2 ms (that’s 50%)
  • Failures: 0.00 fps at about 300 rps

At the moment all errors are 404s, which are not counted; we only count 5xx errors.

Troubleshooting

The health of the etcd Main cluster is monitored by Icinga. In case of alerts, follow the troubleshooting procedures referenced in the alert message, hosted on Wikitech at Etcd/Main cluster.

Deployment

The service is deployed infrequently, typically only when doing OS upgrades.

Service Level Objectives

  • Request SLO: 99.9% of requests will be successful, resulting in a request error budget of 0.1% of requests.
  • Latency SLO: 99.8% of requests will complete in under 32 ms, resulting in a latency error budget of 0.2% of requests (see the rough arithmetic below).
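
For a rough sense of scale, combining the 30-day read volumes from the Request Classes section (~125 million GET plus ~375 million QGET, taking midpoints) and scaling to the three-month SLO period gives the approximate budgets below; the figures are order-of-magnitude only.

    reads_per_30d = 125e6 + 375e6         # GET + QGET midpoints from Request Classes
    reads_per_period = reads_per_30d * 3  # the SLO period is ~three months

    error_budget = 0.001 * reads_per_period    # 0.1% of requests may fail
    latency_budget = 0.002 * reads_per_period  # 0.2% of requests may exceed 32 ms

    print(f"error budget:   ~{error_budget:,.0f} requests per SLO period")
    print(f"latency budget: ~{latency_budget:,.0f} requests per SLO period")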

References

Site Reliability Engineering (the Google SRE book)

We have metrics for etcd; a dashboard now exists.

Examples of past issues: performance degradation due to RAID resyncing starving etcd of needed IOPS.

Hardware failures

  • AWS EC2 SLA: 90% for a single EC2 VM
  • Azure: 95% for a single VM with HDD, 99.5% for a VM with SSD, 99.9% for a Premium VM