SLO/etcd main cluster
SLO Worksheet - etcd Main
etcd is an open source key-value store with a focus on reliability that is used to store configuration and state data for distributed systems. At WMF we run a number of etcd clusters; this document addresses the two etcd Main clusters, one installed in each of the primary datacenters, eqiad and codfw. A number of applications, including MediaWiki, read and write configuration and store state data in etcd.
etcd is owned by the Service Operations SRE team, which is responsible for all aspects including operation, scalability, backups and software updates. Contact: firstname.lastname@example.org and https://office.wikimedia.org/wiki/Contact_list#Service_Operations
An etcd Main cluster consists of 3 nodes. Each of the etcd nodes can answer read requests, but write requests are handled by a single node, the “leader”. If the leader node becomes non-functional, the remaining nodes (the “followers”) elect a new leader, which keeps the cluster functional. The election of the new leader is managed via the Raft algorithm. etcd itself replicates data between the nodes, and each node has a complete copy of the data. High availability is provided by etcd out of the box, through a combination of etcd client and server software.
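As a small worked example of why a 3-node cluster survives the loss of its leader: Raft needs a majority (quorum) of members to elect a leader and commit writes. The sketch below only does that arithmetic; the function names are illustrative and not part of etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of nodes that can be lost while a majority remains."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

For the 3-node etcd Main cluster this gives a quorum of 2, so one node (including the leader) can fail and the remaining two can still elect a leader.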
Etcd is a foundational service and does not have any hard dependencies beyond hardware and networking. It is worth pointing out that server hardware and networking have their own availability levels, which are in the 99% range. Etcd as configured is able to deal with certain types of failures within a local datacenter.
Etcd is a foundational service and does not have any soft dependencies beyond hardware and networking.
|pybal/LVS||retrieve LB pools servers lists, weights, state||custom python/twisted, host only||watch||will keep working until restart|
|varnish/traffic||retrieve list of backend servers||confd||3s||will keep working|
|gdnsd/auth dns||Write admin state files for discovery.wmnet records||confd||3s||will keep working|
|scap/deployment||Dsh lists||confd||60 s||will keep working|
|redis||Replica configuration (changes NOT applied)||confd||60 s||will keep working|
|parsoid||Use of http or https to connect to the mw api||confd||60 s||will keep working|
|MediaWiki||fetch some config variables||php-curl. PHP requests at intervals, cached in APCu.||10 s||will keep working until restart|
|Icinga servers||Update a local cache of the last modified index to be used by other checks||cURL||30 s||the checks will use stale data for comparison|
|conftool/dbctl/cumin||Used to pool/depool hosts or datacenters and populate data after a puppet-merge||Python requests||N/A||Will fail, but that’s ok|
Confd: a lightweight configuration management daemon focused on keeping local configuration files up-to-date using data stored in etcd.
Host only: the client can only connect to a single host and has no failover capability.
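The "will keep working" entries in the table rely on clients holding on to the last value they read. The following is a minimal sketch of that pattern, assuming a hypothetical `fetch` callable that reads from etcd; it does not reproduce how confd, PyBal or MediaWiki actually implement it.

```python
import time
from typing import Callable, Dict, Optional


class CachedConfig:
    """Serve the last known value when the backing store is unreachable."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # hypothetical reader, e.g. an etcd GET
        self._ttl = ttl_seconds      # refresh interval, e.g. 10 s
        self._clock = clock
        self._value: Optional[Dict[str, str]] = None
        self._fetched_at = float("-inf")

    def get(self) -> Dict[str, str]:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                if self._value is None:
                    # Nothing cached yet, e.g. right after a restart.
                    raise
                # Otherwise keep serving stale data and retry on the next call.
        return self._value
```

Because the cache lives in process memory, a restart during an etcd outage leaves the client with nothing to serve, which matches the "will keep working until restart" failure mode.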
- GET/QGET/HEAD requests. These requests only read from the datastore. They make up the bulk of the requests: over the course of 30 days they amount to 120-130 million GET requests and 350-400 million Quorum GET (QGET) requests.
- PUT/DELETE requests. These are only sent by conftool, dbctl and cumin, and are very rare compared to the read requests. Over the course of 30 days, we count DELETEs in the ballpark of 50-100 and PUTs on the order of 3000-4000.
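To put the 30-day request counts above into perspective, the sketch below converts them into average per-second rates and a write share. It only uses the ranges quoted in the two bullets; the averages say nothing about peaks.

```python
SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000

# Ranges quoted above (requests per 30 days).
reads_low = 120_000_000 + 350_000_000    # GET + QGET, low end
reads_high = 130_000_000 + 400_000_000   # GET + QGET, high end
writes_low = 50 + 3000                   # DELETE + PUT, low end
writes_high = 100 + 4000                 # DELETE + PUT, high end


def per_second(count: int) -> float:
    """Average rate over a 30-day window."""
    return count / SECONDS_PER_30_DAYS


def write_share(reads: int, writes: int) -> float:
    """Fraction of all requests that are writes."""
    return writes / (reads + writes)


if __name__ == "__main__":
    print(f"average reads/s: {per_second(reads_low):.0f} "
          f"to {per_second(reads_high):.0f}")
    print(f"write share (upper bound): "
          f"{write_share(reads_low, writes_high):.6%}")
```

Averaged over the window, this works out to roughly 180 to 205 read requests per second, with writes making up well under 0.001% of traffic.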
Service Level Indicators (SLI)
SRE looks at three SLIs for etcd: availability, acceptable latency rate and error rate.
We measure over a three-month SLO period that ends one month before the end of the fiscal quarter, for reporting reasons; i.e. the SLO periods are December through February, and so forth.
Availability is determined by monitoring all transactions and is calculated over the SLO period: the number of transactions that were neither failed (any error except 404) nor slow (i.e. over 32 ms), divided by all transactions.
Acceptable latency is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the number of transactions that took 32 ms or less, divided by all transactions.
Error rate is determined by monitoring all transactions against the etcd store and is calculated over the SLO period: the number of failed transactions divided by all transactions.
SLI for GET requests in %:
100% * (all GET transactions - GET transactions slower than 32 ms) / all GET transactions
SLI for Errors:
100% * (transactions with an error code >= 500) / all transactions
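The SLI definitions above can be sketched in code. This is an illustration only, not the production calculation: the `Transaction` record and the sample values are hypothetical. Following the text, availability counts any error except 404 as failed, while the error SLI counts only status codes of 500 and above.

```python
from dataclasses import dataclass
from typing import List

SLOW_THRESHOLD_MS = 32.0


@dataclass
class Transaction:
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to serve the request


def is_slow(t: Transaction) -> bool:
    return t.latency_ms > SLOW_THRESHOLD_MS


def availability(txs: List[Transaction]) -> float:
    """Percent of transactions neither failed (except 404) nor slow."""
    bad = sum(1 for t in txs
              if (t.status >= 400 and t.status != 404) or is_slow(t))
    return 100.0 * (len(txs) - bad) / len(txs)


def latency_sli(txs: List[Transaction]) -> float:
    """Percent of transactions served within the 32 ms threshold."""
    slow = sum(1 for t in txs if is_slow(t))
    return 100.0 * (len(txs) - slow) / len(txs)


def error_sli(txs: List[Transaction]) -> float:
    """Percent of transactions with a status code >= 500."""
    errors = sum(1 for t in txs if t.status >= 500)
    return 100.0 * errors / len(txs)


if __name__ == "__main__":
    sample = [
        Transaction(200, 5.0),
        Transaction(200, 40.0),  # slow
        Transaction(404, 3.0),   # not counted as an error
        Transaction(500, 10.0),  # server error
        Transaction(200, 8.0),
    ]
    print(availability(sample), latency_sli(sample), error_sli(sample))
```

A transaction that is both failed and slow is counted once in the availability calculation, which is why the sketch works on individual records rather than on pre-aggregated counters.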
Reference: the golden signals include latency, traffic, errors, and saturation.
- Latency: p99 of about 15 ms for GET
- Latency: p99 of about 2 ms (that’s 50%) for QGET
- Failures: 0.00 failures per second (fps) at about 300 requests per second (rps)
All errors are currently 404s, which are not counted; we only count 500-type errors.
The health of the etcd Main cluster is monitored by Icinga. In case of alerts, follow the troubleshooting procedures referenced in the alert message, which are hosted on Wikitech at Etcd/Main cluster.
The service isn’t often deployed. It only gets deployed when doing OS upgrades.
Service Level Objectives
- Request SLO: 99.9% of requests will be successful, resulting in a Request Error Rate Budget of 0.1% of requests.
- Latency SLO: 99.8% of requests will be under 32 ms, resulting in a Latency Error Budget of 0.2%.
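To make the two budgets concrete, the sketch below converts them into absolute request counts. The request total is an assumption for illustration: it reuses the low end of the 30-day read volume quoted earlier (about 470 million), whereas the real SLO period is three months.

```python
def error_budget(total_requests: int, slo_percent: float) -> float:
    """Number of requests allowed to miss the objective."""
    return total_requests * (100.0 - slo_percent) / 100.0


def budget_consumed(bad_requests: int, total_requests: int,
                    slo_percent: float) -> float:
    """Fraction of the error budget used so far (1.0 means exhausted)."""
    return bad_requests / error_budget(total_requests, slo_percent)


if __name__ == "__main__":
    total = 470_000_000  # assumed: illustrative 30-day request volume
    print(f"request budget: {error_budget(total, 99.9):,.0f} failed requests")
    print(f"latency budget: {error_budget(total, 99.8):,.0f} slow requests")
    # Hypothetical: 94,000 failed requests observed so far.
    print(f"consumed: {budget_consumed(94_000, total, 99.9):.0%}")
```

With that volume, the Request SLO allows roughly 470,000 failed requests and the Latency SLO roughly 940,000 slow requests per 30 days.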
We have metrics for etcd. A dashboard was previously lacking but now exists.
Examples of past issues: performance degradation due to RAID resyncing, which starved etcd of the IOPS it needed.
- AWS EC2 SLA: 90% single EC2 VM
- Azure: 95% for single VM with HD, 99.5% for VM with SSD, Premium VM 99.9%