Incidents/2024-02-28 etcd-mirror

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
  Incident ID: 2024-02-28 etcd-mirror
  Task: T358636
  Start: 2024-02-28 01:39
  End: 2024-02-28 02:52
  People paged: 15
  Responder count: 2
  Coordinators: 1
  Affected metrics/SLOs: None. Each of the two etcd clusters was healthy, so the etcd SLO was unaffected. Replication is not covered by an SLO.
  Impact: No user impact. Near miss for split-brain config scenarios, as described below.

Conftool data, stored in etcd, is normally replicated from one data center to the other (currently eqiad to codfw) using an in-house tool called etcd-mirror. Etcd-mirror is intentionally brittle in the face of replication problems: when it encounters an unexpected situation, it stops replicating and pages us, rather than attempting to self-recover and risking corruption of the etcd state.

Due to an unusual sequence of data changes (detailed below), etcd-mirror's long-running watch lost track of the up-to-date state of etcd in eqiad, and replication stopped. This had no immediate user impact, but subsequent conftool changes (like pooling/depooling hosts in load-balanced clusters, pooling/depooling data centers in service discovery, or requestctl changes) would not have been propagated between data centers, leaving us without emergency tools and potentially causing a dangerous skew in production config.
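
For context, etcd-mirror follows the source cluster with the etcd v2 watch API, long-polling for the next event after the index it last replicated. The sketch below is a simplified illustration of that loop, not etcd-mirror's actual code; the endpoint and prefix are taken from the commands later in this report, and the Python requests library is used for brevity. Because etcd v2 retains only the most recent 1,000 events, a watch whose waitIndex falls out of that window fails with error code 401 (EventIndexCleared), which is the condition that stopped replication here.

# Simplified sketch of watch-based replication against the etcd v2 API.
# Not etcd-mirror's actual code; endpoint and prefix are illustrative.
import requests

SOURCE = "https://conf1009.eqiad.wmnet:4001"
PREFIX = "/conftool"

def follow(last_replicated_index):
    wait_index = last_replicated_index + 1
    while True:
        try:
            resp = requests.get(
                SOURCE + "/v2/keys" + PREFIX,
                params={"wait": "true", "recursive": "true", "waitIndex": str(wait_index)},
                timeout=60,  # periodically re-issue the long-poll
            )
        except requests.exceptions.Timeout:
            continue
        body = resp.json()
        if body.get("errorCode") == 401:
            # EventIndexCleared: wait_index is older than the ~1,000 events etcd
            # still retains, so the watch can no longer prove it missed nothing.
            # etcd-mirror stops and pages rather than guessing what was skipped.
            raise RuntimeError("replication index no longer available in source cluster")
        node = body["node"]
        # ... write the event to the destination cluster here ...
        wait_index = node["modifiedIndex"] + 1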

Timeline

All times in UTC.

  • 00:00 Last conftool etcd event is successfully replicated from eqiad to codfw through etcd-mirror. (The midnight UTC timestamp is a coincidence.)
Feb 28 00:00:26 conf2005 etcdmirror-conftool-eqiad-wmnet[2619978]: [etcd-mirror] INFO: Replicating key /conftool/v1/request-ipblocks/cloud/linode at index 3020127
  • 00:00-01:39 An unusually large number of reimage and other cookbooks are run (acquiring and releasing locks in etcd), and there are no changes to conftool data. The result is 1,000 consecutive non-conftool etcd events, something we had not seen before.
  • 01:39 etcd-mirror fails:
Feb 28 01:39:44 conf2005 etcdmirror-conftool-eqiad-wmnet[2619978]: CRITICAL: The current replication index is not available anymore in the etcd source cluster.
Feb 28 01:39:44 conf2005 etcdmirror-conftool-eqiad-wmnet[2619978]: [etcd-mirror] CRITICAL: The current replication index is not available anymore in the etcd source cluster.
Feb 28 01:39:44 conf2005 etcdmirror-conftool-eqiad-wmnet[2619978]: [etcd-mirror] INFO: Restart the process with --reload instead.
Feb 28 01:39:46 conf2005 systemd[1]: etcdmirror-conftool-eqiad-wmnet.service: Main process exited, code=exited, status=1/FAILURE
  • 01:42 Alerts fire:
01:42:41 <jinxer-wm> (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
01:45:25 <jinxer-wm> (SystemdUnitFailed) firing: etcdmirror-conftool-eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
  • 01:46 rzl and swfrench consider restarting with --reload as indicated but decide to investigate first.
  • 02:03 swfrench curls https://conf2005.codfw.wmnet:4001/v2/keys/__replication and gets a value matching the index of the last successful replication at 00:00:26:
{
    "action": "get",
    "node": {
        "key": "/__replication",
        "dir": true,
        "nodes": [
            {
                "key": "/__replication/conftool",
                "value": "3020127",
                "modifiedIndex": 5150719,
                "createdIndex": 4271
            }
        ],
        "modifiedIndex": 2140,
        "createdIndex": 2140
    }
}
  • 02:06 rzl and swfrench consider whether to try restarting etcd-mirror without --reload, in case the error is trivially recoverable; as we're talking about it, the unit restarts itself and fails with the same error. This causes some confusion before we simply note that it didn't work and move on. (We thought it might have been someone participating in the incident response without speaking up on IRC; it was just Puppet happening to run at the instant we were talking about it.)
Feb 28 02:07:22 conf2005 systemd[1]: Started Etcd mirrormaker.
Feb 28 02:07:23 conf2005 etcdmirror-conftool-eqiad-wmnet[3349905]: [etcd-mirror] INFO: Current index read from /__replication/conftool
Feb 28 02:07:23 conf2005 etcdmirror-conftool-eqiad-wmnet[3349905]: [etcd-mirror] INFO: Starting replication at 3020127
Feb 28 02:07:23 conf2005 etcdmirror-conftool-eqiad-wmnet[3349905]: CRITICAL: The current replication index is not available anymore in the etcd source cluster.
Feb 28 02:07:23 conf2005 etcdmirror-conftool-eqiad-wmnet[3349905]: [etcd-mirror] CRITICAL: The current replication index is not available anymore in the etcd source cluster.
Feb 28 02:07:23 conf2005 etcdmirror-conftool-eqiad-wmnet[3349905]: [etcd-mirror] INFO: Restart the process with --reload instead.
Feb 28 02:07:23 conf2005 systemd[1]: etcdmirror-conftool-eqiad-wmnet.service: Main process exited, code=exited, status=1/FAILURE
  • 02:21 swfrench confirms the diagnosis with curl -v 'https://conf1009.eqiad.wmnet:4001/v2/keys/conftool?wait=true&waitIndex=3020127': the X-Etcd-Index response header is just over 1,000 events ahead of the wait index. That means more than 1,000 events had occurred outside the /conftool keyspace -- such as locks for Spicerack -- without a single event under /conftool, pushing the replication index out of etcd's retained event history. Since this was an unusually busy day for reimage cookbooks, with Spicerack locks newly added in T341973, this is both a new failure mode and a complete explanation of the root cause and trigger.
  • 02:25 swfrench considers manually advancing /__replication/conftool in codfw by 999. This would move the replication index to just before the end of the "safe" 1,000-event window in which we knew there were no conftool events, a point that also happened to be well within the 1,000-event retention window bounded by the then-current etcd index in eqiad (3021261); the sketch after the timeline walks through the arithmetic. (Advancing it beyond the safe window could skip events that should have been replicated, potentially leading to a dangerous split-brain condition.) The alternative would be restarting etcd-mirror with --reload, which obliterates the state of etcd in the replicated cluster before recreating it, something that has aggravated incidents in the past; in this case a successful --reload should have been safe, but if it hit a snag, we weren't sure we could recover quickly with the staff available.
  • 02:40 We decide to proceed with manually advancing the index, on the theory that if it doesn't work we can always proceed with an etcd-mirror reload; even if it does work, we can always follow up with a reload during working hours to ensure a clean state.
  • 02:50 swfrench runs curl https://conf2005.codfw.wmnet:2379/v2/keys/__replication/conftool -XPUT -d "value=3021126" on conf2005.
  • 02:52 swfrench restarts etcd-mirror, which resumes normally.
  • 02:55 Alerts recover:
02:55:25 <jinxer-wm> (SystemdUnitFailed) resolved: etcdmirror-conftool-eqiad-wmnet.service on conf2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
02:57:41 <jinxer-wm> (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
  • 03:03 swfrench depools, then repools mw2268, in order to test replication; the conftool events propagate via etcd-mirror successfully.
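
The back-of-the-envelope check below restates the index arithmetic from the 02:25 entry in code. The values are the ones observed during the incident; the exact off-by-one boundary of etcd's 1,000-event retention window is an assumption of this sketch (the real margin was comfortable either way).

# Sanity check of the manual index advance, using values from the timeline above.
last_replicated = 3020127   # /__replication/conftool in codfw, last synced at 00:00
eqiad_index     = 3021261   # then-current etcd index in eqiad, around 02:25
retention       = 1000      # etcd v2 retains roughly the most recent 1,000 events

# The 1,000 events after last_replicated are known to contain no /conftool
# changes; that gap is what made the watch fall behind in the first place.
safe_window_end = last_replicated + retention   # 3021127

# A resumed watch can only start from an index etcd still retains.
oldest_retained = eqiad_index - retention       # roughly 3020261

target = last_replicated + 999                  # 3021126, the value written at 02:50

assert target < safe_window_end       # we skip only events known not to touch /conftool
assert target + 1 > oldest_retained   # the resumed watch (waitIndex = target + 1) still works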

Detection

We were paged when etcd-mirror failed. (A prior incident was exacerbated after the paging was accidentally dropped in the migration to alertmanager; that was fixed in T317402 as a follow-up. The missing page was only a moderate factor in that incident, because SREs independently noticed errors in the etcd-dependent tools they were using. By contrast, if the alert hadn't paged us here, it's likely nobody would have noticed the problem for much longer.)

In addition to that EtcdReplicationDown alert, we also got a non-paging SystemdUnitFailed alert, which is redundant but not harmfully so (and might be a useful extra pointer for responders not familiar with etcd-mirror).

We also got five separate non-paging JobUnavailable alerts for etcd-mirror, at 01:48, 02:38, 02:38, 02:43, and 02:53, the last one arriving after etcd-mirror had resumed. These didn't contribute any extra information.

Conclusions

What went well?

  • Automated monitoring detected the replication failure.
  • Wikitech documentation on etcd and etcd-mirror, along with etcd's own documentation, gave us most of what we needed to root-cause and work around the problem.
  • Manually advancing the index value was both feasible (etcd's REST API let us modify the replication state directly) and an effective resolution (in part because there were fewer than 2,000 total unreplicated events, meaning we could advance to a still-available index with confidence that we weren't losing any data).

What went poorly?

  • T317537 hadn't been addressed after a prior incident, so we had little guidance on safely restarting etcd-mirror, especially with --reload.

Where did we get lucky?

  • Scott had just started digging into the internals of etcd-mirror, including a meeting less than 36 hours earlier on its failure modes, so he was a fresh subject matter expert and (despite not being on call yet) happened to be available at the time of the incident.
  • Nothing else went wrong simultaneously, during the time when conftool replication was unavailable.
  • Fewer than 2,000 non-conftool etcd events had occurred in eqiad by the time we identified a potential mitigation (otherwise we could not safely advance the index to a point that would allow replication to recover).

Links to relevant documentation

Actionables

  • Long-term solution: Switch conftool from etcd v2 to v3. As part of this effort, etcd-mirror will be replaced with an alternative solution for cross-site replication, a requirement for which is resilience to this class of problem. (T350565)
  • In the meantime, consider building the recovery described at https://etcd.io/docs/v2.3/api/#watch-from-cleared-event-index into etcd-mirror, so that it can detect and handle this situation automatically (a rough sketch follows this list). It may or may not be worth the engineering effort, depending on the timeline for moving to v3, but now that there is much more traffic to etcd keys outside of /conftool (notably because of Spicerack locking), this situation is more likely than it used to be, so we may want to be resilient to it even in the short term. Alternatively, we could replicate the entire keyspace rather than just /conftool, if etcd-mirror can sustain the load. (T358636)
  • Update Etcd/Main cluster#Replication documentation with safe restart conditions and information. Populate the "TODO" playbook link for EtcdReplicationDown in the process. (T317537)
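
For reference, the recovery pattern from the etcd v2 documentation looks roughly like the code below. This is a sketch of the general approach, not a patch for etcd-mirror: handle_event() and resync_destination() are hypothetical placeholders for etcd-mirror's own logic, and skipping ahead is only safe if the destination is fully reconciled against the freshly read source state.

# Sketch of the "watch from cleared event index" recovery from the etcd v2 docs,
# adapted to a replicator. handle_event() and resync_destination() are
# hypothetical stand-ins for what etcd-mirror would actually do.
import requests

SOURCE = "https://conf1009.eqiad.wmnet:4001"
PREFIX = "/conftool"

def watch_and_recover(wait_index, handle_event, resync_destination):
    while True:
        resp = requests.get(
            SOURCE + "/v2/keys" + PREFIX,
            params={"wait": "true", "recursive": "true", "waitIndex": str(wait_index)},
            timeout=None,
        )
        body = resp.json()
        if body.get("errorCode") == 401:  # EventIndexCleared: we fell too far behind
            # Read the full current state of the prefix. The X-Etcd-Index header
            # tells us which cluster index this snapshot is consistent with.
            snapshot = requests.get(SOURCE + "/v2/keys" + PREFIX,
                                    params={"recursive": "true"})
            resync_destination(snapshot.json()["node"])   # reconcile the destination
            wait_index = int(snapshot.headers["X-Etcd-Index"]) + 1
            continue
        handle_event(body["node"])
        wait_index = body["node"]["modifiedIndex"] + 1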

Scorecard

Incident Engagement ScoreCard (answers are yes/no; notes follow in parentheses)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? no
  • Were pages routed to the correct sub-team(s)? no
  • Were pages routed to online (business hours) engineers? (Answer "no" if engineers were paged after business hours.) no

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no (There were only two responders and no user impact, so we didn't bother with a doc.)
  • Was a public wikimediastatus.net entry created? no (No user impact, so a status page entry wasn't appropriate.)
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.) no (T317537)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 7