Incidents/2025-05-29 OpenSearch clusters unavailable
document status: draft
Summary
| Incident ID | 2025-05-29 OpenSearch clusters unavailable | Start | 2025-05-29 21:15 |
|---|---|---|---|
| Task | T395546 | End | Ongoing |
| People paged | 2 | Responder count | 2 |
| Coordinators | David Causse | Affected metrics/SLOs | |
| Impact | There was no user impact, as the datacenter was depooled at the time. | | |
Timeline
2025-05-28 15:17 inflatador (aka bking, Data Platform SRE), bans 15 elastic/opensearch hosts (SAL link) in preparation for decommission (Phab task).
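In Elasticsearch/OpenSearch, "banning" a node typically means adding it to the cluster-level shard allocation exclusion list, which asks the cluster to drain shards off the named nodes. This is best-effort: a shard that has no other eligible home simply stays where it is, which is consistent with the failure mode below. A minimal sketch of the settings payload involved (host names here are illustrative, not the actual banned list):

```python
import json

# Illustrative node names; the real operation targeted elastic/cirrussearch10[53-67].
banned = [
    "elastic1053-production-search-eqiad",
    "elastic1054-production-search-eqiad",
]

# Allocation exclusion is best-effort: the cluster tries to move shards off
# these nodes, but shards with no other eligible host simply stay put.
settings = {
    "persistent": {
        "cluster.routing.allocation.exclude._name": ",".join(banned)
    }
}

body = json.dumps(settings)
# This payload would be applied roughly as:
#   curl -XPUT localhost:9200/_cluster/settings \
#        -H 'Content-Type: application/json' -d "$body"
print(body)
```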
2025-05-28 21:08 inflatador merges this Puppet patch, which removes access to the cluster for all hosts targeted by the patch (elastic/cirrussearch10[53-67]). Because some of these hosts still hold primary shards, removing the hosts causes ~60 indices to go into red (unwriteable) state.
2025-05-28 21:27 CirrusStreamingUpdaterFlinkJobUnstable alerts begin to fire, as the update pipeline is unable to write to these indices.
2025-05-29 07:24 dcausse starts troubleshooting (source: #wikimedia-search IRC)
2025-05-29 08:56 dcausse opens this Slack thread in #data-platform-sre. btullis (Data Platform SRE) begins to assist.
2025-05-29 09:29 dcausse deletes red indices, which restores the cluster to green (writeable) status. The update pipeline begins to recover.
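The recovery step above amounts to listing the red (unwriteable) indices and deleting them so the cluster can return to green. A sketch of that triage, assuming a response in the shape of `GET _cat/indices?format=json` (index names are sample data, not the actual ~60 affected indices):

```python
import json

# Sample data shaped like GET _cat/indices?format=json
# (illustrative index names only).
cat_indices = json.loads("""[
  {"health": "red",   "status": "open", "index": "enwiki_content_1234"},
  {"health": "green", "status": "open", "index": "dewiki_general_5678"},
  {"health": "red",   "status": "open", "index": "frwiki_titlesuggest_9012"}
]""")

# Collect the indices whose health is red.
red = [row["index"] for row in cat_indices if row["health"] == "red"]

# Each red index would then be deleted, e.g.:
#   curl -XDELETE localhost:9200/<index>
# Deleting red indices restores cluster health, at the cost of
# having to rebuild those indices afterwards.
for name in red:
    print(f"DELETE /{name}")
```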
2025-05-29 09:47 dcausse begins work on restoring the broken indices.
2025-05-29 14:50 all indices are recovered. Incident ends.
Detection
Human or automated alert? Human (dcausse)
Alerts that fired: CirrusStreamingUpdaterFlinkJobUnstable (see above)
Were the alerts appropriate? Yes, but many alerts did not fire because they were suppressed; they were suppressed because we were doing maintenance work on the cluster (ref https://phabricator.wikimedia.org/T388610)
Conclusions
What went well?
What went poorly?
- Banning the nodes didn't work as expected (banning is a best-effort action)
- No maintenance plan/runbook for a fairly large operation (we probably won't ever need to decommission more than 5 of the same type of server at the same time again)
- Important tools (snapshots, dashboards and logging) were not available or insufficiently documented
Where did we get lucky?
- The datacenter was depooled, so there was no user impact
Links to relevant documentation
Actionables
- T395356 Re-enable snapshot repos in production Cirrussearch
- T392222 Create ops-focused OpenSearch dashboard
- T395571 Verify/fix Logstash pipeline for Search Platform-owned OpenSearch clusters
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tags to these tasks.
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | Y | |
| | Were the people who responded prepared enough to respond effectively? | N | |
| | Were fewer than five people paged? | N/A | |
| | Were pages routed to the correct sub-team(s)? | N/A | |
| | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | N/A | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | N/A | |
| | Was a public wikimediastatus.net entry created? | N/A | |
| | Is there a phabricator task for the incident? | T395546 | |
| | Are the documented action items assigned? | N | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | Y | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | N | There are some tasks which would have helped with troubleshooting/recovery |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | N | |
| | Did existing monitoring notify the initial responders? | Y | |
| | Were the engineering tools that were to be used during the incident available and in service? | N | |
| | Were the steps taken to mitigate guided by an existing runbook? | N | |
| | Total score (count of all "yes" answers above) | 3 | |