Incidents/2025-05-29 OpenSearch clusters unavailable

document status: draft

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2025-05-29 OpenSearch clusters unavailable
Start: 2025-05-29 21:15
End: Ongoing
Task: T395546
People paged: 2
Responder count: 2
Coordinators: David Causse
Affected metrics/SLOs:
Impact: There was no user impact, as the datacenter was depooled at the time.

Timeline

2025-05-28 15:17 inflatador (aka bking, Data Platform SRE) bans 15 elastic/opensearch hosts (SAL link) in preparation for decommission (Phab task).
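
The ban itself was applied with the search platform's usual tooling. As a rough, illustrative sketch only, banning nodes on an OpenSearch cluster comes down to an allocation-exclude setting like the one below (the cluster endpoint and node-name patterns are assumptions, not the actual values used):

    # Sketch: ban nodes from an OpenSearch/Elasticsearch cluster by name.
    # Endpoint and node-name patterns are placeholders, not the real values.
    import requests

    CLUSTER = "https://localhost:9200"       # assumption: cluster endpoint
    BANNED = "elastic1053*,elastic1067*"     # assumption: node name patterns

    # Excluding nodes by name asks the cluster to relocate shards off them.
    # This is best-effort: shards move only if the remaining nodes can take
    # them (disk watermarks, allocation rules, shard counts, etc.).
    resp = requests.put(
        f"{CLUSTER}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": BANNED}},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())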

2025-05-28 21:08 inflatador merges this Puppet patch, which removes access to the cluster for all hosts targeted by the patch (elastic/cirrussearch10[53-67]). Because some of these hosts still hold primary shards, removing their access causes ~60 indices to go into a red (unwriteable) state.
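
A red index is one with at least one unassigned primary shard. A quick way to see which indices were affected is sketched below (the cluster endpoint is an assumption):

    # Sketch: list the indices that went red after the hosts lost access.
    import requests

    CLUSTER = "https://localhost:9200"  # assumption: cluster endpoint

    # _cat/indices?health=red returns only indices with unassigned primaries.
    red = requests.get(
        f"{CLUSTER}/_cat/indices",
        params={"health": "red", "format": "json"},
        timeout=30,
    ).json()

    print(f"{len(red)} red indices")
    for idx in red:
        print(idx["index"], idx["status"])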

2025-05-28 21:27 CirrusStreamingUpdaterFlinkJobUnstable alerts begin to fire, as the update pipeline is unable to write to these indices.

2025-05-29 07:24 dcausse starts troubleshooting (source: #wikimedia-search IRC).

2025-05-29 08:56 dcausse opens this Slack thread in #data-platform-sre. btullis (Data Platform SRE) begins to assist.

2025-05-29 09:29 dcausse deletes red indices, which restores the cluster to green (writeable) status. The update pipeline begins to recover.
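
Deleting a red index is destructive and was only acceptable here because the indices could be rebuilt or restored afterwards. A sketch of the step (index names and endpoint are placeholders):

    # Sketch: delete the red indices so the cluster can return to green.
    import requests

    CLUSTER = "https://localhost:9200"                    # assumption
    red_indices = ["example_index_a", "example_index_b"]  # placeholders

    for name in red_indices:
        # DELETE /<index> removes the index and its (partial) shards; once no
        # red indices remain, the cluster health returns to green.
        requests.delete(f"{CLUSTER}/{name}", timeout=60).raise_for_status()
        print("deleted", name)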

2025-05-29 09:47 dcausse begins work on restoring the broken indices.
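
The restores were done from snapshots; sketched below with a hypothetical repository and snapshot name (the actual names and tooling differ):

    # Sketch: restore deleted indices from a snapshot, then wait for green.
    import requests

    CLUSTER = "https://localhost:9200"               # assumption
    REPO, SNAP = "example_repo", "example_snapshot"  # hypothetical names

    resp = requests.post(
        f"{CLUSTER}/_snapshot/{REPO}/{SNAP}/_restore",
        json={
            "indices": "example_index_a,example_index_b",  # placeholders
            "include_global_state": False,
        },
        timeout=30,
    )
    resp.raise_for_status()

    # Block until all shards are assigned again (or the wait times out).
    health = requests.get(
        f"{CLUSTER}/_cluster/health",
        params={"wait_for_status": "green", "timeout": "30m"},
        timeout=1900,
    ).json()
    print(health["status"])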

2025-05-29 14:50 all indices are recovered. Incident ends.

Detection

Human or automated alert? Human (dcausse)

Alerts that fired: CirrusStreamingUpdaterFlinkJobUnstable (see above)

Were the alerts appropriate? Yes, but many alerts did not fire because they were suppressed during maintenance work on the cluster (ref https://phabricator.wikimedia.org/T388610).

Conclusions

What went well?

What went poorly?

  • Banning the nodes didn't work as expected (banning is a best-effort action); a check that would have caught the leftover shards is sketched after this list
  • No maintenance plan/runbook for a fairly large operation (we probably won't ever need to decommission more than 5 of the same type of server at the same time again)
  • Important tools (snapshots, dashboards and logging) were not available or insufficiently documented
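
As a follow-up to the first item, a pre-merge check along these lines could have confirmed whether the ban had actually drained the hosts before their access was removed (endpoint and host names are assumptions):

    # Sketch: verify no shards remain on hosts about to lose cluster access.
    import requests

    CLUSTER = "https://localhost:9200"            # assumption
    DECOM_HOSTS = {"elastic1053", "elastic1054"}  # hypothetical subset

    shards = requests.get(
        f"{CLUSTER}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,node"},
        timeout=30,
    ).json()

    leftover = [s for s in shards
                if s["node"] and any(s["node"].startswith(h) for h in DECOM_HOSTS)]

    if leftover:
        print("NOT SAFE: shards still allocated on hosts to be removed")
        for s in leftover:
            print(f'  {s["index"]} shard {s["shard"]} ({s["prirep"]}) on {s["node"]}')
    else:
        print("Safe to proceed: no shards left on the banned hosts.")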

Where did we get lucky?

  • The datacenter was depooled, so there was no user impact

Actionables

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tags to these tasks.

Scorecard

Incident Engagement ScoreCard
(Answer: yes / no / N/A; notes in parentheses)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Y
  • Were the people who responded prepared enough to respond effectively? N
  • Were fewer than five people paged? N/A
  • Were pages routed to the correct sub-team(s)? N/A
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. N/A

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? N/A
  • Was a public wikimediastatus.net entry created? N/A
  • Is there a phabricator task for the incident? T395546
  • Are the documented action items assigned? N
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? Y

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. N (there are some tasks which would have helped with troubleshooting/recovery)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? N
  • Did existing monitoring notify the initial responders? Y
  • Were the engineering tools that were to be used during the incident, available and in service? N
  • Were the steps taken to mitigate guided by an existing runbook? N

Total score (count of all “yes” answers above): 3