Incidents/2025-03-31 sessionstore unavailability

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-03-31 sessionstore unavailability
Start: 2025-03-31 02:58
End: 2025-03-31 03:36
Task: T390513
People paged: batphone
Responder count: 2
Coordinators: swfrench-wmf
Affected metrics/SLOs: No relevant SLOs exist
Impact: Edits were failing for the duration of the outage.

Starting at approximately 02:58, the sessionstore service in both datacenters became unavailable after enough Cassandra nodes had crashed due to disk exhaustion that clients were no longer able to achieve quorum.
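
For reference, "quorum" here means a strict majority of the replicas for a token range. A minimal sketch of the arithmetic, where the replication factor is an illustrative assumption rather than a value taken from the incident:

    def quorum(replication_factor: int) -> int:
        # A quorum is a strict majority of replicas: floor(RF / 2) + 1.
        return replication_factor // 2 + 1

    def can_serve(live_replicas: int, replication_factor: int) -> bool:
        # A QUORUM read/write succeeds only while a majority of the
        # replicas for the affected token range are still up.
        return live_replicas >= quorum(replication_factor)

    # With an assumed RF of 3, losing one replica degrades but preserves
    # quorum; losing a second makes quorum unreachable and requests fail.
    assert can_serve(live_replicas=2, replication_factor=3)
    assert not can_serve(live_replicas=1, replication_factor=3)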

At 03:32 the decision was made to wipe storage (removing all sessions) in order to return sessionstore to a working state. This was completed at around 03:36 and edits began to flow again, returning to steady state rates by ~03:40.

Timeline

All times in UTC.

  • 02:58 sessionstore service in both DCs begins emitting 500 errors — OUTAGE BEGINS
  • 03:03 MediaWikiEditFailures (session loss) fires (critical)
  • 03:04 SessionStoreErrorRateHigh fires (page)
  • 03:09 swfrench responds / investigation begins
  • 03:14 swfrench escalates to urandom to investigate the Cassandra aspect
  • 03:18 Cassandra unavailability identified (many nodes are down, no longer able to achieve quorum)
  • 03:22 disk space exhaustion identified as cause, along with concerning upward trend in utilization since ~2025-03-10
  • 03:32 decision is made to wipe storage in order to restore service
  • 03:36 storage is wiped, Cassandra is restarted, edits begin to flow — OUTAGE ENDS
[Figures: MediaWiki sessionstore edit rate during the incident; storage utilization trend leading up to the event; rate of set & delete operations; sessionstore request rates leading up to the event]

Detection

SessionStoreErrorRateHigh (high 5xx rate) fired, resulting in a page (sent to batphone). Paging on error rate seems correct, but the 5xx errors returned by sessionstore were caused by the complete outage of many Cassandra nodes, so a dedicated (critical) alert on Cassandra node availability could have provided additional context (i.e. "critical number of Cassandra node failures" versus "many HTTP errors"). Moreover, the node failures were themselves the result of a lack of free disk space, and the rate of storage growth preceding them made the timing of the outage predictable. An alert on high storage utilization, firing well in advance of the node failures, would have provided an opportunity to head off an impacting incident before it began.
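
As a sketch of what such a check could look like, the snippet below projects time-to-exhaustion by fitting a line to recent utilization samples, the same linear-extrapolation idea behind Prometheus's predict_linear(); all names, numbers, and thresholds are hypothetical:

    def hours_until_full(samples: list[tuple[float, float]], capacity: float) -> float:
        # Least-squares slope of (unix_timestamp, bytes_used) samples.
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_u = sum(u for _, u in samples) / n
        num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
        den = sum((t - mean_t) ** 2 for t, _ in samples)
        slope = num / den  # bytes per second
        if slope <= 0:
            return float("inf")  # utilization flat or shrinking
        return (capacity - samples[-1][1]) / slope / 3600

    # Hypothetical samples: one per day, growing ~60 GB/day on a ~1 TB device.
    samples = [(day * 86400.0, 500e9 + 60e9 * day) for day in range(7)]
    if hours_until_full(samples, capacity=960e9) < 7 * 24:
        print("page: disk projected to fill within a week")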

Splunk On Call alerts

https://portal.victorops.com/ui/wikimedia/incident/5917/details

User reports

Conclusions

The obvious (proximal) cause was an increase in session writes that filled the storage devices. The distal cause, however, appears to be an aberrant workload consisting of session overwrites at a high rate. Cassandra's storage is log-structured: an overwrite does not happen in place, and the overwritten values must be garbage-collected during compaction. The rate of overwrites exceeded what could be efficiently collected before space ran out.
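
A toy model of that failure mode, with all rates hypothetical and chosen only to illustrate the mechanism:

    # Live data stays flat because every write replaces an existing session,
    # but each overwrite is an append, and the dead copies persist on disk
    # until compaction garbage-collects them.
    live_gb = 200               # actual session data; unchanged by overwrites
    overwrite_gb_per_hour = 60  # appended copies of already-stored sessions
    reclaim_gb_per_hour = 40    # dead data compaction can remove per hour

    disk_gb = live_gb
    for _ in range(24 * 21):    # ~3 weeks, the window seen in this incident
        disk_gb += overwrite_gb_per_hour
        disk_gb -= min(reclaim_gb_per_hour, disk_gb - live_gb)

    print(f"{disk_gb:.0f} GB on disk for {live_gb} GB of live data")
    # Net growth is 20 GB/hour: on-disk usage climbs without bound until the
    # devices fill, no matter how over-provisioned the cluster is for live data.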

The investigation continues, but the rollout of SUL3 correlates strongly.

Cluster capacity

Prior to the incident, we viewed the sessionstore cluster as wildly over-provisioned. This incident, however, has demonstrated that an aberrant workload has the potential to create rapid, unsustainable growth.

Observability

FIXME: Do.

What went well?

  • Automated alerting surfaced the influx of sessionstore service errors soon after onset.
  • Responders were able to identify the (proximal) cause and mitigate within ~30m of the first page.

What went poorly?

  • There was no automated alerting in place for sessionstore Cassandra node disk utilization and / or growth rate that would have given us the opportunity to intervene before the start of user impact.
  • It was, and continues to be, challenging to understand how various factors contributed to the elevated sessionstore write rates (workload changes, SUL3 migration, etc.).
  • The mitigation for this (or a future) disk exhaustion incident (truncation) is itself disruptive, though less so than a sustained edit outage.
  • The incident occurred during Batphone hours (early Monday UTC), which contributed to a delayed response.

Where did we get lucky?

  • Responders, including one with expertise in Cassandra, were available to work on the incident despite the awkward time of day.

Actionables

  • Page when disk utilization trend becomes unreasonable (task T390630; done)
  • Increase sessionstore storage capacity (task T391544)
  • Sessionstore namespacing (task T392170)
  • Sessionstore workload observability (task T392182)
  • Identify the cause of high rate of storage growth in sessionstore Cassandra (task T390514)
  • Alert (non-paging) on Cassandra error rate(?)
  • Page/alert on Cassandra nodes entering a downed state(?)

Scorecard

Incident Engagement ScoreCard

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? No
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? No (Batphone page)
  • Were pages routed to the correct sub-team(s)? No
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. No

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? Yes
  • Was a public wikimediastatus.net entry created? Yes
  • Is there a phabricator task for the incident? Yes
  • Are the documented action items assigned? Yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? Yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. Yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? Yes
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? No

Total score (count of all “yes” answers above): 10