
Incidents/2025-03-31 sessionstore unavailability


document status: draft

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2025-03-31 sessionstore unavailability
  • Task: T390513
  • Start: 2025-03-31 02:58
  • End: 2025-03-31 03:36
  • People paged: batphone
  • Responder count: 2
  • Coordinators: swfrench-wmf
  • Affected metrics/SLOs: No relevant SLOs exist
  • Impact: Edits were failing for the duration of the outage.

Starting at approximately 02:58, the sessionstore service in both datacenters became unavailable after enough nodes had crashed due to disk exhaustion that clients were no longer able to achieve quorum.
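
For context on the quorum failure: Cassandra requires a strict majority of the replicas for a token range in order to serve quorum reads and writes, so once enough nodes are down, every request at quorum consistency fails even though some replicas remain up. A minimal sketch of the arithmetic (the replication factor of 3 below is illustrative, not a statement of the sessionstore cluster's actual configuration):

    def quorum(replication_factor: int) -> int:
        # Cassandra quorum: a strict majority of the replicas for a token range.
        return replication_factor // 2 + 1

    # With a hypothetical replication factor of 3, quorum is 2, so losing two of
    # the three replicas for a range makes quorum reads and writes fail even
    # though one replica is still available.
    print(quorum(3))  # -> 2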

At 03:32 the decision was made to wipe storage (removing all sessions) in order to return sessionstore to a working state. This was completed at around 03:36, and edits began to flow again, returning to steady-state rates by ~03:40.

Timeline

All times in UTC.

  • 02:58 sessionstore service in both DCs begins emitting 500 errors — OUTAGE BEGINS
  • 03:03 MediaWikiEditFailures (session loss) fires (critical)
  • 03:04 SessionStoreErrorRateHigh fires (page)
  • 03:09 swfrench responds / investigation begins
  • 03:14 swfrench escalates to urandom to investigate the Cassandra aspect
  • 03:18 Cassandra unavailability identified (many nodes are down, no longer able to achieve quorum)
  • 03:22 disk space exhaustion identified as cause, along with concerning upward trend in utilization since ~2025-03-10
  • 03:32 decision is made to wipe storage in order to restore service
  • 03:36 storage is wiped, Cassandra is restarted, edits begin to flow — OUTAGE ENDS
[Figures: MediaWiki sessionstore edit rate during the incident; storage utilization trend leading up to the event; rate of set & delete operations; sessionstore request rates leading up to the event]

Detection

SessionStoreErrorRateHigh (high 5xx rate) fired, resulting in a page (sent to batphone). Paging on error rate seems correct, but the 5xx errors returned by sessionstore were caused by the complete outage of many Cassandra nodes, so a (critical) alert on node availability could have provided additional context ("critical number of Cassandra node failures" rather than just "many HTTP errors"). Moreover, the node failures were themselves the result of a lack of free disk space, and the rate of storage growth over the preceding weeks made the timing of the outage predictable. An alert on high storage utilization, fired well in advance of the node failures, would have provided an opportunity to head off an impacting incident before it began.
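
A storage-trend alert of this kind amounts to extrapolating recent growth and paging while there is still time to act. As a rough sketch only (the sampling window, seven-day horizon, and function names below are illustrative assumptions, not the rule actually deployed for T390630):

    from datetime import timedelta

    def projected_time_to_full(samples, capacity_bytes):
        """Least-squares linear fit over (unix_ts, bytes_used) samples; returns
        the projected time until the disk fills, or None if usage is not growing."""
        n = len(samples)
        if n < 2:
            return None
        ts = [t for t, _ in samples]
        ys = [y for _, y in samples]
        t_mean, y_mean = sum(ts) / n, sum(ys) / n
        denom = sum((t - t_mean) ** 2 for t in ts)
        if denom == 0:
            return None
        slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / denom  # bytes/sec
        if slope <= 0:
            return None
        return timedelta(seconds=(capacity_bytes - ys[-1]) / slope)

    def should_page(samples, capacity_bytes, horizon=timedelta(days=7)):
        """Page if, at the current growth rate, the disk fills within the horizon."""
        remaining = projected_time_to_full(samples, capacity_bytes)
        return remaining is not None and remaining <= horizon

In a Prometheus-based setup, the same idea is commonly expressed with a predict_linear() expression over the filesystem usage metrics.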

Splunk On Call alerts

https://portal.victorops.com/ui/wikimedia/incident/5917/details

User reports

Conclusions

The obvious (proximal) cause was an increase in session writes that filled the storage devices. The distal cause, though, appears to be an aberrant workload consisting of session overwrites at a high rate. Cassandra's storage is log-structured: an overwrite does not happen in place, and the overwritten values must be garbage collected during compaction. The rate of overwrites exceeded what compaction could efficiently reclaim before the disks filled.
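
To make the mechanism concrete, here is a toy model (all numbers and the reclaim behaviour are illustrative assumptions, not measurements from the sessionstore cluster): each overwrite appends a new copy of the session and obsoletes the previous one, but the obsolete copies are reclaimed only when compaction runs, so on-disk usage climbs even though the amount of live data stays flat.

    def on_disk_usage(hours, overwrites_per_hour, session_bytes,
                      compaction_interval_hours, reclaim_fraction):
        """Toy model of a log-structured store under an overwrite-heavy workload.
        Writes append; obsolete copies are reclaimed only at compaction, and each
        compaction pass reclaims only a fraction of them (e.g. because it touches
        a subset of SSTables)."""
        disk = 0.0       # total bytes on disk (live + obsolete copies)
        obsolete = 0.0   # bytes eligible for reclamation at the next compaction
        series = []
        for hour in range(1, hours + 1):
            written = overwrites_per_hour * session_bytes
            disk += written
            obsolete += written  # every overwrite leaves a dead previous copy behind
            if hour % compaction_interval_hours == 0:
                reclaimed = obsolete * reclaim_fraction
                disk -= reclaimed
                obsolete -= reclaimed
            series.append(disk)
        return series

    # Hypothetical rates: when reclamation lags behind the overwrite rate, the
    # floor of dead data rises steadily, consistent with the kind of upward
    # utilization trend described above.
    print([round(b / 1e9, 1) for b in on_disk_usage(48, 1_000_000, 2_000, 6, 0.6)[::12]])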

The investigation continues, but the rollout of SUL3 correlates strongly with the onset of the aberrant workload.

Cluster capacity

Prior to the incident, we viewed the sessionstore cluster as wildly over-provisioned. This incident, however, has demonstrated that an aberrant workload can create rapid, unsustainable growth.

Observability

FIXME: Do.

What went well?


What went poorly?


Where did we get lucky?



Actionables

  • Page when disk utilization trend becomes unreasonable (task T390630) (done)
  • Increase sessionstore storage capacity (task T391544)
  • Sessionstore namespacing (task T392170)
  • Sessionstore workload observability (task T392182)
  • Identify the cause of high rate of storage growth in sessionstore Cassandra (task T390514)
  • Alert (non-paging) on Cassandra error rate(?)
  • Page/alert on Cassandra nodes entering a downed state(?)

Scorecard

Incident Engagement ScoreCard (answer each question yes/no, with notes)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents?
  • Were the people who responded prepared enough to respond effectively?
  • Were fewer than five people paged?
  • Were pages routed to the correct sub-team(s)?
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
  • Was a public wikimediastatus.net entry created?
  • Is there a phabricator task for the incident?
  • Are the documented action items assigned?
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
  • Were the people responding able to communicate effectively during the incident with the existing tooling?
  • Did existing monitoring notify the initial responders?
  • Were the engineering tools that were to be used during the incident available and in service?
  • Were the steps taken to mitigate guided by an existing runbook?

Total score (count of all “yes” answers above)