Incidents/2025-03-31 sessionstore unavailability
document status: draft
Summary
| Incident ID | 2025-03-31 sessionstore unavailability | Start | 2025-03-31 02:58 |
|---|---|---|---|
| Task | T390513 | End | 2025-03-31 03:36 |
| People paged | batphone | Responder count | 2 |
| Coordinators | swfrench-wmf | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | Edits were failing for the duration of the outage. | ||
Starting at approximately 02:58, the sessionstore service in both datacenters became unavailable after enough nodes had crashed due to disk exhaustion that clients were no longer able to achieve quorum.
At 03:32 the decision was made to wipe storage (removing all sessions) in order to return sessionstore to a working state. This was completed at around 03:36 and edits began to flow again, returning to steady state rates by ~03:40.
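For context on the quorum failure mode, here is a minimal sketch of the generic Cassandra QUORUM rule (illustrative only; not the actual cluster topology, and it assumes the worst case where every downed node holds a replica of the requested key):

```python
def quorum(replication_factor):
    """Minimum replicas that must respond for a QUORUM request."""
    return replication_factor // 2 + 1

def can_serve(replication_factor, nodes_down):
    """True if enough replicas remain up to satisfy QUORUM.

    Simplification: assumes all downed nodes hold replicas of the
    requested key (the worst case for availability).
    """
    return replication_factor - nodes_down >= quorum(replication_factor)

# With RF=3, one downed replica is tolerated; two are not.
print(can_serve(3, 1))  # True
print(can_serve(3, 2))  # False
```

Once disk exhaustion had crashed enough nodes, requests could no longer assemble a majority of replicas, and sessionstore began returning 500s.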
Timeline
All times in UTC.
- 02:58 sessionstore service in both DCs begins emitting 500 errors — OUTAGE BEGINS
- 03:03 MediaWikiEditFailures (session loss) fires (critical)
- 03:04 SessionStoreErrorRateHigh fires (page)
- 03:09 swfrench responds / investigation begins
- 03:14 swfrench escalates to urandom for investigating the Cassandra aspect
- 03:18 cassandra unavailability identified (many nodes are down, no longer able to achieve quorum)
- 03:22 disk space exhaustion identified as cause, along with concerning upward trend in utilization since ~2025-03-10
- 03:32 decision is made to wipe storage in order to restore service
- 03:36 storage is wiped, Cassandra is restarted, edits begin to flow — OUTAGE ENDS
(Graphs omitted: MediaWiki, Sessionstore.)
Detection
SessionStoreErrorRateHigh (high 5xx rate) fired, resulting in a page (sent to batphone). Paging on error rate seems correct, but the 5xx errors returned by sessionstore were caused by the complete outage of many Cassandra nodes, so a dedicated (critical) alert could have provided additional context (i.e. "critical number of Cassandra node failures" rather than just "many HTTP errors"). Moreover, the node failures were themselves the result of disk space exhaustion, and the rate of storage growth preceding them made the timing of the outage predictable. An alert on high storage utilization, raised well in advance of the node failures, would have provided an opportunity to head off an impacting incident before it began.
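As a sketch of what such predictive alerting could look like (hypothetical function names and thresholds; the actual alert is tracked separately in T390630), a linear extrapolation over recent disk-utilization samples can estimate time-to-full and page while there is still headroom:

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until the disk fills, via a least-squares slope.

    samples: list of (hour, bytes_used) observations, oldest first.
    capacity_bytes: total disk capacity.
    Returns float('inf') if usage is flat or shrinking.
    """
    n = len(samples)
    xs = [t for t, _ in samples]
    ys = [u for _, u in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, ys)) / denom  # bytes per hour
    if slope <= 0:
        return float('inf')
    return (capacity_bytes - ys[-1]) / slope

# Example: steady 10 GB/day growth on a 1 TB disk already 80% full.
samples = [(h, 800e9 + h * (10e9 / 24)) for h in range(0, 72, 6)]
print(hours_until_full(samples, 1e12))  # ≈ 414 hours (~17 days) of runway
```

Alerting when the projected runway drops below, say, two weeks would have surfaced the ~2025-03-10 growth trend long before quorum loss.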
Splunk On Call alerts
https://portal.victorops.com/ui/wikimedia/incident/5917/details
User reports
Conclusions
The obvious (proximal) cause was an increase in session writes that filled the storage devices. The distal cause, however, appears to be an aberrant workload consisting of session overwrites at a high rate. Cassandra's storage is log-structured: an overwrite does not happen in place, and the overwritten values must be garbage-collected during compaction. The rate of overwrites exceeded what compaction could reclaim before the disks ran out of space.
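A toy model of this dynamic (illustrative numbers only, not measurements from the cluster): when every write is an overwrite, live data stays constant, but obsolete copies accumulate at the write rate and are only removed at the compaction reclaim rate, so on-disk usage grows whenever overwrites outpace compaction.

```python
def disk_usage_gb(write_rate_gb_h, reclaim_rate_gb_h, hours, live_gb):
    """On-disk size when every write is an overwrite of existing data.

    Live data stays at live_gb; obsolete copies are appended at the
    write rate and removed at the compaction reclaim rate.
    """
    backlog = 0.0
    for _ in range(hours):
        backlog += write_rate_gb_h                   # obsolete copies appended
        backlog -= min(backlog, reclaim_rate_gb_h)   # compaction reclaims
    return live_gb + backlog

# Compaction keeps up with 5 GB/h of overwrites: steady state.
print(disk_usage_gb(5, 10, 24 * 21, 100))   # 100.0
# At 15 GB/h the backlog grows ~5 GB/h, filling any disk within weeks.
print(disk_usage_gb(15, 10, 24 * 21, 100))  # 2620.0
```

This is why the cluster looked healthy on live-data size while the disks filled: the growth was almost entirely not-yet-collected garbage.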
Cluster capacity
Prior to the incident, we viewed the sessionstore cluster as wildly over-provisioned. This incident, however, demonstrated that an aberrant workload has the potential to create rapid, unsustainable growth.
Observability
What went well?
- Automated alerting surfaced the influx of sessionstore service errors soon after onset.
- Responders were able to identify the (proximal) cause and mitigate within ~ 30m of the first page.
What went poorly?
- There was no automated alerting in place for sessionstore Cassandra node disk utilization and / or growth rate that would have given us the opportunity to intervene before the start of user impact.
- It was, and continues to be, challenging to understand how various factors contributed to the elevated sessionstore write rates (workload changes, SUL3 migration, etc.).
- The mitigation for this (or a future) disk exhaustion incident (truncation) is itself disruptive, though less so than a sustained edit outage.
- The incident occurred during Batphone hours (early Monday UTC), which contributed to delayed response.
Where did we get lucky?
- Responders, including one with expertise in Cassandra, were available to work on the incident despite the awkward time of day.
Links to relevant documentation
Actionables
- Page when disk utilization trend becomes unreasonable (task T390630)
- Done: Increase sessionstore storage capacity (task T391544)
- Sessionstore namespacing (task T392170)
- Sessionstore workload observability (task T392182)
- Identify the cause of high rate of storage growth in sessionstore Cassandra (task T390514)
- Alert (non-paging) on Cassandra error rate(?)
- Page/alert on Cassandra nodes entering a downed state(?)
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | No | |
| | Were the people who responded prepared enough to respond effectively? | Yes | |
| | Were fewer than five people paged? | No | Batphone page |
| | Were pages routed to the correct sub-team(s)? | No | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | No | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | Yes | |
| | Was a public wikimediastatus.net entry created? | Yes | |
| | Is there a phabricator task for the incident? | Yes | |
| | Are the documented action items assigned? | Yes | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | Yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | Yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | Yes | |
| | Did existing monitoring notify the initial responders? | Yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | Yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | No | |
| | Total score (count of all “yes” answers above) | 10 | |