Incidents/2022-09-15 sessionstore quorum issues
document status: draft
Summary
Incident ID | 2022-09-15 sessionstore quorum issues | Start | 2022-09-15 12:25
---|---|---|---
Task | | End | 2022-09-15 12:38
People paged | 0 | Responder count | 2
Coordinators | hnowlan | Affected metrics/SLOs | No SLO exists. Central login and session loss metrics spiked as a result.
Impact | Roughly 125 login errors per minute and 215 session loss errors. | |
During routine maintenance on a sessionstore Cassandra node, the Kask service developed a split-brain issue in which it did not recognise the newly re-added Cassandra node in the cluster, despite the node being healthy from a Cassandra perspective. This led to Kask incorrectly failing to achieve read and write quorum on the cluster, disrupting session-related functions on the wikis.
This issue was ultimately thought to be the result of an outstanding bug in gocql that leads to an inconsistent view of cluster health.
The initial mitigation consisted of removing the newly imaged host from the cluster. The medium-term fix to regain stability consisted of depooling eqiad (where the affected host was situated), restarting the sessionstore1001 Cassandra instance, ensuring that the cluster was healthy, restarting Kask, and then repooling eqiad.
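For background on how a stale driver-side view translates into quorum failures: Kask talks to Cassandra through the gocql driver, which maintains its own view of the ring and uses it when routing quorum reads and writes. Below is a minimal sketch, assuming a Kask-like Go service; it is not Kask's actual code or configuration, the keyspace and table names are hypothetical, and the contact points are simply the addresses from the log excerpt in the Detection section.

```go
// Minimal sketch of a gocql session configured for quorum reads/writes.
// NOT Kask's actual code; it only shows where quorum consistency is set and
// where failures would surface if the driver's view of the ring went stale.
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Illustrative contact points (taken from the Cassandra log excerpt below).
	cluster := gocql.NewCluster("10.64.48.178", "10.64.32.85", "10.192.16.95")
	cluster.Keyspace = "sessions" // hypothetical keyspace name
	// Quorum within the local datacenter for both reads and writes.
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 500 * time.Millisecond
	// Route queries to replicas in the local DC, preferring token owners.
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.DCAwareRoundRobinPolicy("eqiad"),
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("creating session: %v", err)
	}
	defer session.Close()

	// A quorum write. If the driver believes too few replicas are available
	// (e.g. because it never re-registered a reimaged node), this call fails
	// even though the Cassandra cluster itself is healthy.
	err = session.Query(
		`INSERT INTO values (key, value) VALUES (?, ?)`, // hypothetical table
		"example-key", []byte("example-value"),
	).Exec()
	if err != nil {
		log.Printf("quorum write failed: %v", err)
	}
}
```

In this failure mode, restarting the client process forces the driver to rebuild its host view, which is consistent with the timeline below: no further errors appeared after the Kask roll-restart and repool at 15:27-15:28.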
Timeline
All times in UTC.
- 11:27 hnowlan begins reimage of sessionstore1001 to buster
- 12:25 sessionstore1001 reimage to buster completes, cassandra started
- 12:25 OUTAGE BEGINS
- 12:25 sessionstore begins to fail to achieve write quorum
- 12:37 Reports of ongoing issues with sessions and logins in #wikimedia-operations
- 12:38 hnowlan stops cassandra-a on sessionstore1001, errors subside
- 12:38 OUTAGE ENDS
- 13:00 Incident opened; hnowlan becomes IC.
- 15:17 hnowlan depools sessionstore in eqiad
- 15:22 hnowlan reenables and restarts cassandra on sessionstore1001
- 15:27 hnowlan runs a roll-restart of kask/sessionstore in eqiad
- 15:28 hnowlan repools eqiad - no errors in mediawiki or kask service logs
Detection
The issue was first detected and reported by a human, Tamzin, in #wikimedia-operations. However, several alerts also flagged the degradation:
12:31 <+icinga-wm> PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask
12:35 <+icinga-wm> PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
12:40 <+icinga-wm> PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13
None of these alerts paged. The errors seen in the Cassandra logs were along the lines of the following; they were triggered by Kask's behaviour rather than by any Cassandra-level cluster inconsistency:
DEBUG [prometheus-http-1-1] 2022-09-15 12:26:35,180 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.64.48.178,10.64.32.85,10.192.16.95
Conclusions
What went well?
- Once the issue was identified, resolution was relatively quick (disabling the restored Cassandra instance).
- Eqiad could be depooled easily, and without concerns about codfw capacity, for testing and switchover.
What went poorly?
- Logging for Cassandra and Kask was confusing to debug. Cassandra's logs veer between terse and extremely verbose, so it was hard to separate signal from noise. Kask was the source of the errors, but the split brain it had developed internally manifested in the Cassandra logs despite there being nothing wrong with the Cassandra cluster itself.
- Having a single host per Cassandra rack leaves us with little flexibility in edge cases like this one; the same constraint was felt during recent PDU maintenance. This is not an issue of capacity but one of availability.
- This event occurred just before the WMF SRE Summit and during a leadership offsite, meaning that there were limited responders on hand.
- Either the detected 100% session loss or the spiking centralauth errors should probably have paged.
Where did we get lucky?
- Early human error reporting
Links to relevant documentation
- There is limited documentation on Kask debugging and behaviour.
Actionables
- Expand documentation for Kask debugging
- Incorporate the upstream fix for this issue into Kask
- Alert on Kask error rate
Scorecard
Section | Question | Answer (yes/no) | Notes
---|---|---|---
People | Were the people responding to this incident sufficiently different than the previous five incidents? | Yes |
 | Were the people who responded prepared enough to respond effectively? | Yes |
 | Were fewer than five people paged? | No |
 | Were pages routed to the correct sub-team(s)? | No |
 | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | Yes |
Process | Was the incident status section actively updated during the incident? | Yes |
 | Was the public status page updated? | No |
 | Is there a phabricator task for the incident? | No |
 | Are the documented action items assigned? | Yes |
 | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | Yes |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | No |
 | Were the people responding able to communicate effectively during the incident with the existing tooling? | Yes |
 | Did existing monitoring notify the initial responders? | No |
 | Were the engineering tools that were to be used during the incident available and in service? | Yes |
 | Were the steps taken to mitigate guided by an existing runbook? | No |
 | Total score (count of all “yes” answers above) | 8 |