Incidents/2022-09-15 sessionstore quorum issues


document status: draft

Summary

Incident metadata (see Incident Scorecard)
  Incident ID: 2022-09-15 sessionstore quorum issues
  Start: 2022-09-15 12:25
  End: 2022-09-15 12:38
  Task: (none)
  People paged: 0
  Responder count: 2
  Coordinators: hnowlan
  Affected metrics/SLOs: No SLO exists. centrallogin and session loss metrics spiked as a result.
  Impact: Roughly 125 login errors per minute, 215 session loss errors.

During routine maintenance on a sessionstore Cassandra node, the Kask service developed a split-brain issue in which it did not recognise the newly re-added Cassandra node, despite the node being healthy from a Cassandra perspective. This led to Kask falsely failing to achieve read and write quorum on the cluster, disrupting session-related functions on the wikis.

This issue is ultimately thought to be the result of an outstanding bug in gocql that can leave the driver with an inconsistent view of cluster health.
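As an illustration of the failure mode, the minimal sketch below (not Kask's actual code; the hostnames, keyspace, and table names are invented) shows how a gocql-based service configures quorum consistency on the driver. If the driver's view of the cluster goes stale, as the suspected gocql bug allows, requests at that consistency level can fail even though the cluster is healthy from a Cassandra perspective.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Contact points are illustrative, not the real sessionstore topology.
	cluster := gocql.NewCluster(
		"sessionstore-a.example.wmnet",
		"sessionstore-b.example.wmnet",
		"sessionstore-c.example.wmnet",
	)
	cluster.Keyspace = "sessions"           // assumed keyspace name
	cluster.Consistency = gocql.LocalQuorum // quorum reads/writes, as Kask requires

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	// A LOCAL_QUORUM write. During the incident, requests like this failed
	// with quorum-related errors (surfacing as 500s from Kask) even though
	// the Cassandra ring itself was healthy; the suspected cause is the
	// gocql bug that leaves the driver with a stale view of the cluster.
	err = session.Query(
		`INSERT INTO values (key, value) VALUES (?, ?)`, // illustrative table
		"some-session-key", []byte("opaque-session-blob"),
	).Exec()
	if err != nil {
		fmt.Printf("write failed despite a healthy ring: %v\n", err)
	}
}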

The initial mitigation consisted of removing the newly reimaged host from the cluster. The medium-term fix to regain stability consisted of depooling eqiad (where the affected host was situated), restarting the sessionstore1001 Cassandra instance, ensuring that the cluster was healthy, restarting Kask, and then repooling eqiad.

Timeline

SAL context

[Graph: centrallogin errors]
[Graph: Session loss errors]


All times in UTC.

  • 11:27 hnowlan completes reimage of sessionstore1001 to buster.
  • 12:25 sessionstore1001 reimage to buster completes, cassandra started
  • 12:25 OUTAGE BEGINS
  • 12:25 sessionstore begins to fail to achieve write quorum
  • 12:37 Reports of ongoing issues with sessions and logins in #wikimedia-operations
  • 12:38 hnowlan stops cassandra-a on sessionstore1001, errors subside
  • 12:38 OUTAGE ENDS
  • 13:00 Incident opened. Hnowlan becomes IC.
  • 15:17 hnowlan depools sessionstore in eqiad
  • 15:22 hnowlan reenables and restarts cassandra on sessionstore1001
  • 15:27 hnowlan runs a roll-restart of kask/sessionstore in eqiad
  • 15:28 hnowlan repools eqiad - no errors in mediawiki or kask service logs

Detection

The issue was first human-detected and announced by Tamzin in #wikimedia-operations. However, some alerts also announced the degradation:

12:31 <+icinga-wm> PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask
12:35 <+icinga-wm> PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
12:40 <+icinga-wm> PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13

None of these alerts paged. The errors seen in the Cassandra logs were along the lines of the following; they were triggered by Kask rather than by any Cassandra-level cluster inconsistency:

DEBUG [prometheus-http-1-1] 2022-09-15 12:26:35,180 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.64.48.178,10.64.32.85,10.192.16.95
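For responders unfamiliar with the stack, the sketch below (illustrative only, not Kask's actual error handling) shows how a gocql client can separate two failure modes that look alike in service logs: a coordinator-reported unavailability error, which carries the required and alive replica counts, versus the driver's own host pool having no usable connections. In this incident the ring itself was healthy, which points at the client-side view rather than a genuine shortage of live replicas.

package main

import (
	"errors"
	"log"

	"github.com/gocql/gocql"
)

// classifyQueryError distinguishes a server-side unavailability response from
// a client-side (driver) host-pool problem. It is a debugging aid, not a
// reflection of how Kask itself handles errors.
func classifyQueryError(err error) {
	if err == nil {
		return
	}
	var unavailable *gocql.RequestErrUnavailable
	switch {
	case errors.As(err, &unavailable):
		// Server-side: the coordinator counted fewer live replicas than the
		// requested consistency level requires.
		log.Printf("coordinator unavailable: cl=%v required=%d alive=%d",
			unavailable.Consistency, unavailable.Required, unavailable.Alive)
	case errors.Is(err, gocql.ErrNoConnections):
		// Client-side: the driver has no hosts it considers usable, e.g.
		// because its view of the ring has gone stale.
		log.Printf("driver host pool empty or stale: %v", err)
	default:
		log.Printf("other query error: %v", err)
	}
}

func main() {
	// Usage example with a synthetic error; in a real service this would be
	// the error returned by session.Query(...).Exec().
	classifyQueryError(gocql.ErrNoConnections)
}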

Conclusions

What went well?

  • Once the issue was identified, resolution was relatively quick (disabling the restored Cassandra instance).
  • We were able to depool eqiad easily, without concerns about codfw capacity, for testing and switchover.

What went poorly?

  • Logging for Cassandra and Kask was confusing to debug. Cassandra's logs veer between terse and extremely verbose, making it hard to separate signal from noise. Kask was the source of the errors, but the split brain it had developed internally manifested in the Cassandra logs even though nothing was wrong with the Cassandra cluster itself.
  • Having a single host per Cassandra rack leaves us with little flexibility in edge cases like this one; the same constraint was seen during recent PDU maintenance. This is not an issue of capacity but one of availability.
  • This event occurred just before the WMF SRE Summit and during a leadership offsite, meaning that there were limited responders on hand.
  • Either the detected 100% session loss or the spiking centralauth errors should probably have paged.

Where did we get lucky?

  • Early human error reporting

Links to relevant documentation

  • There is limited documentation on Kask debugging and behaviour.

Actionables

Scorecard

Incident Engagement ScoreCard
Question / Answer (yes/no)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Yes
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? No
  • Were pages routed to the correct sub-team(s)? No
  • Were pages routed to online (business hours) engineers? (Answer "no" if engineers were paged after business hours.) Yes

Process
  • Was the incident status section actively updated during the incident? Yes
  • Was the public status page updated? No
  • Is there a phabricator task for the incident? No
  • Are the documented action items assigned? Yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? Yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.) No
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? No
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? No

Total score (count of all "yes" answers above): 8