Incidents/2023-01-24 sessionstore quorum issues

document status: final

Summary

Incident metadata (see Incident Scorecard)
  Incident ID: 2023-01-24 sessionstore quorum issues
  Start: 20:55
  End: 21:15
  Task: T327815
  People paged: 5
  Responder count: 6
  Coordinators: Brett Cornwall
  Affected metrics/SLOs: No SLO exists. centrallogin and session loss metrics spiked as a result.
  Impact: Wikipedia's session storage suffered an outage of about 15 minutes (eqiad). This caused users to be unable to log in or edit pages.

Session storage is provided by an HTTP service (Kask) that uses Cassandra for persistence. As part of routine maintenance, one of the Cassandra hosts in eqiad (sessionstore1001) was rebooted. While the host was down, its connections were removed (de-pooled) by Kask and requests were rerouted to the remaining two nodes, as expected. However, once the host rejoined the cluster, requests from clients that selected sessionstore1001 as coordinator failed with errors (an inability to achieve LOCAL_QUORUM consistency).
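For illustration, the sketch below (not the actual Kask source; the peer host names, keyspace, and table are invented) shows a Go client configured the way the summary describes: queries run at LOCAL_QUORUM, so two of the three eqiad replicas must acknowledge each request, and a coordinator that cannot reach enough healthy local replicas surfaces the "Cannot achieve consistency level LOCAL_QUORUM" error seen in the logs below.

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Three-node eqiad session storage cluster; only sessionstore1001 is
	// named in this report, the other two host names are placeholders.
	cluster := gocql.NewCluster(
		"sessionstore1001.eqiad.wmnet",
		"sessionstore1002.eqiad.wmnet",
		"sessionstore1003.eqiad.wmnet",
	)
	// LOCAL_QUORUM: 2 of the 3 replicas in the local datacenter must respond.
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connecting to Cassandra: %v", err)
	}
	defer session.Close()

	// If the node acting as coordinator cannot see enough healthy local
	// replicas, the query fails with "Cannot achieve consistency level
	// LOCAL_QUORUM" -- the error Kask logged during this incident.
	// The keyspace and table here are illustrative, not Kask's actual schema.
	err = session.Query(
		`INSERT INTO sessions.values (key, value) VALUES (?, ?)`,
		"example-session-key", "example-session-value",
	).Exec()
	if err != nil {
		log.Printf("write failed: %v", err)
	}
}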

This is likely at least similar to Incidents/2022-09-15 sessionstore quorum issues, if not in fact the same issue.

Timeline

All times in UTC.

...
{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"ea9a0eef-256d-4eb1-bfd5-863a66aacee9"}
{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"58ad97ee-0025-4701-b288-4df39a38eb8a"}
...
...
INFO  [StorageServiceShutdownHook] 2023-01-24 20:50:26,403 Server.java:179 - Stop listening for CQL clients
INFO  [main] 2023-01-24 20:54:52,649 Server.java:159 - Starting listening for CQL clients on /10.64.0.144:9042 (encrypted)...
...
  • 20:50 urandom reboots sessionstore1001.eqiad.wmnet (task T325132)
  • 20:54 Cassandra on sessionstore1001 comes back online; successful wiki edits drop to below half of the expected number (OUTAGE BEGINS)
  • 20:57 TheresNoTime notices drop in successful wiki edits, mentions in #wikimedia-operations
  • 21:07 urandom rolling restarts sessionstore service (Kask)
  • 21:09 taavi issues a manual critical page via VictorOps
  • 21:10 Successful wiki edits climb back up to expected levels (OUTAGE ENDS)
  • 21:14 De-pooling eqiad suggested but not yet executed
  • 21:15 urandom notices cessation of issues, announces to #wikimedia-operations

Detection

Monitoring did not alert; a manual page was issued once TheresNoTime noticed the problem and users started reporting issues:

21:57:42 <TheresNoTime> Successful wiki edits has just started to drop, users reported repeated "loss of session data" persisting a refresh

As no alerts fired, the problem was debugged manually; it took little time to identify session storage as the source of the issue.
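As an illustration of that kind of manual triage (an assumption about workflow, not a record of what responders actually ran), the structured Kask log lines excerpted in the Timeline can be filtered for quorum errors with a small Go program; the JSON field names below match those excerpts, and reading from standard input is assumed.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// kaskLogLine mirrors the fields of the JSON log lines shown in the Timeline.
type kaskLogLine struct {
	Msg       string `json:"msg"`
	AppName   string `json:"appname"`
	Time      string `json:"time"`
	Level     string `json:"level"`
	RequestID string `json:"request_id"`
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var entry kaskLogLine
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // skip anything that is not a JSON log line
		}
		if entry.Level == "ERROR" && strings.Contains(entry.Msg, "LOCAL_QUORUM") {
			fmt.Printf("%s  %s (request_id=%s)\n", entry.Time, entry.Msg, entry.RequestID)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
	}
}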

Conclusions

What went well?

  • Once the issue was identified, resolution was relatively quick

What went poorly?

  • The dramatic spike in detected session loss should probably have triggered an automatic page

Where did we get lucky?

  • TheresNoTime caught the issue before user reports or alerting systems brought it to our attention.

Links to relevant documentation

  • There is limited documentation on Kask debugging and behaviour.

Actionables

  • Set up notifications for an elevated 500 error rate (sessionstore) (task T327960) – Done
  • Notifications from service error logs(?) (task T320401) – Done
  • Determine the root cause of the unavailable errors (i.e. "cannot achieve consistency level") (task T327954) – Done
  • De-pool the datacenter prior to host reboots (as an interim measure until the connection pooling is properly fixed) – Done

Scorecard

Incident Engagement ScoreCard

Question | Answer (yes/no) | Notes

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? | yes
  • Were the people who responded prepared enough to respond effectively? | no | Other people handled the issue.
  • Were fewer than five people paged? | no | Alerts did not fire.
  • Were pages routed to the correct sub-team(s)? | no | Alerts did not fire.
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | Alerts did not fire.

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | There was no Google Doc.
  • Was a public wikimediastatus.net entry created? | yes | https://www.wikimediastatus.net/incidents/05cntb1k1myb
  • Is there a phabricator task for the incident? | yes | phab:T327815
  • Are the documented action items assigned? | no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no | Same failure mode, but manifesting in a different way.

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | no | There weren't open tasks as such; the failure mode was known and considered fixed.
  • Were the people responding able to communicate effectively during the incident with the existing tooling? | yes
  • Did existing monitoring notify the initial responders? | no
  • Were the engineering tools that were to be used during the incident available and in service? | yes
  • Were the steps taken to mitigate guided by an existing runbook? | no

Total score (count of all “yes” answers above): 5