
Incidents/2025-11-11 WMCS toolsdb primary down

From Wikitech

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-11-11 WMCS toolsdb primary down
Start: 2025-11-11 13:30
End: 2025-11-11 13:49
Task: T409922
People paged: 1
Responder count: 2
Coordinators: Filippo Giunchedi
Affected metrics/SLOs: No relevant SLOs exist
Impact: ToolsDB unavailable for all tools for about 20 minutes.

Transactions committed just before the crash may have been lost; however, replication was flowing until the crash, so most or all committed transactions should have reached the replica.

ToolsDB crashed while Francesco was investigating the previous crash from a few days earlier (Incidents/2025-11-05 toolsdb primary out of space).

This crash appears to be unrelated, although Francesco suspects that the fallout from the previous crash (several gigabytes of undo logs added to the "ibdata1" file) may have increased the load and played some role in this one.
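As a hedged illustration (these are not commands taken from the incident), the undo-log growth mentioned above can be checked on a MariaDB server along these lines; the variable names assume a reasonably recent MariaDB:

```sql
-- Sketch only: diagnostic queries, not from the incident itself.
-- A growing history list means purge is lagging and undo records are
-- accumulating; without separate undo tablespaces they live in ibdata1,
-- which never shrinks once it has grown.
SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';

-- Confirm whether separate undo tablespaces are in use at all:
SHOW GLOBAL VARIABLES LIKE 'innodb_undo_tablespaces';
```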

Timeline

13:30 <dhinus> toolsdb just crashed while I was testing some config options (very minor ones)

13:30 <dhinus> and it's failing to restart, expect alerts/pages

13:34  <dhinus> I will fail over to tools-db-6 as I was already planning to do it anyway

13:37  Incident opened.  Filippo becomes IC.

13:37  <dhinus> I'm following the guide at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary

13:39  <dhinus> the first crash log is Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [ERROR] [FATAL] InnoDB:  innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch>

13:40  <dhinus> there's a previous one actually Nov 11 13:20:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last check>

13:43  <dhinus> https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/281

13:46  <dhinus> running tofu apply with cookbook

13:47  <dhinus> setting read_only=off on tools-db-6 (the new primary)

13:49  <dhinus> I see write transactions are already happening on the new host

13:49  <dhinus> `sql tools` from a toolforge bastion also works
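The promotion steps in the timeline (disabling read_only, then verifying writes) roughly correspond to the following sketch, run on the replica being promoted. These are illustrative MariaDB statements, not the exact commands from the linked runbook:

```sql
-- Illustrative promotion sketch (see the linked ToolsDB runbook for the
-- authoritative procedure). Run on the replica being promoted.
STOP SLAVE;                    -- stop replicating from the failed primary
RESET SLAVE ALL;               -- discard the old replication configuration
SET GLOBAL read_only = OFF;    -- accept writes: this host is now primary
-- Confirm writes are arriving (this counter should be increasing):
SHOW GLOBAL STATUS LIKE 'Com_insert';
```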

Detection

Francesco was logged in to the server when it crashed, so he noticed immediately.

Conclusions

What went well?

  • Failover to the replica was relatively quick (about 10 minutes) and painless.

What went poorly?

  • If we had failed over to the replica host promptly after the previous crash, this one might have been avoided. However, we deliberately waited because we were not completely sure the previous issue was resolved, and we didn't want it to recur after the failover.

Where did we get lucky?

  • The crash happened during working hours, while Francesco was available. He was able to perform the failover quickly because he wrote the failover procedure and had performed it several times in the past.

Actionables

  • T409922 Crash recovery should not fail
  • T409890 Pt-heartbeat hiera key (profile::wmcs::services::toolsdb::primary_server) should follow the DNS name to detect toolsdb primary (i.e. automate this step)
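The "insufficient innodb_log_file_size" error in the timeline suggests the redo log was too small for the write load at recovery time. A hedged sketch of raising it follows; the size shown is illustrative, not a value used or recommended by the team:

```sql
-- Illustrative only: choose a size based on actual write volume and disk.
-- MariaDB 10.8+ can resize the redo log online:
SET GLOBAL innodb_log_file_size = 4 * 1024 * 1024 * 1024;  -- 4 GiB
-- Older versions require setting it in my.cnf and restarting mysqld.
```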

Scorecard

Incident Engagement ScoreCard (answers are yes/no)

People:
  • Were the people responding to this incident sufficiently different than the previous five incidents? no
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process:
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? no
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling:
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. no
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? no
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? yes

Total score (count of all “yes” answers above): 11