Incidents/2025-11-11 WMCS toolsdb primary down
document status: in-review
Summary
| Incident ID | 2025-11-11 WMCS toolsdb primary down | Start | 2025-11-11 13:30 |
|---|---|---|---|
| Task | T409922 | End | 2025-11-11 13:49 |
| People paged | 1 | Responder count | 2 |
| Coordinators | Filippo Giunchedi | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | ToolsDB unavailable for all tools for about 20 minutes. Some transactions committed just before the crash might have been lost, although replication was flowing until the crash, so it should have replicated most or all of the committed transactions. | | |
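The size of that possibly-lost window can be bounded after the fact by comparing GTID positions on the two hosts; a minimal sketch, assuming ToolsDB's standard MariaDB GTID replication setup:

```sql
-- On the crashed primary (tools-db-4), once it is readable again:
-- the last transaction written to its binary log.
SELECT @@gtid_binlog_pos;

-- On the promoted replica (tools-db-6):
-- the last transaction it had applied from the old primary.
SELECT @@gtid_slave_pos;

-- If the two positions match, every committed transaction replicated;
-- any gap between them is the set of transactions that may have been lost.
```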
ToolsDB crashed while Francesco was investigating the previous crash from a few days before (Incidents/2025-11-05 toolsdb primary out of space).
Apparently this crash was unrelated, although I suspect that the fallout from the previous crash (several gigabytes of undo logs added to the "ibdata1" file) might have increased the load and played some role in this one.
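One way to turn the undo-log suspicion into something measurable is to track the InnoDB purge backlog, since unpurged undo records are what accumulate inside "ibdata1". A sketch of the checks (not something that was run during the incident):

```sql
-- "History list length" under TRANSACTIONS is the number of undo records
-- not yet purged; a value that keeps growing means undo data is piling up.
SHOW ENGINE INNODB STATUS\G

-- The same figure as a queryable counter, on builds where this metric is enabled:
SELECT NAME, `COUNT`
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```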
Timeline
13:30 <dhinus> toolsdb just crashed while I was testing some config options (very minor ones)
13:30 <dhinus> and it's failing to restart, expect alerts/pages
13:34 <dhinus> I will fail over to tools-db-6 as I was already planning to do it anyway
13:37 Incident opened. Filippo becomes IC.
13:37 <dhinus> I'm following the guide at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
13:39 <dhinus> the first crash log is Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch>
13:40 <dhinus> there's a previous one actually Nov 11 13:20:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last check>
13:43 <dhinus> https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/281
13:46 <dhinus> running tofu apply with cookbook
13:47 <dhinus> setting read_only=off on tools-db-6 (the new primary)
13:49 <dhinus> I see write transactions are already happening on the new host
13:49 <dhinus> `sql tools` from a toolforge bastion also works
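At the MariaDB level, the promotion performed between 13:34 and 13:47 boils down to roughly the statements below (a sketch of the key runbook steps; the full procedure also moves the ToolsDB DNS name to the new host, which is what the tofu-infra change above is for):

```sql
-- On tools-db-6, the replica being promoted:

-- 1. Stop applying changes from tools-db-4 and drop the replication
--    configuration so the host no longer points at the dead primary.
STOP SLAVE;
RESET SLAVE ALL;

-- 2. Replicas run with read_only=ON; flip it to accept writes.
SET GLOBAL read_only = OFF;

-- 3. Sanity check that the host now reports itself writable.
SELECT @@hostname, @@read_only;
```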
Detection
Francesco was logged in to the server when it crashed, so he noticed immediately.
Conclusions
What went well?
- Failover to the replica was relatively quick (about 10 minutes) and painless.
What went poorly?
- If we had failed over to the replica host more quickly after the previous crash, this one might have been avoided. At the same time, we deliberately waited because we were not completely sure the previous issue was resolved, and we did not want it to recur after the failover.
Where did we get lucky?
- The crash happened during working hours, while Francesco was available. He was able to perform the failover quickly, as he wrote the failover procedure and has performed it several times in the past.
Links to relevant documentation
Actionables
- T409922 Crash recovery should not fail
- T409890 Pt-heartbeat hiera key (profile::wmcs::services::toolsdb::primary_server) should follow the DNS name to detect toolsdb primary (i.e. automate this step)
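For T409922, the log line from 13:20 already names the suspect setting (innodb_log_file_size). A minimal check of where that setting stands, as a sketch (the appropriate size depends on the workload):

```sql
-- Redo log size the crash-recovery error complained about, in MiB.
SELECT @@innodb_log_file_size / 1024 / 1024 AS redo_log_mib;

-- Rough feel for redo volume: bytes written to the log since server start.
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
```

Raising innodb_log_file_size is a server configuration change and, depending on the MariaDB version, may require a restart to take effect.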
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | no | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | yes | |
| | Were pages routed to the correct sub-team(s)? | yes | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | yes | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | no | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | no | |
| | Were the engineering tools that were to be used during the incident, available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | yes | |
| | Total score (count of all “yes” answers above) | 11 | |