Incidents/2025-07-11 WMCS Ceph issues causing Toolforge and Cloud VPS failures
document status: draft
Summary
| Incident ID | 2025-07-11 WMCS Ceph issues causing Toolforge and Cloud VPS failures | Start | 2025-07-11 07:13 |
|---|---|---|---|
| Task | T399281 | End | 2025-07-11 18:33 |
| People paged | 0 | Responder count | 5 |
| Coordinators | Francesco Negri, Andrew Bogott | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | Toolforge tools and many Cloud VPS VMs were intermittently unavailable throughout the day. The longest consecutive downtime was about 1 hour. The cumulative downtime was about 3 hours. | | |
- Toolforge tools were not responding to HTTP requests (tools-proxy-9 was returning an error page)
- We found that Ceph had been having intermittent issues since the previous night, after some hosts were upgraded to Bookworm
- This caused intermittent issues for both Toolforge and Cloud VPS
- We downgraded the six hosts that had previously been upgraded to Bookworm back to Bullseye (see the sketch below)
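The downgrade was done by reimaging each OSD node back to Bullseye, one host at a time, waiting for Ceph to rebalance in between. A minimal sketch of one iteration, assuming the standard sre.hosts.reimage cookbook invocation; the exact flags were not captured in the log and are assumptions:

```sh
# From a cumin host: reimage one OSD node back to Bullseye.
# --os picks the Debian release; -t links the SAL entries to the incident
# task. Flags shown here are assumptions, not a transcript.
sudo cookbook sre.hosts.reimage --os bullseye -t T399281 cloudcephosd1037

# From a Ceph mon host: wait until the cluster reports no degraded or
# misplaced objects before starting on the next node.
sudo ceph -s | grep -E 'degraded|misplaced'
```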
Timeline
All times in UTC.
01:59 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
02:08 RECOVERY - SSH on cloudcephosd1035 is OK
04:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
05:02 RECOVERY - SSH on cloudcephosd1036 is OK
05:28 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
05:31 RECOVERY - SSH on cloudcephosd1037 is OK
07:13 many cloud-vps hosts reported down by Prometheus, but nobody noticed
07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:20 cloud-vps back to normal (no hosts reported down)
07:21 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 779 slow ops
07:24 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
07:27 FIRING: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
07:30 RECOVERY - SSH on cloudcephosd1037 is OK
07:32 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 908 slow ops
08:08 many cloud-vps hosts again reported down
08:10 FIRING: CephSlowOps: Ceph cluster in eqiad has 1678 slow ops
08:13 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
08:18 wmcs-dnsleaks fails on cloudcontrol1007 (possibly unrelated)
08:18 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:19 cloud-vps back to normal (no hosts reported down)
08:20 Manuel reports switchmaster.toolforge.org is down
08:23 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra
08:23 FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
08:27 lucas.werkmeister@wikimedia.de reports all tools are returning an error from tools-proxy-9
08:39 <lucaswerkmeister> I can SSH into tools-proxy-9, the only failed systemd unit is logrotate which judging by the journal has been broken for a long time, probably not related
08:44 <lucaswerkmeister> I think tools-proxy-9 times out trying to reach k8s.tools.eqiad1.wikimedia.cloud in turn
08:44 <lucaswerkmeister> I can SSH into that one too, no high load there either
08:52 Incident opened. Francesco Negri becomes IC.
08:55 Toolforge is working again. No action was taken.
08:58 RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)
09:01 Incident is resolved (temporarily)
09:25 Francesco Negri starts wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37, as they were alerting with “many processes in D state” (see the check sketched after the timeline)
09:28 RECOVERY - SSH on cloudcephosd1036 is OK
09:39 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
09:42 RECOVERY - SSH on cloudcephosd1035 is OK
10:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL
10:21 FIRING: CephSlowOps: Ceph cluster in eqiad has 847 slow ops
10:24 RECOVERY - SSH on cloudcephosd1036 is OK
10:26 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1386 slow ops
10:28 FIRING: CephSlowOps: Ceph cluster in eqiad has 5134 slow ops
10:33 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1272 slow ops
11:17 <andrewbogott> I'm still half asleep and haven't read the backscroll, but my emails suggest that ceph pacific + bookworm + ceph traffic is a bad combination.
11:18 <andrewbogott> So probably the fix for this is for me to downgrade those hosts back to bullseye.
11:23 <andrewbogott> The bookworm hosts are 1006-1008,1035-1037
11:26 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
11:35 PROBLEM - SSH on cloudcephosd1008 is CRITICAL
11:41 RECOVERY - SSH on cloudcephosd1008 is OK
11:41 FIRING: WidespreadInstanceDown
11:46 RESOLVED: WidespreadInstanceDown
12:13 PROBLEM - SSH on cloudcephosd1035 is CRITICAL
12:17 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
12:18 FIRING: WidespreadInstanceDown
12:20 RECOVERY - SSH on cloudcephosd1035 is OK
12:20 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
12:23 RESOLVED: WidespreadInstanceDown
12:44 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037
13:11 cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037
14:12 cloudcephosd1013 starts having drive issues (sdj disappears and the OS hangs momentarily; kernel log: Jul 11 14:12:32 cloudcephosd1013 kernel: INFO: task md2_raid1:668 blocked for more than 120 seconds.); all OSDs crash and restart (see logs)
14:12 FIRING: WidespreadInstanceDown
14:24 Reopening the incident
14:29 <dhinus> things seem to get worse after 14:12 UTC
14:30 <andrewbogott> 1007 is frozen right now. So we /do/ have two [ceph hosts] down at once, which could maybe explain current bad behavior.
14:41 <dhinus> we have now 9 OSDs down (compared to 16 before) – [one ceph host recovered]
14:58 (Slack) https://wikipedialibrary.wmflabs.org/ is down right now too.
15:08 (Slack) Seeing failures on catalyst environments too
15:10 Many Cloud VPS VMs are down (29% of VMs in project “tools”)
15:14 <dhinus> many cloud vps VMs are still not working, and are not recovering for <reasons>
15:16 <dhinus> ceph IOPS are at about 50% of what they were this morning
15:16 <andrewbogott> ok, 1037 is up, now there's just a bit of pg shuffling to do before we can stop another host
15:16 <dhinus> do you know why we still have 1 OSD down?
15:18 <andrewbogott> I just checked, that's on 1013 which as far as I know hasn't suffered any recent maintenance. The down OSD is associated with a volume that doesn't appear in lsblk so... a mystery but /probably/ an unrelated one.
15:19 <andrewbogott> ceph is still recovering, down to 514 pgs
15:23 <dhinus> I tried manually rebooting a couple of VMs, and they do come back... but it will take a looooong time if we need to reboot all manually
15:29 <dhinus> count(up{job="node"} == 0) is finally looking good – All VMs are now reporting as healthy
15:34 <+wm-bb> <Vincent> My tool is up and running now, thank you :)
15:47 <+wm-bb> <Yetkin> My tool is up and running as well 😊
15:50 Handing off IC to Andrew Bogott
16:09 Alertmanager is complaining about OpenstackAPIResponse; slow response times only for designate-api
16:11 <andrewbogott> I'm restarting designate services
17:25 Finished reimaging of cloudcephosd1035. Remaining Bookworm OSD nodes are cloudcephosd100[6-8].
18:33 All OSDs back to Bullseye, ceph shows no misplaced objects.
18:33 Incident closed
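Two checks referenced in the timeline, sketched here for future responders; the Prometheus endpoint URL is an illustrative assumption:

```sh
# "Many processes in D state" (09:25 entry): on an affected worker, list
# processes stuck in uninterruptible sleep, which on Toolforge usually
# points at a hung NFS or Ceph-backed mount.
ps -eo pid,state,wchan:20,cmd | awk '$2 == "D"'

# Recovery check from the 15:29 entry: count VMs whose node exporter is
# down, via the Prometheus HTTP API (endpoint URL is illustrative).
curl -sG 'https://prometheus.example.wmcloud.org/api/v1/query' \
  --data-urlencode 'query=count(up{job="node"} == 0)'
```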
Detection
Users reported that several Toolforge tools were down.
No page was sent, but several non-paging alerts fired:
- 07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
- 07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
- 07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra
Conclusions
What went well?
- No data corruption
- Most VMs that went down recovered on their own once Ceph was working again
- People from other teams with relevant Ceph experience were available to help
What went poorly?
- Two out of four SREs in the WMCS team were on PTO
- When the incident started, only Francesco was online, and he had limited information on the current status of the in-progress Ceph upgrade from Bullseye to Bookworm
- The incident was resolved too soon, while things only briefly looked stable; they then got much worse
- When Andrew correctly established that we needed to downgrade the hosts from Bookworm to Bullseye, the reimage cookbook took several attempts before it succeeded
Where did we get lucky?
- The initial impact was intermittent, because only one of the six upgraded hosts went down at a time
- The incident happened during working hours
Links to relevant documentation
Actionables
- phab:T399858 Cloud Ceph misbehaving on Debian Bookworm
- phab:T399870 Add paging alert when many tools are unreachable (a possible alert expression is sketched below)
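As a starting point for T399870, the paging condition could be a count of failing blackbox probes against tool endpoints. Everything below (job and instance labels, the threshold, the endpoint URL) is an assumption to validate against the actual Prometheus setup:

```sh
# Hypothetical "many tools unreachable" expression, tried out via the
# Prometheus HTTP API before being turned into a paging alert rule.
curl -sG 'https://prometheus.example.wmcloud.org/api/v1/query' \
  --data-urlencode 'query=count(probe_success{job="blackbox", instance=~".*toolforge.org.*"} == 0) > 5'
```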
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | |
| | Were fewer than five people paged? | yes | No pages sent |
| | Were pages routed to the correct sub-team(s)? | N/A | No pages sent |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | N/A | No pages sent |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | The incident was resolved too soon, then reopened. The status remained "ongoing" until the following Monday. |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a Phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | no | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | | |