Incidents/2025-07-11 WMCS Ceph issues causing Toolforge and Cloud VPS failures

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-07-11 WMCS Ceph issues causing Toolforge and Cloud VPS failures
Start: 2025-07-11 07:13
End: 2025-07-11 18:33
Task: T399281
People paged: 0
Responder count: 5
Coordinators: Francesco Negri, Andrew Bogott
Affected metrics/SLOs: No relevant SLOs exist
Impact: Toolforge tools and many Cloud VPS VMs were intermittently unavailable throughout the day. The longest consecutive downtime was about 1 hour; the cumulative downtime was about 3 hours.
  • Toolforge tools were not responding to HTTP requests (tools-proxy-9 was returning an error page)
  • We found that Ceph had been having intermittent issues since the previous night, after some hosts were upgraded to Bookworm
  • This caused intermittent issues for both Toolforge and Cloud VPS
  • We downgraded the 6 hosts that had previously been upgraded to Bookworm back to Bullseye (see the reimage sketch below)
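
As a sketch of how the downgrade was done: each Bookworm OSD host was reimaged back to Bullseye with the standard reimage cookbook, one host at a time so Ceph could rebalance in between. The invocation below is an assumption based on the sre.hosts.reimage cookbook named in the timeline, not a transcript of the exact commands that were run:

  # From a cumin host: reimage one Ceph OSD node back to Bullseye (flags assumed)
  sudo cookbook sre.hosts.reimage --os bullseye -t T399281 cloudcephosd1037

  # Before moving on to the next host, wait for Ceph to settle (run on a mon host)
  sudo ceph -s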

Timeline

All times in UTC.

01:59 PROBLEM - SSH on cloudcephosd1035 is CRITICAL

02:08 RECOVERY - SSH on cloudcephosd1035 is OK

04:20 SSH on cloudcephosd1036 is CRITICAL

05:02 RECOVERY - SSH on cloudcephosd1036 is OK

05:28 PROBLEM - SSH on cloudcephosd1037 is CRITICAL

05:31 RECOVERY - SSH on cloudcephosd1037 is OK

07:13 many cloud-vps hosts reported down by Prometheus, but nobody noticed

07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL

07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
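
(For context: the CephSlowOps alert counts Ceph operations that have been stuck in flight for too long. A sketch of how slow ops can be inspected with standard Ceph commands; the OSD id is a placeholder, not one from this incident:)

  # On a Ceph mon host: overall health, including which OSDs report slow ops
  sudo ceph health detail
  sudo ceph -s

  # On the OSD host, dump the in-flight ops of a specific OSD (id is illustrative)
  sudo ceph daemon osd.42 dump_ops_in_flight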

07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra

07:20 cloud-vps back to normal (no hosts reported down)

07:21 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 779 slow ops

07:24 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra

07:27 FIRING: CephSlowOps: Ceph cluster in eqiad has 908 slow ops

07:30 RECOVERY - SSH on cloudcephosd1037 is OK

07:32 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 908 slow ops

08:08 many cloud-vps hosts again reported down

08:10 FIRING: CephSlowOps: Ceph cluster in eqiad has 1678 slow ops

08:13 PROBLEM - SSH on cloudcephosd1036 is CRITICAL

08:18 wmcs-dnsleaks fails on cloudcontrol1007 (possibly unrelated)

08:18 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra

08:19 cloud-vps back to normal (no hosts reported down)

08:20 Manuel reports switchmaster.toolforge.org is down

08:23 RESOLVED: WidespreadInstanceDown: Widespread instances down in project cloudinfra

08:23 FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)

08:27 lucas.werkmeister@wikimedia.de reports all tools are returning an error from tools-proxy-9

08:39  <lucaswerkmeister> I can SSH into tools-proxy-9, the only failed systemd unit is logrotate which judging by the journal has been broken for a long time, probably not related

08:44 <lucaswerkmeister> I think tools-proxy-9 times out trying to reach k8s.tools.eqiad1.wikimedia.cloud in turn

08:44 <lucaswerkmeister> I can SSH into that one too, no high load there either

08:52  Incident opened.  Francesco Negri becomes IC.

08:55 Toolforge is working again. No action was taken.

08:58 RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4)

09:01 Incident is resolved (temporarily)

09:25 Francesco Negri starts wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-77, tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-37, as they were alerting with “many processes in D state”
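
(Processes in D state are stuck in uninterruptible sleep, typically waiting on NFS or other I/O, which is why the affected NFS workers were rebooted. A quick sketch for confirming this on a worker node:)

  # List processes in uninterruptible sleep (D state) and what they are waiting on
  ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'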

09:28 SSH on cloudcephosd1036 is OK

09:39 PROBLEM - SSH on cloudcephosd1035 is CRITICAL

09:42 RECOVERY - SSH on cloudcephosd1035 is OK

10:20 PROBLEM - SSH on cloudcephosd1036 is CRITICAL

10:21 FIRING: CephSlowOps: Ceph cluster in eqiad has 847 slow ops

10:24 RECOVERY - SSH on cloudcephosd1036 is OK

10:26 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1386 slow ops

10:28 FIRING: CephSlowOps: Ceph cluster in eqiad has 5134 slow ops

10:33 RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1272 slow ops

11:17 <andrewbogott> I'm still half asleep and haven't read the backscroll, but my emails suggest that ceph pacific + bookworm + ceph traffic is a bad combination.

11:18 <andrewbogott> So probably the fix for this is for me to downgrade those hosts back to bullseye.

11:23 <andrewbogott> The bookworm hosts are 1006-1008,1035-1037

11:26 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037

11:35 SSH on cloudcephosd1008 is CRITICAL

11:41 SSH on cloudcephosd1008 is OK

11:41 FIRING: WidespreadInstanceDown

11:46 RESOLVED: WidespreadInstanceDown

12:13 SSH on cloudcephosd1035 is CRITICAL

12:17 cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037

12:18 FIRING: WidespreadInstanceDown

12:20 SSH on cloudcephosd1035 is OK

12:20 cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1037

12:23 RESOLVED: WidespreadInstanceDown

12:44 cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037

13:11 cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host cloudcephosd1037

14:12 cloudcephosd1013 starts having drive issues (sdj disappears and the OS hangs momentarily): Jul 11 14:12:32 cloudcephosd1013 kernel: INFO: task md2_raid1:668 blocked for more than 120 seconds. All OSDs crash and restart (see logs)
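
(Hung-task messages like the one above usually mean a device dropped out from under the md RAID array. A sketch of commands for confirming this on the affected host, not necessarily what was run:)

  # Kernel log: hung tasks and disk errors
  sudo dmesg -T | grep -Ei 'blocked for more than|sdj|i/o error'

  # Is the device still visible, and what does the RAID array think?
  lsblk
  cat /proc/mdstat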

14:12 FIRING: WidespreadInstanceDown

14:24 Reopening the incident

14:29 <dhinus> things seem to get worse after 14:12 UTC

14:30 <andrewbogott> 1007 is frozen right now. So we /do/ have two [ceph hosts] down at once, which could maybe explain current bad behavior.

14:41 <dhinus> we have now 9 OSDs down (compared to 16 before) – [one ceph host recovered]
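
(The down-OSD count and the hosts they live on can be read from the OSD tree; a sketch using standard Ceph commands on a mon host:)

  # Look for OSDs marked "down" and the host buckets they sit under
  sudo ceph osd tree

  # Quick summary of how many OSDs are up/in
  sudo ceph osd stat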

14:58 (Slack) https://wikipedialibrary.wmflabs.org/ is down right now too.

15:08 (Slack) Seeing failures on catalyst environments too

15:10 Many Cloud VPS VMs are down (29% of VMs in project “tools”)

15:14 <dhinus> many cloud vps VMs are still not working, and are not recovering for <reasons>

15:16 <dhinus> ceph IOPS are at about 50% of what they were this morning

15:16 <andrewbogott> ok, 1037 is up, now there's just a bit of pg shuffling to do before we can stop another host

15:16 <dhinus> do you know why we still have 1 OSD down?

15:18 <andrewbogott> I just checked, that's on 1013 which as far as I know hasn't suffered any recent maintenance. The down OSD is associated with a volume that doesn't appear in lsblk so... a mystery but /probably/ an unrelated one.

15:19 <andrewbogott> ceph is still recovering, down to 514 pgs

15:23 <dhinus> I tried manually rebooting a couple of VMs, and they do come back... but it will take a looooong time if we need to reboot all manually

15:29 <dhinus> count(up{job="node"} == 0) is finally looking good – All VMs are now reporting as healthy
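
(The PromQL above counts instances whose node exporter is down. For reference, it can also be run ad hoc against the Prometheus API; the URL below is a placeholder, not the real endpoint:)

  # Count Cloud VPS instances reporting as down (Prometheus URL is hypothetical)
  curl -sG 'https://prometheus.example.org/api/v1/query' \
    --data-urlencode 'query=count(up{job="node"} == 0)'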

15:34 <+wm-bb> <Vincent> My tool is up and running now, thank you :)

15:47 <+wm-bb> <Yetkin> My tool is up and running as well 😊

15:50 Handing off IC to Andrew Bogott

16:09 Alertmanager is complaining about OpenstackAPIResponse; slow response times only for designate-api

16:11 <andrewbogott> I'm restarting designate services
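
(A sketch of the restart, assuming Designate runs as systemd units on the cloudcontrol hosts; the exact unit names are an assumption:)

  # On each cloudcontrol host (unit name assumed)
  sudo systemctl restart designate-api
  sudo systemctl --no-pager status designate-api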

17:25 Finished reimaging of cloudcephosd1035. Remaining Bookworm OSD nodes are cloudcephosd100[6-8].

18:33 All OSDs back to Bullseye, ceph shows no misplaced objects.
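
(The "no misplaced objects" check comes from the cluster status; a sketch of the verification on a mon host:)

  # Expect HEALTH_OK, all PGs active+clean, no misplaced or degraded objects
  sudo ceph -s
  sudo ceph pg stat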

18:33 Incident closed

Detection

Users reported that several Toolforge tools were down.

No page was sent, but several non-paging alerts fired:

  • 07:14 PROBLEM - SSH on cloudcephosd1037 is CRITICAL
  • 07:16 FIRING: CephSlowOps: Ceph cluster in eqiad has 1451 slow ops
  • 07:19 FIRING: WidespreadInstanceDown: Widespread instances down in project cloudinfra

Conclusions

What went well?

  • No data corruption
  • Most VMs that went down recovered on their own once Ceph was healthy again
  • People from other teams with relevant Ceph experience were available to help

What went poorly?

  • Two of the four SREs on the WMCS team were on PTO
  • When the incident started, only Francesco was online, and he had limited information on the current status of the in-progress Ceph upgrade from Bullseye to Bookworm
  • The incident was resolved too soon: things looked stable for a while, but then got much worse
  • When Andrew correctly established that we needed to downgrade the hosts from Bookworm to Bullseye, the reimage cookbook took several attempts before it succeeded

Where did we get lucky?

  • The initial impact was intermittent, because only one of the 6 upgraded hosts went down at a time
  • The incident happened during working hours

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard

Question | Answer (yes/no) | Notes
People: Were the people responding to this incident sufficiently different than the previous five incidents? | yes |
People: Were the people who responded prepared enough to respond effectively? | no |
People: Were fewer than five people paged? | yes | No pages sent
People: Were pages routed to the correct sub-team(s)? | N/A | No pages sent
People: Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | N/A | No pages sent
Process: Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | The incident was resolved too soon, then reopened. The status remained "ongoing" until the following Monday.
Process: Was a public wikimediastatus.net entry created? | no |
Process: Is there a phabricator task for the incident? | yes |
Process: Are the documented action items assigned? | |
Process: Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling: To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
Tooling: Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Tooling: Did existing monitoring notify the initial responders? | no |
Tooling: Were the engineering tools that were to be used during the incident, available and in service? | yes |
Tooling: Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above) | |