Incidents/2024-06-11 WMCS Ceph
document status: draft
Summary
| Incident ID | 2024-06-11 WMCS Ceph | Start | 2024-06-11 14:58 |
|---|---|---|---|
| Task | T367191 | End | 2024-06-11 15:23 |
| People paged | 1 | Responder count | 6 |
| Coordinators | taavi | Affected metrics/SLOs | WMCS services do not have SLOs, so no relevant SLOs exist. |
| Impact | For about 25 minutes Cloud VPS and all services hosted on it (incl. Toolforge) were completely inaccessible. | | |
A faulty optic on one of the fiber links between WMCS racks caused packet loss between nodes in the Cloud VPS Ceph storage cluster. This made writes stall, since Ceph could not confirm that they had been committed on all of the nodes they were supposed to reach. Cloud VPS VMs cannot cope with their storage hanging like this and stalled as well, which made all services hosted on Cloud VPS inaccessible for the duration of the incident.
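This kind of write stall surfaces in Ceph as "slow ops" health warnings. For reference, the cluster state can be confirmed with the standard Ceph CLI; the commands below are illustrative only, not a transcript of what was run during the incident:

```console
# Overall cluster health; stalled writes show up here as "slow ops" warnings
ceph health detail

# Status summary: mon quorum, OSD up/in counts, PG states, client I/O rates
ceph -s
```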
Timeline
All times in UTC.
- Starting from ~2024-06-10 23:00, there is an increase in errors reported on cloudsw1-d5-eqiad:et-0/0/53 (connected to cloudsw1-f4-eqiad:et-0/0/54). As well as linking hosts in D5 and F4 (TODO: and possibly E4 and F4? need to check), this is the active link used to connect the stretched cloud-instances VLAN to the F4 rack. These errors happen in bursts, until…
- 2024-06-11 14:58:09 First alert: <+jinxer-wm> FIRING: CephSlowOps: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps OUTAGE STARTS
- 15:01:50 First page: <+icinga-wm_> PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
- ~15:02 Taavi notices the alerts during the Toolforge monthly meeting and asks David to investigate. That meeting is subsequently postponed by a week.
- 15:04 David pings Cathal on IRC about a possible network issue. Cathal is in the middle of another switch upgrade but starts looking into it
- 15:09 Taavi declares an incident and takes on the IC role
- 15:10 Meet room is opened for incident response coordination
- 15:15 David sets the norebalance flag on the Ceph cluster to stop it from moving data around for the time being (see the command sketch after this timeline)
- 15:18 Alert for the OOM killer activating on cloudcephmon1001. TODO: unclear what impact this had.
- 15:20 Arturo runs a script on all Ceph nodes to try to find patterns that would pin down the issue. TODO: was this successful in locating the issue?
- 15:2x Cathal notices high numbers of errors on the affected interfaces and disables TODO: the interfaces? BGP? to move traffic to other links. This isn't immediately communicated to the WMCS team debugging the issue in the Meet room. OUTAGE ENDS
- 15:23 Alerts start recovering.
- 15:26 Cathal moves the cloud-instances VLAN links to E4 and F4 from D5 to C8
- 15:50 Taavi starts a cookbook to reboot all Toolforge NFS-enabled worker nodes
- 16:40-17:10 DC-Ops replaces the faulty optic
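For reference, the mitigation steps above map onto commands roughly like the following. This is an illustrative sketch, not a transcript of the exact invocations used during the incident:

```console
# Stop Ceph from rebalancing data while the network is unstable
# (run on a mon/admin node with an admin keyring)
ceph osd set norebalance

# Once the links are healthy again, let rebalancing resume
ceph osd unset norebalance

# On the Juniper cloud switches, per-interface error counters can be
# inspected from the Junos CLI with e.g.:
#   show interfaces et-0/0/53 extensive
# (the same counters are also graphed in LibreNMS)
```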
Detection
Automated alerting noticed the issue: the first alert, at 14:58:09, was a warning that pointed towards a Ceph issue of some sort, and a page was sent out for a toolschecker NFS check about four minutes later (15:01:50). The first human report arrived on IRC several minutes after that:
15:08:35 <Lucas_WMDE> it looks like there might be connection issues at the moment? I can’t connect to tools via HTTPS nor SSH
The initial alerts located the issue well, although they were followed by a high volume of generic down alerts for individual VPS instances (on IRC and via email).
TODO: did metricsinfra or its meta check send any pages? if not, why?
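One way to answer the open question above is to look at the alert history recorded in Prometheus: the ALERTS metric tracks every firing alert. The sketch below assumes a Thanos/Prometheus query endpoint (the URL is a placeholder) and is not an established runbook step:

```console
# Which alerts were firing around the outage window?
# Replace the placeholder URL with the real Thanos/Prometheus query endpoint.
curl -sG 'https://thanos-query.example.org/api/v1/query_range' \
  --data-urlencode 'query=ALERTS{alertstate="firing"}' \
  --data-urlencode 'start=2024-06-11T14:55:00Z' \
  --data-urlencode 'end=2024-06-11T15:30:00Z' \
  --data-urlencode 'step=60'
```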
Conclusions
What went well?
- Automated monitoring noticed the issue quickly and provided useful pointers for where to look
- Ceph handled the network degradation relatively well and quickly recovered once traffic was shifted to alternative links
- The tooling built after previous Toolforge NFS issues for recovering from them (by restarting worker nodes) worked very well
What went poorly?
- The WMCS team's methods for finding the affected interfaces were not very efficient; the LibreNMS graphs were much more helpful, but we did not know where to find them
- Netops were in the middle of another switch upgrade, and communication between WMCS and Netops wasn't very efficient
- Taavi's declaration of an incident and his taking on the IC role got lost in the IRC chatter for some people, leading to some initially duplicated effort
Where did we get lucky?
- People with relevant knowledge were around and available quickly
Links to relevant documentation
- …
Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.
Actionables
- Done: phab:T367199 Replace the faulty optic
- Figure out if we should alert for interface errors like this one (see the query sketch after this list)
- Done: phab:T367336 Add both sides of the links to the discard/error graphs for router connectivity on the Ceph dashboards
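For the interface-error alerting item above, one possible starting point is to query the interface error counters already scraped from the switches. Both the metric name (ifInErrors) and the query endpoint below are placeholders and would need to be replaced with whatever the cloud switches actually export before this could be turned into an alert rule:

```console
# Hypothetical check: does any cloudsw interface show a sustained inbound error rate?
curl -sG 'https://thanos-query.example.org/api/v1/query' \
  --data-urlencode 'query=rate(ifInErrors{instance=~"cloudsw.*"}[10m]) > 0'
```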
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
| | Were the people who responded prepared enough to respond effectively | | |
| | Were fewer than five people paged? | | |
| | Were pages routed to the correct sub-team(s)? | | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
| | Was a public wikimediastatus.net entry created? | | |
| | Is there a phabricator task for the incident? | | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
| | Did existing monitoring notify the initial responders? | | |
| | Were the engineering tools that were to be used during the incident, available and in service? | | |
| | Were the steps taken to mitigate guided by an existing runbook? | | |
| | Total score (count of all “yes” answers above) | | |