Incidents/2022-07-12 codfw A5 powercycle


document status: final

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2022-07-12 codfw A5 powercycle
  • Task: T309957
  • Start: 2022-07-12 15:45:00
  • End: 2022-07-12 16:00:00
  • People paged: 26
  • Responder count: 9
  • Coordinators: Brandon Black
  • Affected metrics/SLOs: (none listed)
  • Impact: No apparent user-facing impact, but lots of internal cleanup, e.g. for Ganeti VMs.

During the scheduled maintenance to upgrade the PDUs in rack A5, CyrusOne flipped the wrong breaker on the breaker panel before pulling the PDU's power cord from its circuit. As a result, all servers in rack A5 lost power on both their primary and secondary power feeds. Once CyrusOne realized the mistake and flipped the breaker back on, the affected hardware in rack A5 booted back up.

  • 15:45 <+icinga-wm> PROBLEM - Host graphite2003 #page is DOWN: PING CRITICAL - Packet loss = 100%
  • 15:45 <+icinga-wm> PROBLEM - Host maps2005 is DOWN: PING CRITICAL - Packet loss = 100%
  • 15:55 <+icinga-wm> PROBLEM - MariaDB read only s8 #page on db2079 is CRITICAL: Could not connect to localhost:3306
  • 15:56 <+icinga-wm> PROBLEM - MariaDB read only m1 #page on db2132 is CRITICAL: Could not connect to localhost:3306
  • ..
  • 16:00 <+icinga-wm> RECOVERY - MariaDB read only s8 #page on db2079 is OK

Actionable

  • As a remediation item, the remaining PDU maintenance windows at codfw will no longer be performed as hot swaps against live equipment. After inadvertent incidents during 3 of 5 PDU maintenances, the DC-Ops team deemed it safer to coordinate hard downtime with the SREs for all affected servers in each rack. This allows the affected servers to be shut down gracefully and shortens the time needed to complete each PDU upgrade with the temporarily hired contractor (see the sketch below for enumerating a rack's hosts ahead of such a window).
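
One building block for such a coordinated window is enumerating every host in the affected rack so it can be downtimed and gracefully shut down beforehand. The snippet below is a minimal sketch of that step, assuming Netbox is the inventory source of truth and that a pynetbox API token is available; the NETBOX_URL/NETBOX_TOKEN placeholders, the "codfw" site slug, and the exact rack name casing are illustrative assumptions, not taken from the actual DC-Ops procedure.

    #!/usr/bin/env python3
    # Minimal sketch: list the hosts racked in codfw A5 so they can be
    # downtimed and gracefully shut down ahead of a PDU swap.
    # Assumes Netbox holds the rack inventory; NETBOX_URL and NETBOX_TOKEN
    # are illustrative placeholders for real credentials.
    import os

    import pynetbox

    NETBOX_URL = os.environ.get("NETBOX_URL", "https://netbox.example.org")
    NETBOX_TOKEN = os.environ["NETBOX_TOKEN"]

    nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)

    # Look up the rack by name within the site (rack/site naming is assumed),
    # then list the devices installed in it.
    rack = nb.dcim.racks.get(name="A5", site="codfw")
    if rack is None:
        raise SystemExit("rack A5 not found in codfw")

    for device in nb.dcim.devices.filter(rack_id=rack.id):
        print(device.name)

The resulting host list would then feed into whatever downtime and shutdown tooling the SREs use for the window; that part is intentionally left out of the sketch.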

See also

Scorecard

Incident Engagement ScoreCard
Questions and answers (yes/no); no notes were recorded.

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? No.
  • Were the people who responded prepared enough to respond effectively? Yes.
  • Were fewer than five people paged? No.
  • Were pages routed to the correct sub-team(s)? No.
  • Were pages routed to online (business hours) engineers? (Answer "no" if engineers were paged after business hours.) Yes.

Process
  • Was the incident status section actively updated during the incident? No.
  • Was the public status page updated? No.
  • Is there a phabricator task for the incident? Yes.
  • Are the documented action items assigned? No.
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? No.

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.) Yes.
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes.
  • Did existing monitoring notify the initial responders? Yes.
  • Were the engineering tools that were to be used during the incident available and in service? Yes.
  • Were the steps taken to mitigate guided by an existing runbook? No.

Total score (count of all "yes" answers above): 7