Incidents/20190115-PDU-fuses-blown

Summary

On January 15, 2019, a partial power outage affected one PDU (side B) on rack A3 in the eqiad datacenter. Several hosts in the rack went down, and the remaining hosts lost power redundancy.

Timeline (all times are in UTC):

  • 18:24 PDU side B on A3 rack in eqiad lost power to at least one phase (not clear yet)
  • 18:40 m1-master and m2-master CNAMEs changed manually in the authdns: dbproxy1001 -> dbproxy1006, dbproxy1002 -> dbproxy1007
  • 18:43 debmonitor was restarted
  • 18:43 gerrit was restarted
  • 18:52 DNS was updated with the usual procedure to include the manual change done above
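
The manual CNAME change was made by editing the rendered zone file in place on each authoritative DNS server and reloading gdnsd, with the same change committed to the dns repository once Gerrit was back (see the operations log below). A rough sketch; the record syntax is abbreviated:

    # On each authdns host, edit the live zone data directly and reload
    # (per the operations log; record formatting here is illustrative):
    sudo vi /etc/gdnsd/zones/wmnet    # m1-master  CNAME  dbproxy1001 -> dbproxy1006
                                      # m2-master  CNAME  dbproxy1002 -> dbproxy1007
    sudo gdnsdctl reload-zones

    # Any later authdns-update would revert the edit, so the same change was
    # also committed to the dns repo (https://gerrit.wikimedia.org/r/484546)
    # before DNS was touched again.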

Hosts that went down

  • cloudservices1004: passive host of an HA pair, no impact
  • cp1008: traffic testing host, no user impact
  • dbproxy1001: m1-master proxy
  • dbproxy1002: m2-master proxy
  • dbstore1003: new host not yet provisioned, no impact
  • elastic1030: Elasticsearch is redundant to host failures; no manual intervention needed and no impact
  • elastic1031: Elasticsearch is redundant to host failures; no manual intervention needed and no impact
  • prometheus1003: Prometheus host, paired with prometheus1004; it will most likely have a hole in its data for the period it was down. As far as we know, backfill has never been supported in Prometheus, by design
  • restbase1016: host was already having issues and was not in production, see https://phabricator.wikimedia.org/T212418

All the other hosts in that rack are reporting the loss of dual power.
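
The redundancy loss shows up as the IPMI power-supply alerts listed further down. A quick way to spot-check a single host is to read its power-supply sensors over the management network; a sketch assuming ipmitool, where the host, user, and password handling are only examples:

    # Query power-supply sensor records over IPMI-over-LAN; -E reads the
    # password from the IPMI_PASSWORD environment variable.
    ipmitool -I lanplus -H analytics1052.mgmt.eqiad.wmnet -U root -E \
        sdr type "Power Supply"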

Services impacted

  • m1-master.eqiad (dbproxy1001), which redirects traffic for (https://wikitech.wikimedia.org/wiki/MariaDB/misc#m1 ):
    • Bacula
    • Etherpad
  • m2-master.eqiad (dbproxy1002), which redirects traffic for (https://wikitech.wikimedia.org/wiki/MariaDB/misc#m2 ):
    • gerrit: requires the dbproxy. Its database backend has partially moved to NoteDb, but not completely (group memberships are still in MySQL); in 2.16 it could theoretically move entirely, making it independent of a dbproxy downtime. The service was restarted after the DNS change to fail over the DB master, but that was likely just to speed things up rather than strictly required (see the JVM DNS-cache note after this list).
    • debmonitor: was restarted, but the first restart happened before the CNAME change had propagated, so a second restart was required. It is unclear whether it would have recovered automatically once the CNAME had propagated.
    • otrs
  • prometheus: will very likely have an intermittent hole in its data for this interval
    • prometheus has opinions [negative] about backfill
    • grafana is not smart enough to know to fill graphs from only the 'good' prometheus server for this interval (cf. phab:T215744)
    • ... nor is it currently possible in our setup to send queries to just a particular prometheus backend (a backend can still be queried directly on its own host; see the query sketch after this list)
  • fundraising "frimpressions": possibly affected? Noticed by cwd in fr monitoring, possibly because of the changed CNAMEs of the m1/m2 masters; under investigation and unconfirmed ("likely cruft"; cwd will make a ticket to remove it)
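
On the Gerrit restart above: the operations log suggests the JVM (and/or the JDBC connection pool) kept using the old dbproxy address after the CNAME change. A sketch of where the JVM-wide DNS cache TTL lives, assuming OpenJDK 8 on Debian; paths and values are illustrative:

    # The JVM caches successful DNS lookups per the security property
    # networkaddress.cache.ttl in java.security (-1 means "cache forever").
    grep -n 'networkaddress.cache.ttl' \
        /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/security/java.security

    # The equivalent system property can be set at JVM startup, e.g. via the
    # service's JAVA_OPTS, to cap the cache at 60 seconds:
    JAVA_OPTS="$JAVA_OPTS -Dsun.net.inetaddr.ttl=60"

    # Even then, an already-established JDBC connection pool will not
    # re-resolve DNS until its connections are recycled, so a restart can
    # still be the fastest fix.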
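
To confirm the Prometheus data gap, each backend can still be queried directly on its own host over the HTTP API; the port, metric, and time range below are only illustrative:

    # Run on the Prometheus host itself; a missing interval shows up as absent
    # datapoints in the returned "values" arrays.
    curl -s 'http://localhost:9090/api/v1/query_range?query=up&start=2019-01-15T18:00:00Z&end=2019-01-15T20:00:00Z&step=60s' \
        | jq '.data.result[0].values | length'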

Next steps

  • Move the s3 master from db1075 to another host (not in A2 or A3!), T213858. Manuel says this is relatively easy from a database perspective (only one host to depool).
  • List all equipment + impact in rack A3, apart from what is already down: https://netbox.wikimedia.org/dcim/racks/3/
  • Unplug A3 tower B and plug in a spare PDU to see whether power is coming in from Equinix; otherwise we tripped a breaker and should ask Equinix smart hands to restore it (but only after replacing that PDU).
  • Replace the A2 fuse on Thursday.
  • Approve/purchase a new PDU.
  • Chris to test both spare PDUs in A3, on the A3 power side B circuit.


Useful links:

- Full list of devices in Netbox: https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=5&rack_id=3
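
The same device list can also be pulled from the Netbox REST API; the token handling and jq filter below are assumptions, and rack_id=3 matches the rack link above:

    # List device names in rack A3 (requires a Netbox API token).
    curl -s -H "Authorization: Token $NETBOX_TOKEN" \
        'https://netbox.wikimedia.org/api/dcim/devices/?rack_id=3&limit=0' \
        | jq -r '.results[].name'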

- List of hosts that currently have "Power_Supply Status: Critical [PS Redundancy = Critical]" alerts in Icinga:

   These are "all the other hosts" mentioned at the end of "Hosts that went down".

-- not ACKed

   analytics1052 -> analytics1057
   analytics1059
   analytics1060 
   elastic1032 -> elastic1035
   ganeti1007
   graphite1003
   kubernetes1001
   rdb1005
   relforge1001
   restbase1010, restbase1011

Almost all of the above alerts started between 1.5 and 2.5 hours before this list was compiled.

-- ACKed (these are probably related to older, ongoing work)

   an-worker1078, an-worker1079
   analytics1037 
   cp3030 -> cp3039 
   kafka1013 
   kafka1022 
   ms-be1044, ms-be1045 

Operations log

[17:50:03] <icinga-wm>	 PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:39] <icinga-wm>	 PROBLEM - Host dbproxy1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:15] <icinga-wm>	 PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%

[17:51:31] <elukey>	 XioNoX: working on A3?

[17:51:52] <XioNoX>	 elukey: no, Chris said we lost one power phase

[17:52:14] <volans>	 and probably the mgmt switch went down

[17:52:37] <elukey>	 I can see
[17:52:38] <elukey>	 Jan 15 17:48:14  msw1-eqiad chassism[1399]: ifd_process_flaps IFD: ge-0/0/5, sent flap msg to RE, Downstate

[17:52:40] <XioNoX>	 volans: yep, that's the only impact I can see, right?

[17:53:21] <volans>	 XioNoX: so far yes, ps1-a3-eqiad 

[17:53:35] <icinga-wm>	 RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms
[17:55:21] <icinga-wm>	 RECOVERY - Host cloudservices1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms

[17:55:24] <XioNoX>	 afaik, no servers were harmed during that power outage

[17:55:57] <icinga-wm>	 RECOVERY - Host dbproxy1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms

[17:56:18] <cmjohnson1>	 no servers went down ...just the mgmt switch ...it's not redundant

[17:56:38] <XioNoX>	 yeah, no big deal

...

[18:17:13] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cloudservices1004 is CRITICAL: .., Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical]
[18:20:23] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dbproxy1002 is CRITICAL: .., Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical]

[18:23:35] <icinga-wm>	 PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:24:51] <icinga-wm>	 PROBLEM - Host cloudservices1004 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:27] <icinga-wm>	 PROBLEM - Host dbstore1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%

[18:25:35] <paladox>	 hmm, oh gerrit's not working

[18:25:47] <icinga-wm>	 PROBLEM - debmonitor.wikimedia.org on debmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:25:53] <icinga-wm>	 PROBLEM - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:35] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1060 is CRITICAL: .., Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]

[18:26:53] <XioNoX>	 we're having power issues in a rack

[18:27:24] <jynus>	 I am going to switchover dbproxy1001 and 2

[18:28:49] <jynus>	 of course gerrit won't work

[18:28:53] <icinga-wm>	 PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%

[18:28:59] <jynus>	 bblack: can we update dns manually?

[18:30:26] <volans>	 gehel: FYI elastic103[01] affected

[18:31:07] <gehel>	 volans: reading back...

[18:32:13] <gehel>	 volans: something else than the .mgmt interfaces? I'm lost

[18:32:43] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1053 is CRITICAL: .., Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[18:33:01] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1059 is CRITICAL: .., Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]

[18:33:07] <volans>	 gehel: yes I see the two hosts down

[18:34:06] <bblack>	 jynus: update dns manually?

[18:34:11] <jynus>	 yes
[18:34:18] <jynus>	 no gerrit available

[18:34:21] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1052 is CRITICAL: .., Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]

[18:34:25] <bblack>	 uh ok

[18:34:27] <jynus>	 m1-master and m2-master

[18:34:29] <bblack>	 what change do you need?

[18:34:38] <jynus>	 from 1001 and 1002 to dbproxy1006 and 7

[18:34:45] <bblack>	 (and why is gerrit borked too, but that can be dealt with after I guess)

[18:34:54] <jynus>	 will fix gerrit

[18:36:25] <bblack>	 ok working on things

[18:36:34] <volans>	 bblack: if you're taking over I'll stop

[18:36:50] <bblack>	 volans: was anything done alreadY/

[18:36:53] * volans was reconstructing the steps based on wikitech and the script

[18:36:54] <bblack>	 y? :)

[18:36:56] <volans>	 bblack: nope

[18:37:03] <jynus>	 so this is on the eqiad.wmnet zone

[18:37:04] <volans>	 was about to :)

[18:37:14] <jynus>	 CNAME m1-master
[18:37:17] <jynus>	 and m2-master

[18:38:16] <bblack>	 yeah fwiw, direct push to git master doesn't work either

[18:38:37] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1054 is CRITICAL: .., Power_Supply Status: Critical [PS Redundancy = Critical]

[18:38:52] <mark>	 edit locally on all and fix later?

[18:39:22] <jynus>	 yes please, I assumed you knew about the script more than us (authdns-update)

[18:39:51] <volans>	 my idea was to edit locally on all authdns and find the right script to regenerate the templates and reload gdsnd

[18:39:52] <jynus>	 or any dirty way to do it
[18:40:06] <jynus>	 and once gerrit is up, I can commit the proper fix

[18:40:07] <bblack>	 !log DNS manually updated for m1-master -> dbproxy1006 and m2-master -> dbproxy1007
[18:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

[18:40:08] <volans>	 just not totally sure about all the steps and sequence

[18:40:13] <jynus>	 thanks, bblack!

[18:40:20] <volans>	 would be nice to document that on wt lateer

[18:40:26] <jynus>	 (now waiting a bit to propagate)

[18:40:29] <bblack>	 it's "fixed" now, but it will get reverted by any authdns-update, so don't touch dns till we get past this

[18:40:50] <jynus>	 gerrit is back to me

[18:41:06] <bblack>	 volans: yeah there's always supposedly been an ability to do manual updates via local git clones an dpulling from each other, but in practice I think the fallout i smessy

[18:41:07] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: connect to address 10.64.32.177 and port 9001: Connection refused

[18:41:32] <jynus>	 mmm, although ssh interface still fails

[18:41:37] <bblack>	 volans: I just literally edited the live zonefiles on the 3 servers and did "gdnsdctl reload-zones" on each server, that's the least-fallout way

[18:41:50] <hashar>	 one probably need to restart gerrit, I would guess the jdbc driver is stuck with the old IP

[18:41:55] <volans>	 no need to run gen-zones?
[18:42:01] <volans>	 or how is called the other one

[18:42:13] <bblack>	 volans: no, I edited the template outputs, not the inputs

[18:42:24] <volans>	 ah got it

[18:42:24] <jynus>	 hashar: can you do that?

[18:42:25] <bblack>	 as in "vi /etc/gdnsd/zones/wmnet"

[18:42:35] <hashar>	 jynus: probably? :)

[18:42:37] <volans>	 yeah seems the less messy

[18:42:57] <icinga-wm>	 RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.004 second response time

[18:43:02] <volans>	 !log restarted debmonitor on debmonitor1001
[18:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

[18:43:08] <jynus>	 thanks for the restarts

[18:43:13] <jynus>	 etherpad should be next, checking

[18:43:15] <hashar>	 !log Restarting Gerrit to catch up with a DNS change with the database
[18:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

[18:43:19] <bblack>	 volans: in theory we can do a git commit on any authdns locally and have the others pull that around, but then we have to fix git history (and even then, something else has seemed "off" to me in that area, but we're getting way out in left field)

[18:43:34] <volans>	 ack

[18:43:35] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8188 bytes in 0.013 second response time

[18:43:57] <jynus>	 oh, I didn't touched etherpad service yet, someone did?

[18:44:02] <volans>	 mmmh I'm still not loading debmonitor from outside... I'll have a look

[18:44:06] <volans>	 jynus: not me

[18:44:15] <bblack>	 it's kind of on the backlog somewhere to re-investigate all of that a fix it and document a procedure that's known to work and has manageable and documented fallout/recoveyr :)

[18:44:17] <hashar>	 !log [2019-01-15 18:44:06,959] [main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.15.6-5-g4b9c845200 ready
[18:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

[18:44:23] <volans>	 jynus: but seems to work to me

[18:44:24] <jynus>	 volans: dns load balancing is not great :-)

[18:44:34] <chasemp>	 volans: confirmed on debmonitor down fyi

[18:44:48] <hashar>	 jynus: so yeah gerrit is back. The java connector does not seem to handle dns changes :/

[18:45:18] <jynus>	 hashar: can you review?
[18:45:22] <jynus>	 it asks for my user

[18:46:06] <hashar>	 jynus: yes i can review stuff ^^  :)

[18:46:12] <wikibugs>	 (PS1) BBlack: emergency m[12]-master changes [dns] - https://gerrit.wikimedia.org/r/484546

[18:46:26] <volans>	 ok I did restart it too early and it got the old CNAME, should work now debmonitor

[18:46:35] <jynus>	 thanks bblack

[18:46:47] <bblack>	 that's just catching up the git repo so it doesn't revert the local changes

[18:46:55] <jynus>	 the downtime may have messed up my config

[18:46:55] <bblack>	 I'm not going to run authdns-update yet till we're a little more stable

[18:47:23] <jynus>	 oh, and review now works again

[18:47:25] <hashar>	 ah https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html "The JVM caches DNS name lookups" ...
[18:47:44] <hashar>	 but i would guess that JDBC driver has its own issue of some sort

[18:47:47] * volans wished django would have logged the IP of Can't connect to MySQL server on 'm2-master.eqiad.wmnet'

[18:47:50] <jynus>	 the dns switchover was supposed to be a temporary fix, better than having nothing
[18:48:02] <jynus>	 (as happened in the past)

[18:48:45] <icinga-wm>	 PROBLEM - IPMI Sensor Status on restbase1010 is CRITICAL: .., Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical]

[18:48:49] <jynus>	 so lets confirm potentially affected services
[18:49:21] <jynus>	 bacula etherpad librenms racktables rt

[18:49:42] <hashar>	 gerrit, debmonitor, otrs, iegreview, wikimania scholarship  (from a quick grep)
[18:49:54] <hashar>	 that is matches for m2-master in puppet

[18:49:54] <volans>	 debmonitor done fwiw

[18:49:55] <jynus>	 gerrit otrs debmonitor frimpressions iegreview scholarships

[18:50:19] <jynus>	 does anyone have otrs access and can confirm it works?

[18:50:33] <hashar>	 I don't anymore :(

[18:50:41] <Reedy>	 tzatziki: ^

Action items

  • No page was sent even though multiple important internal services were affected; review the paging policy for those services and evaluate whether any additional checks should be added
  • Document an emergency procedure on wikitech to safely deploy DNS and puppet changes when gerrit is unavailable (SCAP ones are more or less trivial)
  • dbproxy failover shouldn't require manual DNS modifications
  • PDU-exported power metrics should wind up in Prometheus somehow, whether via LibreNMS or snmp_exporter (see the sketch at the end of this list)
  • Need our own docs re: PDUs, phases, fuse ratings, replacing fuses, etc
  • Upgrade Gerrit to a version that does not need a DB and make Gerrit HA https://phabricator.wikimedia.org/T200739
  • Make Phabricator HA https://phabricator.wikimedia.org/T190572
  • Ensure that dbproxy hosts are spread across racks
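
For the PDU power-metrics action item above, one of the two candidates is Prometheus's snmp_exporter; a rough sketch of probing it for the A3 PDU, where the module name "pdu" is hypothetical and 9116 is the exporter's default port:

    # Ask a locally running snmp_exporter to walk the PDU's OIDs and filter for
    # power-related metrics; Prometheus would scrape the same /snmp endpoint.
    curl -s 'http://localhost:9116/snmp?target=ps1-a3-eqiad&module=pdu' | grep -i power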