Incidents/20180222-wdqs

Summary

Around 2018-02-22 21:00 UTC, wdqs1004 lost network access. A reboot fixed the issue, but the server crashed again shortly afterwards. It has been depooled pending further investigation.

Two short spikes in the HTTP 5xx rate can be seen on Grafana, but the failover seems to have worked well and user-facing impact was minimal.

Timeline

Taken from task T188045, times in UTC

  • 21:07 < icinga-wm> PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100%
  • 21:08 < gehel> I'm on the console on wdqs1004, it looks reasonnably well except that I cant reach network
  • 21:11 < gehel> !log powercycling wdqs1004 (complete loss of network)
  • 21:13 < icinga-wm> RECOVERY - Check size of conntrack table on wdqs1004 is OK: OK: nf_conntrack is 0 % full
  • 21:13 < icinga-wm> RECOVERY - Host wdqs1004 is UP: PING WARNING - Packet loss = 64%, RTA = 0.22 ms
  • 21:14 < icinga-wm> RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.031 second response time
  • 21:23 < XioNoX> mutante: any idea what caused that wdqs1004 issue?
  • 21:34 < icinga-wm> PROBLEM - configured eth on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
  • 21:34 < icinga-wm> PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
  • 21:34 < mutante> XioNoX: there are java errors but all that stuff seems like red herrings and normal before the incident as well.. all i can really see is it ... stopped working
  • 21:36 < XioNoX> Network interface Carrier transitions: 649, there is definitively something wrong with that host
  • 21:36 < icinga-wm> PROBLEM - Blazegraph process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
  • 21:36 < icinga-wm> PROBLEM - puppet last run on wdqs1004 is CRITICAL: Return code of 255 is out of bounds
  • 21:37 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet
  • 21:37 < icinga-wm> RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set
  • 21:37 < icinga-wm> RECOVERY - DPKG on wdqs1004 is OK: All packages OK
  • 21:41 < icinga-wm> PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused

Conclusions

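This looks like a hardware issue: the problem returned shortly after a power cycle, and 649 carrier transitions were reported on the host's network interface. We will investigate and follow up.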
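
The carrier-transition count quoted in the timeline is a useful signal for a flapping link. The report does not say where that figure was read from; as a rough sketch, a comparable counter can be read on a Linux host from the sysfs file `carrier_changes`. The interface name `eth0`, the `is_flapping` helper and its threshold of 100 are assumptions made for illustration, not values used during the incident.

```python
from pathlib import Path


def carrier_changes(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the kernel's count of carrier (link up/down) transitions for iface."""
    counter = Path(sysfs_root) / iface / "carrier_changes"
    return int(counter.read_text().strip())


def is_flapping(changes: int, threshold: int = 100) -> bool:
    """Flag a link whose transition count exceeds a threshold.

    The default threshold is a hypothetical value, not an established limit.
    """
    return changes > threshold


if __name__ == "__main__":
    import sys

    # Interface name defaults to eth0 (an assumption); pass another as argv[1].
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    count = carrier_changes(iface)
    print(f"{iface}: {count} carrier transitions, flapping={is_flapping(count)}")
```

The counter is cumulative since the interface was created, so it is most informative when compared between two readings or against hosts of similar uptime.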

Actionables

  • Investigation is tracked on phab:T188045.
  • Note that the LDF service had to be re-routed to another node manually. This is a known issue, tracked as phab:T161240.