Incidents/2023-01-10 eqsin network outage

From Wikitech

document status: final

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-01-10 eqsin network outage
Task: T328354
Start: 2023-01-10 16:00:00
End: 2023-01-10 20:57
People paged: Batphone
Responder count: 5
Coordinators: adenisse
Affected metrics/SLOs:
Impact: Users in Asia were affected for ~11 to 41 minutes

…

eqsin is connected to the core DCs via two transport links. One of them had been suffering a long-standing fiber cut (see T322529); the other went down due to planned maintenance by the transport provider.

For ~11 minutes (plus the time users' DNS resolvers took to pick up the eqsin depool, with a long tail of up to 30 minutes), users normally routed to eqsin (mostly in the APAC region) were only able to read Wikipedia pages already cached in eqsin.
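
As a rough illustration of where the "~11 to 41 minutes" figure in the summary comes from, here is a small sketch. It assumes the 16:44 first host-down and 16:55 recovery times from the timeline below, and treats the up-to-30-minute resolver tail quoted above as an assumption about downstream DNS caching rather than a measured value:

    from datetime import datetime, timedelta

    # Times taken from the timeline below (UTC); the 30-minute resolver tail is the
    # assumption quoted in the summary above, not a measured value.
    first_hosts_down = datetime(2023, 1, 10, 16, 44)
    links_recovered = datetime(2023, 1, 10, 16, 55)
    resolver_tail = timedelta(minutes=30)

    isolation = links_recovered - first_hosts_down  # ~11 minutes with eqsin cut off from the core DCs
    worst_case = isolation + resolver_tail          # up to ~41 minutes for users whose resolvers
                                                    # were slow to pick up the depool

    print(f"best case:  ~{isolation.total_seconds() / 60:.0f} min")
    print(f"worst case: ~{worst_case.total_seconds() / 60:.0f} min")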

Timeline

Dec 22, 2022:

16:06 UTC: Planned Work PWIC225900 Notification from Arelion

Jan 9, 2023:

16:06 UTC: Reminder for Planned Work PWIC225900 from Arelion

Jan 10, 2023:

16:00: Service Window for PWIC225900 starts

16:37: <+icinga-wm> PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status

16:44: <+icinga-wm> PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status

16:45: <+icinga-wm> PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:45: <+icinga-wm> PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:45: <+icinga-wm> PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host cr2-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host durum5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:47: <+icinga-wm> PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%

16:47: <+icinga-wm> PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%

16:48: <bblack> !log depooling eqsin from DNS

16:49: <+icinga-wm> PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%

16:50: <+icinga-wm> PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%

16:50: <+icinga-wm> PROBLEM - Host cr3-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%

16:50: <+icinga-wm> PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%

16:55: <+icinga-wm> RECOVERY - Host netflow5002 is UP: PING OK - Packet loss = 0%, RTA = 247.25 ms

16:55: <+icinga-wm> RECOVERY - Host durum5002 is UP: PING OK - Packet loss = 0%, RTA = 238.90 ms

16:55: <+icinga-wm> RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 244.81 ms

16:55: <+icinga-wm> RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 242.79 ms

16:55: <+icinga-wm> RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 232.47 ms

16:55: <+icinga-wm> RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 233.62 ms

16:55: <+icinga-wm> RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 250.70 ms

16:55: <+icinga-wm> RECOVERY - Host ncredir5002 is UP: PING OK - Packet loss = 0%, RTA = 231.49 ms

16:55: <+icinga-wm> RECOVERY - Host cr2-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 225.39 ms

16:55: <+icinga-wm> RECOVERY - Host cr3-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 245.89 ms

16:55: <+icinga-wm> RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 253.59 ms

16:55: <+icinga-wm> RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.01 ms

16:55: <+icinga-wm> RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 231.29 ms

16:55: <+icinga-wm> RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 245.15 ms

16:56: <+icinga-wm> RECOVERY - Host cr3-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 243.03 ms

16:56: <+icinga-wm> RECOVERY - Host bast5002 is UP: PING OK - Packet loss = 0%, RTA = 254.35 ms

16:56: <+icinga-wm> RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 251.84 ms

16:56: <+icinga-wm> RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 237.02 ms

16:57: <+icinga-wm> RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status

17:00: <+icinga-wm> PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

8:33 UTC: repooling

Detection

Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?

Automated monitoring: Icinga BGP status alerts fired first, followed by a flood of host-down alerts for eqsin (see below).

Copy the relevant alerts that fired in this section.

16:37: <+icinga-wm> PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status

16:44: <+icinga-wm> PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status

16:45: <+icinga-wm> PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:45: <+icinga-wm> PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:45: <+icinga-wm> PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host cr2-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host durum5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100%

16:46: <+icinga-wm> PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100%

16:47: <+icinga-wm> PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%

16:47: <+icinga-wm> PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%

Did the appropriate alert(s) fire? Was the alert volume manageable?

Yes, the appropriate alerts fired.

No, the alert volume was hard to handle on IRC, and 3 pages triggered at the same time, two of which escalated to batphone.

Did they point to the problem with as much accuracy as possible?

Yes.

TODO: If detection had been human-only, an actionable should probably be to "add alerting". A flood of host-down alerts usually means a network-related issue.
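
As an illustration of that heuristic, the sketch below (a hypothetical helper, not an existing Icinga or Alertmanager feature) collapses a flood of per-host DOWN alerts into a single site-level signal when several hosts at the same site go dark within a short window:

    from collections import defaultdict
    from datetime import datetime, timedelta

    # A few of the host-down alerts above, reduced to (timestamp, host, site) tuples.
    ALERTS = [
        (datetime(2023, 1, 10, 16, 45), "bast5002", "eqsin"),
        (datetime(2023, 1, 10, 16, 45), "doh5001", "eqsin"),
        (datetime(2023, 1, 10, 16, 46), "cr2-eqsin", "eqsin"),
        (datetime(2023, 1, 10, 16, 46), "cr3-eqsin", "eqsin"),
    ]

    def site_down_candidates(alerts, window=timedelta(minutes=5), threshold=3):
        """Return sites where at least `threshold` distinct hosts went down within one `window`."""
        by_site = defaultdict(list)
        for ts, host, site in alerts:
            by_site[site].append((ts, host))
        suspects = []
        for site, events in by_site.items():
            events.sort()
            first = events[0][0]
            hosts = {host for ts, host in events if ts - first <= window}
            if len(hosts) >= threshold:
                suspects.append(site)
        return suspects

    # ['eqsin'] -> more likely a network/transport issue than four unrelated host failures
    print(site_down_candidates(ALERTS))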

Conclusions

OPTIONAL: General conclusions (bullet points or narrative)

What went well?

  • Site was depooled quickly. "depool first, investigate later" was the correct attitude to adopt
  • Automated monitoring detected the issue

OPTIONAL: (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

  • eqsin ran for too long with only a single operational transport link
  • The overlapping transport-link downtimes were not caught by SREs (see the sketch after this list)
  • The provider's planned maintenance was not tracked in a task/reminder
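
One way to catch such an overlap earlier is sketched below. It assumes that provider maintenance notifications (e.g. PWIC225900) and known link outages can be reduced to simple (link, start, end) windows; the link names and the maintenance end time here are illustrative, not the real circuit identifiers:

    from datetime import datetime

    # Illustrative data only, made up to mirror this incident.
    degraded_links = {
        # link: (outage start, expected restoration; None if unknown)
        "eqsin transport A (fiber cut, T322529)": (datetime(2022, 11, 1), None),
    }

    planned_maintenance = [
        # (link, window start, window end) taken from provider notifications such as PWIC225900
        ("eqsin transport B (Arelion)", datetime(2023, 1, 10, 16, 0), datetime(2023, 1, 10, 22, 0)),
    ]

    def risky_windows(degraded, planned):
        """Flag maintenance windows that overlap an already-degraded redundant link."""
        flags = []
        for link, start, end in planned:
            for other, (since, until) in degraded.items():
                still_degraded = until is None or until > start
                if other != link and since < end and still_degraded:
                    flags.append((link, other, start, end))
        return flags

    for link, other, start, end in risky_windows(degraded_links, planned_maintenance):
        print(f"WARNING: maintenance on {link} ({start}..{end}) overlaps ongoing outage of {other}")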

OPTIONAL: (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

  • The outage happened when many SREs were connected

OPTIONAL: (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

Links to relevant documentation

  • …

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes

People
Were the people responding to this incident sufficiently different than the previous five incidents? | no |
Were the people who responded prepared enough to respond effectively? | yes |
Were fewer than five people paged? | no | 2 of the 3 pages escalated to batphone
Were pages routed to the correct sub-team(s)? | no | same as above
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes |

Process
Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes |
Was a public wikimediastatus.net entry created? | yes | https://www.wikimediastatus.net/incidents/h3kkhqf88msr
Is there a phabricator task for the incident? | yes | T328354
Are the documented action items assigned? | yes |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no |

Tooling
To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | There was no open task, but one could be opened as soon as we receive the maintenance email from the provider.
Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Did existing monitoring notify the initial responders? | yes |
Were the engineering tools that were to be used during the incident available and in service? | yes |
Were the steps taken to mitigate guided by an existing runbook? | yes |

Total score (count of all “yes” answers above) | 11