Incident documentation/20120806-Fibercut

Outage Summary

Wikimedia sites experienced an outage on 6 August 2012 that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC). The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site, and failing over services to the Tampa data center.

  • Duration: From about 13:15 UTC to 14:18 UTC; approximately 63 minutes
  • Impact: Wikimedia sites were down throughout that period. The mobile site was not back up until 15:35 UTC.
  • Cause: Fiber cut - resulting in network connectivity loss
  • Resolution: Failed over network traffic and services from the Ashburn data center to the Tampa data center
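
The durations in the summary can be cross-checked from the reported local times. The following is a minimal sketch that assumes PDT is a fixed UTC-7 offset on that date; the variable and function names are illustrative and do not come from any Wikimedia tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT is modeled as a fixed UTC-7 offset (valid for August 2012).
PDT = timezone(timedelta(hours=-7), name="PDT")

def pdt_to_utc(hour, minute):
    """Convert a local PDT time on 6 August 2012 to UTC."""
    local = datetime(2012, 8, 6, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)

# Times taken from the outage summary.
outage_start = pdt_to_utc(6, 15)
main_restored = pdt_to_utc(7, 18)
mobile_restored = pdt_to_utc(8, 35)

# Whole minutes between the start of the outage and each recovery point.
main_minutes = int((main_restored - outage_start).total_seconds() // 60)
mobile_minutes = int((mobile_restored - outage_start).total_seconds() // 60)

print(f"Start (UTC):       {outage_start:%H:%M}")
print(f"Main sites down:   {main_minutes} minutes")
print(f"Mobile site down:  {mobile_minutes} minutes")
```

Under these assumptions the main-site figure matches the 63 minutes stated above, and the mobile site was unavailable for roughly 140 minutes in total.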

Detail

At about 6:15am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider in Tampa, they informed us that a third-party crew working in the Tampa area had cut a fiber, causing the outage.
The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).
To provide network redundancy, we engaged our Tampa provider to supply two independent, segregated DWDM systems to carry the Wikimedia services. Each of these DWDM systems is routed over diverse fibers using the dual entrances into the carrier's Tampa POP. This design can deliver the two diversely routed 10G waves to Wikimedia as long as the metro segment and the long-haul segment of wave #1 are on the same route.
After initial troubleshooting of the Wikimedia waves, our provider located the source of the unexpected alarms: a section of a folded fiber segment that both of the unprotected Wikimedia services traversed. The fiber cable was damaged within this folded segment, causing the loss of service. The post-outage investigation also showed that the metro access segment of wave #1 was incorrectly routed on the same side as the long-haul segment of wave #2. The fiber cut and the incorrect routing together resulted in the loss of both network connections between our data centers.
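
The failure mode described here, two nominally diverse waves sharing one physical segment, is the kind of thing a route audit can check mechanically. The sketch below is purely illustrative: it assumes each wave can be described as a list of physical segment identifiers, and the segment names are hypothetical placeholders, not the provider's actual topology.

```python
def shared_segments(path_a, path_b):
    """Return the physical segments that appear in both paths, sorted.

    Any segment in the result is a single point of failure: one cut
    there takes down both paths at once.
    """
    return sorted(set(path_a) & set(path_b))

# Hypothetical segment names, for illustration only.
wave1 = ["tampa-entrance-A", "metro-east", "folded-segment-X", "longhaul-1"]
wave2 = ["tampa-entrance-B", "metro-west", "folded-segment-X", "longhaul-2"]

overlap = shared_segments(wave1, wave2)
if overlap:
    print("Not diverse; shared segments:", ", ".join(overlap))
else:
    print("Paths are physically diverse")
```

A real audit would need the provider's fiber records, including folded segments where routes that look separate on a logical diagram run in the same cable.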
We have since asked the provider to audit our waves to ensure that no such single point of failure remains in their system. We are also in the midst of replicating and migrating the rest of our backend services to Eqiad (our Ashburn data center), creating full service redundancy across the two data centers. The plan is to complete that work in Q2.