User:Jobo~labswiki/runbook test/Router interface down

From Wikitech

Decision Tree

Scenario 1: Router Interface Down

Example incident tracking beginning to end: https://phabricator.wikimedia.org/T314978

STEP: Gather information on affected interface from Icinga

Go to Icinga and look at the full alert, as not all of the details make it into IRC.

Example alert:

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

We need to extract the relevant information from the above text; the part that interests us is between the <BR> tags. In this example:

  • Interface name is xe-3/2/3
  • Description is Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Transport; other types are: Core, Peering, Transit, OOB.
    • The circuit is operated by Zayo, with circuit ID listed after
    • Optional: the other side of the link is cr2-codfw:xe-5/0/1
    • The remaining information is optional (latency, speed, cable #)
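The extraction above can be sketched in shell. This is a minimal illustration, not part of any tooling: the alert text is hard-coded from the example, and the field splitting assumes the `interface: down -> Type: detail` layout shown above.

```shell
# Example Icinga alert text, copied verbatim from the alert above.
alert="CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>"

# The part that interests us is between the <BR> tags.
detail=$(printf '%s' "$alert" | sed 's/.*<BR>\(.*\)<BR>.*/\1/')

iface=${detail%%:*}     # interface name: everything before the first ":"
desc=${detail#*-> }     # description: everything after "down -> "
ctype=${desc%%:*}       # circuit type: first word of the description

echo "interface=$iface type=$ctype"
echo "description=$desc"
```

Running this prints `interface=xe-3/2/3 type=Transport` followed by the full description line, i.e. exactly the fields broken out in the bullet list above.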

DECISION POINT - Internal / External Circuit Type

We will take different actions depending on whether the port that has gone down is connected to an external third party, or is an internal connection between our own devices on site.

If the 'type' is Transport, Transit, Peering or OOB, it is an external link: go to 1.1

If the 'type' is Core and there is no provider listed, it is an internal link: go to 1.2
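As a sketch, the routing above can be expressed as a small shell function (`route_circuit` is a hypothetical name, not an existing tool; it takes the circuit type extracted from the alert):

```shell
# Hypothetical helper: map a circuit type from the alert description
# to the branch of this runbook to follow.
route_circuit() {
    case "$1" in
        Transport|Transit|Peering|OOB) echo "external link: go to 1.1" ;;
        Core)                          echo "internal link: go to 1.2" ;;
        *)                             echo "unknown type: $1" ;;
    esac
}

route_circuit "Transport"   # external link: go to 1.1
route_circuit "Core"        # internal link: go to 1.2
```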

STEP: Open a phabricator task for dc-ops

If an internal link is down, create a high-severity Phabricator task and tag the local DC-Ops team (e.g. ops-eqiad, ops-codfw). Include the alert and timestamp in the task.

STEP: Run the network debug cookbook

cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
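For example, using the tracking task from the incident linked above (T314978) and the interface from the example alert, the invocation would look like the following. The device shortname `cr1-eqiad` is a hypothetical placeholder; substitute the device the alert actually fired for:

```shell
# Hypothetical values: task T314978, device cr1-eqiad, interface xe-3/2/3
cookbook sre.network.debug --task-id T314978 interface cr1-eqiad:xe-3/2/3
```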

STEP: Check if outage is planned or already acknowledged

In many cases an interface is down due to planned third-party provider maintenance, so first check for that circuit ID on the maintenance calendar.

In other cases the issue may be emergency maintenance that was not previously communicated, or a fault that the carrier has already proactively acknowledged. Check whether we have an email to maint-announce or noc from the carrier acknowledging the problem.

DECISION POINT - Do we need to contact the carrier

  • If it's scheduled, or the provider is aware of the incident, go to 2.1
  • If there have been no previous comms from the provider about it, go to 2.2

2.1 Carrier is already aware

STEP: ACK the alert

In this case we can simply downtime/ACK the alert for the duration of the maintenance.

STEP: Check other planned maintenances

It is a good idea to check the maintenance calendar for the expected duration of the current issue, and to see whether any related circuits might also suffer an outage during the same window. For instance, when a transport link to a POP goes down, there is often only one other transport circuit at that site; if that circuit also has a planned outage, there could be a problem. In that case we could depool the site or take other action. If not, the incident has been dealt with and we just need to stay in contact with the carrier until service is restored.

2.2 No previous comms from carrier

STEP: Open a phabricator task for netops

Open a high-priority Phabricator task and tag Netops, including the alert, the timestamp, and any other information gathered.

STEP: Run the network debug cookbook

cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface

STEP: Contact the provider

At this point we need to raise the outage with the carrier. Use the information in Netbox to determine how to contact the provider. In most cases this can be done via email, which should always CC noc@wikimedia.org. If the carrier can only be contacted via their portal, you may need to ask dc-ops or netops to do it if you don't already have access.

Include the circuit ID, the time the outage started, and the output of the sre.network.debug cookbook when reporting the problem.
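A minimal sketch of assembling such a report follows. All values here are placeholders taken from the example alert; the circuit ID shown is the literal placeholder from the alert text, and the timestamp is hypothetical:

```shell
# Hypothetical placeholder values for the outage report to the carrier.
circuit_id="circuitID"
outage_start="2022-08-08 14:32 UTC"
debug_output="(paste sre.network.debug cookbook output here)"

report=$(cat <<EOF
Subject: Circuit ${circuit_id} down since ${outage_start}

Our circuit ${circuit_id} went down at ${outage_start} and has not recovered.
Debug output from our side:
${debug_output}

Please investigate. CC: noc@wikimedia.org
EOF
)
printf '%s\n' "$report"
```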

STEP: Monitor for recovery

Once reported, keep an eye on the mailbox and the service status. If there has been no response from the carrier NOC within 30 minutes, call them directly instead; they may not have received the report.

STEP: ACK the alert

We should ACK the alert once it has been handled. If the carrier has given a provisional fix time, use that to set the duration; otherwise set it for 24 hours and extend as required.