Router interface down
Decision Tree
Scenario 1: Router Interface Down
Example incident tracking beginning to end: https://phabricator.wikimedia.org/T314978
STEP: Gather information on affected interface from Icinga
Go to Icinga and look at the full alert, as not all of the details make it into IRC.
Example alert:
CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>
We need to extract the relevant information from the above text; the part that interests us is between the <BR> tags. In this example:
- Interface name is xe-3/2/3
- Description is Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
- Type is Transport; the other possible types are Core, Peering, Transit and OOB
- The circuit is operated by Zayo, with the circuit ID listed after the provider name
- Optional: the other side of the link is cr2-codfw:xe-5/0/1
- The remaining information is optional (latency, speed, cable number)
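As a minimal sketch (not an official tool), the fields above can be pulled out of the alert text with a regular expression; the pattern below assumes the description always follows the "Type: far-side (provider, circuit-id)" layout shown in the example:

```python
import re

# Example alert text between the <BR> tags, as shown above.
alert = ("xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 "
         "(Zayo, circuitID) 36ms {#2909} [10Gbps wave]")

# Named groups for each field we care about; latency/speed/cable# are
# optional in real alerts, so we only match up to the circuit ID here.
match = re.match(
    r"(?P<interface>\S+): down -> "
    r"(?P<type>\w+): (?P<far_side>\S+) "
    r"\((?P<provider>[^,]+), (?P<circuit_id>[^)]+)\)",
    alert,
)
fields = match.groupdict()
print(fields)
```

Running this against the example yields the interface (xe-3/2/3), type (Transport), far side (cr2-codfw:xe-5/0/1), provider (Zayo) and circuit ID.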
DECISION POINT - Internal / External Circuit Type
We will take different actions depending on if the port that has gone down is connected to an external third party or if it is an internal connection between our own devices on site.
If the 'type' is Transport, Transit, Peering or OOB it is an external link: go to 1.2
If the 'type' is Core, and there is no provider listed, it is an internal link: go to 1.1
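The branch above can be sketched as a small helper (illustrative only; the type strings are the ones that appear in the alert description, and the section titles match the headings below):

```python
# Link types that indicate a connection to an external third party.
EXTERNAL_TYPES = {"Transport", "Transit", "Peering", "OOB"}

def next_section(link_type: str) -> str:
    """Return which runbook section to follow for a given link type."""
    if link_type in EXTERNAL_TYPES:
        return "1.2 External 3rd party provider link"
    if link_type == "Core":
        return "1.1 Internal link"
    raise ValueError(f"unknown link type: {link_type}")

print(next_section("Transport"))
print(next_section("Core"))
```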
1.1 Internal link
STEP: Open a phabricator task for dc-ops
If we have an internal link down, create a high-severity Phabricator task and tag the local DC-ops team (e.g. ops-eqiad, ops-codfw, etc.). Include the alert and timestamp in the task.
STEP: Run the network debug cookbook
cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
1.2 External 3rd party provider link
STEP: Check if outage is planned or already acknowledged
In many cases an interface can be down due to planned 3rd party provider maintenance, so first check for that circuit ID on the maintenance calendar.
In other cases the issue may be emergency maintenance that was not communicated in advance, or a fault that the carrier has already proactively acknowledged. Check whether we have an email to maint-announce or noc from the carrier acknowledging the problem.
DECISION POINT - Do we need to contact carrier
- If it's scheduled or provider is aware of the incident go to 2.1
- If there has been no previous comms from the provider about it go to 2.2
2.1 Carrier is already aware
STEP: ACK the alert
In this case we can simply downtime/ACK the alert for the duration of the maintenance
STEP: Check other planned maintenances
It is a good idea to check the maintenance calendar for the expected duration of the current issue, and to see whether any related circuits might also suffer an outage during the same window. For instance, when a transport link to a POP goes down there is often only one other transport circuit at that site; if that one is also scheduled for an outage, there could be a problem, and we may want to depool the site or take other action. If not, the incident has been dealt with and we just need to keep in contact with the carrier until service is restored.
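The overlap check above can be sketched as follows; the calendar entries and circuit names here are invented for illustration, not the real maintenance-calendar schema:

```python
from datetime import datetime

# Hypothetical planned-maintenance entries: (circuit_id, start, end).
calendar = [
    ("ZAYO-123", datetime(2022, 8, 10, 2, 0), datetime(2022, 8, 10, 6, 0)),
    ("LUMEN-456", datetime(2022, 8, 10, 4, 0), datetime(2022, 8, 10, 8, 0)),
]

def overlapping(cal, outage_start, outage_end, related_circuits):
    """Return related circuits whose planned work overlaps the outage window."""
    return [
        cid for cid, start, end in cal
        if cid in related_circuits
        and start < outage_end and end > outage_start
    ]

# Current outage window on one transport circuit: is the other transport
# circuit at the same site also planned out during that window?
hits = overlapping(calendar,
                   datetime(2022, 8, 10, 3, 0), datetime(2022, 8, 10, 5, 0),
                   {"LUMEN-456"})
print(hits)  # a non-empty result suggests depooling the site
```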
2.2 No previous comms from carrier
STEP: Open a phabricator task for netops
Open a high-priority phabricator task, tag Netops, including the alert, timestamp and any other information gathered.
STEP: Run the network debug cookbook
cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
STEP: Contact the provider
At this point we need to raise the outage with the carrier. Use the information in Netbox to determine how to contact the provider. In most cases this can be done via email, which should always have noc@wikimedia.org cc'd. If it is only possible via the carrier's portal, you may need to have dc-ops or netops do it if you don't already have access.
Include the circuit ID, time when the outage started and the output of the sre.network.debug cookbook when reporting the problem.
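A sketch of assembling that report (the function name and wording are illustrative, not an existing tool):

```python
def build_outage_report(circuit_id, outage_start, debug_output):
    """Assemble the body of an outage report for the carrier NOC."""
    return (
        f"Circuit {circuit_id} has been down since {outage_start} UTC.\n\n"
        f"Diagnostics from our side:\n{debug_output}\n\n"
        "Please investigate and confirm with a ticket number."
    )

body = build_outage_report("circuitID", "2022-08-10 03:00",
                           "<output of the sre.network.debug cookbook>")
print(body)
```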
STEP: Monitor for recovery
Once reported, we should keep an eye on the mailbox and the service status. If there has been no response from the carrier NOC within 30 minutes, call them directly instead, in case they did not receive the report.
STEP: ACK the alert
We should ACK the alert once it has been handled. If the carrier has given a provisional fix time, use that to set the duration; otherwise set it for 24 hours and extend as required.
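The duration logic in this step can be sketched as a trivial helper (illustrative only, not part of any existing tooling):

```python
from datetime import datetime, timedelta

def downtime_until(now, provisional_fix=None):
    """Pick the ACK/downtime expiry: the carrier's ETA if given, else 24h."""
    if provisional_fix is not None:
        return provisional_fix
    return now + timedelta(hours=24)

now = datetime(2022, 8, 10, 3, 0)
print(downtime_until(now))                                # no ETA: default 24h
print(downtime_until(now, datetime(2022, 8, 10, 9, 30)))  # carrier gave an ETA
```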