User:Jobo~labswiki/runbook test/BGP status
Decision Tree
Scenario 1: BGP status
BGP peerings are connections between routers to exchange routing information. Each organisation is assigned an AS (autonomous system) number, and we classify the peerings on any router based on the ASN of the other side. The BGP status check fires if any BGP session we consider critical (internal or external) goes down.
DECISION POINT - Alert Severity
- If warning/yellow, go to 1.1
- If critical/red, go to 1.2
1.1 Warning/yellow severity
Yellow warnings are usually not urgent; follow-up can wait until the next working day.
- Follow Peering management#Managing down sessions to triage and deal with the issue.
1.2 Critical/red severity
STEP: Find out the type of BGP session that went down.
BGP sessions are either internal, between Wikimedia-managed devices, or external, between us and an external provider.
- Find the ASN in the alert
- Look this up in the Nagios config to find the related carrier or internal BGP group name.
DECISION POINT - Internal or External BGP session
- If external provider go to 2.1
- If internal go to 2.2
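The lookup above can be sketched in Python. This is only an illustration under stated assumptions: the alert wording, the `ASN_TABLE` mapping and the function names are hypothetical, and the real Nagios config format is not shown in this runbook.

```python
import re

# Hypothetical lookup table standing in for the Nagios config.
# AS6939 (Hurricane Electric) is named later in this runbook; AS64600 is
# a made-up private ASN used purely for illustration.
ASN_TABLE = {
    6939: ("external", "Hurricane Electric"),
    64600: ("internal", "example-internal-group"),
}


def find_asn(alert_text):
    """Return the first ASN written as 'AS<number>' in the alert, or None."""
    match = re.search(r"\bAS(\d+)\b", alert_text)
    return int(match.group(1)) if match else None


def next_step(alert_text):
    """Map an alert to runbook step 2.1 (external) or 2.2 (internal)."""
    asn = find_asn(alert_text)
    if asn not in ASN_TABLE:
        return None  # unknown or missing ASN: needs a manual lookup
    kind, _name = ASN_TABLE[asn]
    return "2.1" if kind == "external" else "2.2"
```

An ASN that is missing from the table returns `None` rather than a guess, so the responder still has to check the config by hand.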
2.1 External provider
STEP: Open carrier page in Netbox
Look up the provider's page in Netbox
DECISION POINT - Does Netbox list provider
- If the provider's page is found on Netbox go to 2.3
- If it's Hurricane Electric / AS6939 then go to 2.4
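As a rough sketch, this decision point could be expressed as follows. The function name and the `provider_in_netbox` flag are hypothetical, and checking the Hurricane Electric case first is an assumption, since the runbook does not state which branch wins if both apply.

```python
# AS6939 is the Hurricane Electric ASN given in this runbook.
HURRICANE_ELECTRIC_ASN = 6939


def external_next_step(asn, provider_in_netbox):
    """Return the runbook step for a down external BGP session."""
    # Assumption: the Hurricane Electric special case takes priority.
    if asn == HURRICANE_ELECTRIC_ASN:
        return "2.4"
    if provider_in_netbox:
        return "2.3"
    # The runbook does not cover other providers missing from Netbox.
    return None
```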
2.3 Provider in Netbox
STEP: Find circuit information in Netbox
In Netbox look at the circuits we have from that provider at the site. It should be possible to work out which of these has the problem based on the device/router the alert was raised against.
If you click into the circuit, the 'termination' info shows the router interface it is connected to on our side.
STEP: Troubleshoot as if interface was down
Follow the instructions on the router interface down runbook, from '1.2 External 3rd party provider link' onwards, treating it as if the related interface was down.
2.4 Hurricane Electric / AS6939
STEP: Troubleshoot IXP status
This provider differs slightly from the others: we do not peer with them over dedicated circuits, but at shared internet exchange points (IXPs). The issue may be that the IX interface is down, in which case follow the router interface down runbook. Otherwise it could be a problem with HE, which is likely non-critical; please open a Phabricator task for netops to investigate when available.
2.2 Internal BGP session
STEP: Find type of internal session
There are two broad categories of internal BGP sessions in our network:
- BGP sessions between core network devices themselves (routers, switches etc.)
- BGP sessions between servers and core network devices
The BGP group name in the Nagios config is the clearest indicator of the category. Groups for sessions to servers include PyBal, Anycast, kubernetes-* and aux-k8s*.
DECISION POINT - Problem between routers or from host to router?
- If the down BGP session is between two of our network devices go to 3.1
- If the down BGP session is between a network device and server go to 3.2
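The group-name check can be sketched with shell-style pattern matching. The patterns come from the list of server groups above; the function name, the case-insensitive comparison and the example group names in the usage note are assumptions for illustration only.

```python
from fnmatch import fnmatchcase

# Group names this runbook lists as sessions to servers, lowercased.
SERVER_GROUP_PATTERNS = ("pybal", "anycast", "kubernetes-*", "aux-k8s*")


def internal_next_step(group_name):
    """Return step 3.2 for server-facing BGP groups, otherwise 3.1."""
    # Assumption: matching is case-insensitive, so lowercase the name first.
    name = group_name.lower()
    if any(fnmatchcase(name, pattern) for pattern in SERVER_GROUP_PATTERNS):
        return "3.2"
    return "3.1"
```

For example, `internal_next_step("PyBal")` and `internal_next_step("kubernetes-staging")` both give "3.2". Any group name not in the list falls through to "3.1".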
3.1 BGP Session down between network devices
STEP: Troubleshoot network BGP problem
Follow the instructions on the router interface down runbook, from '1.1 Internal Link' onwards, treating it as if the related interface was down.
3.2 BGP Session down between network device and server
STEP: Troubleshoot host BGP problem
There are various reasons this could happen. Most internal services will fail over, but we should still troubleshoot.
- Talk to the SRE team who maintain the particular end host the alert relates to, and check if they are doing any maintenance
- Open a high-priority Phabricator ticket, tagging 'netops', with as much info as possible on the alert