User:Jobo~labswiki/runbook test/BGP status

From Wikitech

Decision Tree

Scenario 1: BGP status

BGP peerings are connections between routers used to exchange routing information. Each organisation is assigned an AS (autonomous system) number, or ASN, and we classify the peerings on any router by the ASN of the other side. The BGP status check fires if any BGP session we consider critical, internal or external, goes down.

DECISION POINT - Alert Severity

  • If warning/yellow, go to 1.1
  • If critical/red, go to 1.2

1.1 Warning/yellow severity

Yellow warnings are usually not urgent; follow-up can wait until the next working day.

1.2 Critical/red severity

STEP: Find out the type of BGP session that went down.

BGP sessions are either internal, between Wikimedia-managed devices, or external, between us and an external provider.

  • Find the ASN in the alert
  • Look this up in the Nagios config to find the related carrier or internal BGP group name

DECISION POINT - Internal or External BGP session

  • If external provider, go to 2.1
  • If internal, go to 2.2
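
The first two decision points can be summarised as a small routing function. This is only an illustrative sketch: the `Severity` enum and the `next_section` function are hypothetical names, not part of any existing tooling, and the caller is assumed to have already determined (from the Nagios config) whether the peer is external.

```python
from enum import Enum


class Severity(Enum):
    """Alert severities used by this runbook (hypothetical type)."""
    WARNING = "warning"    # yellow
    CRITICAL = "critical"  # red


def next_section(severity: Severity, peer_is_external: bool) -> str:
    """Return the runbook section to continue with.

    `peer_is_external` is assumed to come from looking up the alert's
    ASN in the Nagios config, as described in section 1.2.
    """
    if severity is Severity.WARNING:
        # 1.1: not urgent, follow up on the next working day.
        return "1.1"
    # Critical alerts (1.2) branch on the type of session.
    return "2.1" if peer_is_external else "2.2"
```

For example, a critical alert for a session to an external provider leads to section 2.1, while any warning leads to section 1.1 regardless of the session type.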

2.1 External provider

STEP: Open carrier page in Netbox

Look up the provider's page in Netbox.

DECISION POINT - Does Netbox list the provider?

  • If the provider's page is found on Netbox, go to 2.3
  • If it's Hurricane Electric / AS6939, go to 2.4
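
As a rough sketch, this decision point could be expressed as follows. The function name is hypothetical, and checking AS6939 before the Netbox result is an assumption made here so that Hurricane Electric always reaches section 2.4; the runbook does not say what to do for a provider that matches neither branch, so the sketch returns `None` in that case.

```python
from typing import Optional

# AS6939 is Hurricane Electric, as named in the decision point above.
HURRICANE_ELECTRIC_ASN = 6939


def external_next_section(peer_asn: int, provider_in_netbox: bool) -> Optional[str]:
    """Return the next section for an external BGP session.

    Assumption: Hurricane Electric is checked first, regardless of
    whether a provider page exists in Netbox.
    """
    if peer_asn == HURRICANE_ELECTRIC_ASN:
        return "2.4"
    if provider_in_netbox:
        return "2.3"
    # Not covered by the runbook: no Netbox page and not AS6939.
    return None
```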

2.3 Provider in Netbox

STEP: Find circuit information in Netbox

In Netbox, look at the circuits we have from that provider at the site. It should be possible to work out which of these has the problem from the device/router the alert was raised against.

If you click into the circuit, the 'termination' info shows the router interface it is connected to on our side.

STEP: Troubleshoot as if interface was down

Follow the instructions in the router interface down runbook, from '1.2 External 3rd party provider link' onwards, treating it as if the related interface were down.

2.4 Hurricane Electric / AS6939

STEP: Troubleshoot IXP status

This provider is slightly different from the others: we do not peer with them over dedicated circuits, but at shared internet exchange points (IXPs). The issue may be that the IX interface is down, in which case follow the router interface down runbook again. Otherwise it could be a problem on HE's side. This is likely non-critical, so please open a Phabricator task for netops to investigate when available.

2.2 Internal BGP session

STEP: Find type of internal session

There are two broad categories of internal BGP sessions in our network:

  • BGP sessions between the core networking devices, switches, etc. themselves
  • BGP sessions between servers and core network devices

The name of the BGP group in the Nagios config is the clearest indicator of this. Sessions to servers include PyBal, Anycast, kubernetes-* and aux-k8s*.

DECISION POINT - Problem between routers or from host to router?

  • If the down BGP session is between two of our network devices, go to 3.1
  • If the down BGP session is between a network device and a server, go to 3.2
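
The group-name check can be sketched with shell-style pattern matching. The patterns below are taken from the server-facing group names listed above; the function name is hypothetical, and the case-insensitive comparison is an assumption, since the exact capitalisation used in the Nagios config is not stated here.

```python
from fnmatch import fnmatchcase

# Group names that indicate a session between a server and a network device.
SERVER_GROUP_PATTERNS = ("PyBal", "Anycast", "kubernetes-*", "aux-k8s*")


def internal_next_section(group_name: str) -> str:
    """Return "3.2" for server-facing BGP groups, otherwise "3.1"."""
    name = group_name.lower()
    is_server_session = any(
        fnmatchcase(name, pattern.lower()) for pattern in SERVER_GROUP_PATTERNS
    )
    return "3.2" if is_server_session else "3.1"
```

Any group name that does not match one of the server patterns is treated as a session between network devices, which mirrors the two categories described in this section.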

3.1 BGP Session down between network devices

STEP: Troubleshoot network BGP problem

Follow the instructions in the router interface down runbook, from '1.1 Internal Link' onwards, treating it as if the related interface were down.

3.2 BGP Session down between network device and server

STEP: Troubleshoot host BGP problem

There are various reasons this could happen. Most internal services will fail over, but we should still troubleshoot:

  • Talk to the SRE team who maintain the particular end host the alert relates to, and check whether they are doing any maintenance
  • Open a high-priority Phabricator ticket, tagging 'netops', with as much information about the alert as possible