SONiC/VXLAN-EVPN Network Testing - Sonic on Dell switches
This page details the individual tests carried out to validate the Dell hardware / SONiC OS platform. For further information, please see the Dell Enterprise Sonic Evaluation write-up.
Test Topology
This test was done on four Dell switches: two S5232F devices acting as spines and two S5248F devices operating as leaf/edge switches. All are based on the Broadcom Trident 3 ASIC and ran Dell SONiC-OS-3.4.1-Enterprise_Base. One server is connected to Leaf1 on port Ethernet0 and the other server is connected to Leaf2 on port Eth1/1.
Physical Tests
Transceiver Compatibility
FS.com 40GBase-PSM4 QSFP+
dell-spine1# show interface transceiver Ethernet 4 Ethernet4 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- connector-type : MPO date-code : 2020-08-13 form-factor : QSFP+ cable-length(m) : 0 display-name : QSFP+ 40GBASE-PSM4 max-module-power(Watts) : 3.5 max-port-power(Watts) : 5 vendor-oui : 44-7C-7F present : PRESENT serial-no : F1930364698 vendor : FS vendor-part : QSFP-PLR4-40G vendor-rev : 1B
Finisar 10GBase-LR SFP+
dell-leaf2# show interface transceiver Ethernet 0 Ethernet0 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- connector-type : LC date-code : 2014-07-19 form-factor : SFP+ cable-length(m) : 0 display-name : SFP+ 10GBASE-LR max-module-power(Watts) : 2 max-port-power(Watts) : 2.5 vendor-oui : 00-90-65 present : PRESENT serial-no : AS32MX4 vendor : FINISAR CORP. vendor-part : FTLX1471D3BCL vendor-rev : A
FS.com 10GBase-LR SFP+
dell-leaf2# show interface transceiver Ethernet 1 Ethernet1 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- connector-type : LC date-code : 2018-05-04 form-factor : SFP+ cable-length(m) : 0 display-name : SFP+ 10GBASE-LR max-module-power(Watts) : 2 max-port-power(Watts) : 2.5 vendor-oui : 00-90-65 present : PRESENT serial-no : G1804036447 vendor : FiberStore vendor-part : SFP-10GLR-31 vendor-rev : A
Juniper 10G 3m DAC 10GBASE-CR-DAC-3.0M
dell-leaf2# show interface transceiver Ethernet 2 Ethernet2 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- date-code : 2011-11-29 form-factor : SFP+ cable-length(m) : 3 display-name : SFP+ 10GBASE-CR-DAC-3.0M max-module-power(Watts) : 2 max-port-power(Watts) : 2.5 vendor-oui : 41-50-48 present : PRESENT serial-no : APF11460020W1Y vendor : Amphenol vendor-part : 584990002 vendor-rev : A
FS.com 10G 3m DAC
dell-leaf2# show interface transceiver Ethernet 4 Ethernet4 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- date-code : 2019-06-12 form-factor : SFP+ cable-length(m) : 3 display-name : SFP+ 10GBASE-CR-DAC-3.0M max-module-power(Watts) : 2 max-port-power(Watts) : 2.5 vendor-oui : 00-00-00 present : PRESENT serial-no : G1906269417-1 vendor : FS vendor-part : SFPP-PC03 vendor-rev : A
Juniper 40GBase-SR4 QSFP+
dell-spine1# show interface transceiver Ethernet 12 Ethernet12 --------------------------------------------------------------------- Attribute : Value/State --------------------------------------------------------------------- connector-type : MPO date-code : 2013-12-04 form-factor : QSFP+ cable-length(m) : 0 display-name : QSFP+ 40GBASE-SR4 max-module-power(Watts) : 1.5 max-port-power(Watts) : 5 vendor-oui : 00-17-6A present : PRESENT serial-no : QD491487 vendor : AVAGO vendor-part : AFBR-79EQDZ-JU1 vendor-rev : 01
FS.com SFP 1000BASE-T
The switch detects the transceiver, but the port cannot be set to 1G while its port group is still running at 25G:
Ethernet32 SFP 1000BASE-T 1.5 | 2.5 Fiberstore G1804219349 SFP-GB-GE-T N/A N/A Ready
To set the interface to 1G, first change the port group speed from 25G to 10G, then change the port speed to 1G:
sudo config portgroup speed 9 10000
dell-leaf1(conf-if-Ethernet32)# speed 1000

interface Ethernet32
 description test_srv1_1GTBase
 mtu 9100
 speed 1000
 fec none
 no shutdown
dell-leaf1# show interface Ethernet 32 Ethernet32 is up, line protocol is up Hardware is Eth Description: test_srv1_1GTBase Mode of IPV4 address assignment: not-set Mode of IPV6 address assignment: not-set Interface IPv6 oper status: Disabled IP MTU 9100 bytes LineSpeed 1GB, Auto-negotiation off FEC: DISABLED Last clearing of "show interface" counters: never 10 seconds input rate 0 packets/sec, 0 bits/sec, 0 Bytes/sec 10 seconds output rate 0 packets/sec, 0 bits/sec, 0 Bytes/sec Input statistics: 28 packets, 2072 octets 28 Multicasts, 0 Broadcasts, 0 Unicasts 0 error, 28 discarded, 0 Oversize 0 Packets (128 to 255 Octects) Output statistics: 2379 packets, 620868 octets 2379 Multicasts, 0 Broadcasts, 0 Unicasts 0 error, 0 discarded, 0 Oversize
dell-leaf1# show mac address-table interface Ethernet 32 ----------------------------------------------------------- VLAN MAC-ADDRESS TYPE INTERFACE ----------------------------------------------------------- 2004 D0:94:66:86:4D:6C DYNAMIC Ethernet32
'Breakout' Transceivers
4x10G QSFP+ in 'Breakout' Mode
The switch detects the breakout cable on both ends. The version used is PSM4.
Ethernet0 QSFP+ 40GBASE-PSM4 3.5 | 5.0 FS F1930364699 QSFP-PLR4-40G N/A N/A Ready Ethernet16 SFP+ 10GBASE-LR 2.0 | 2.5 FINISAR CORP. AP40JAN FTLX1471D3BCL N/A N/A Ready
Command for breakout:
sudo config interface breakout Ethernet0 '4x10G'
After running Logic to limit the impact
Final list of ports to be deleted :
{ "Ethernet0": "100000" }
Final list of ports to be added :
{ "Ethernet2": "10000", "Ethernet3": "10000", "Ethernet0": "10000", "Ethernet1": "10000" }

dell-spine1# show interface breakout
-----------------------------------------------
Port  Breakout Mode  Status     Interfaces
-----------------------------------------------
1/1   4x10G          Completed  Eth1/1/1
                                Eth1/1/2
                                Eth1/1/3
                                Eth1/1/4
4x25G QSFP28 in 'Breakout' Mode
It is not clear that we really need this, but it would be good to know whether it is an option. I believe it can be done with 100GBase-PSM4 -> 4x25GBase-LR:
https://community.fs.com/blog/brief-introduction-to-qsfp-100g-psm4-optical-transceiver.html
100GBase-SR4 should also work over MMF to 4x25GBase-SR
Or otherwise a DAC/AOC:
https://www.fs.com/de-en/products/116289.html
https://www.fs.com/products/70439.html
Transceiver DOM Support
Digital Optical Monitoring is the standard that transceiver modules use to send statistics to the element they are plugged into (i.e. the switch).
We need to validate that we can see this data on the Dell switches, particularly the Rx light level which is important for longer optical links.
SFP+ / SFP28
Works ok:
dell-leaf1# show interface transceiver dom Ethernet 0 ----------------------------------------------------------------------- Ethernet0 ----------------------------------------------------------------------- Identifier: SFP Vendor Name: FINISAR CORP. Vendor Part: FTLX1471D3BCL ChannelMonitorValues: Rx1Power: -2.0094 dBm Tx1Bias: 41.5540 mA Tx1Power: -1.5009 dBm ChannelThresholdValues: RxPowerHighAlarm : 2.5001 dBm RxPowerHighWarning: 2.0000 dBm RxPowerLowAlarm : -20.0000 dBm RxPowerLowWarning : -18.0134 dBm TxBiasHighAlarm : 85.0000 mA TxBiasHighWarning : 80.0000 mA TxBiasLowAlarm : 15.0000 mA TxBiasLowWarning : 20.0000 mA TxPowerHighAlarm : 2.0000 dBm TxPowerHighWarning: 0.9999 dBm TxPowerLowAlarm : -7.9997 dBm TxPowerLowWarning : -7.0006 dBm ModuleMonitorValues: Temperature: 36.6055 C Vcc: 3.3757 Volts ModuleThresholdValues: TempHighAlarm : 78.0000 C TempHighWarning: 73.0000 C TempLowAlarm : -13.0000 C TempLowWarning : -8.0000 C VccHighAlarm : 3.7000 Volts VccHighWarning : 3.6000 Volts VccLowAlarm : 2.9000 Volts VccLowWarning : 3.0000 Volts
QSFP+ / QSFP28
dell-spine1# show interface transceiver dom Eth1/1/1 ----------------------------------------------------------------------- Eth1/1/1 ----------------------------------------------------------------------- Identifier: QSFP+ Vendor Name: FS Vendor Part: QSFP-PLR4-40G ChannelMonitorValues: Rx1Power: -1.7289 dBm Rx2Power: -2.6898 dBm Rx3Power: -2.3980 dBm Rx4Power: -inf dBm
(Note: the above is in breakout mode with only three ends connected, so it makes sense that Rx4Power reads -inf dBm, i.e. no light is received on that lane.)
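As a rough illustration of how the DOM readings above could be sanity-checked, the sketch below converts between dBm and mW and compares an Rx reading against the warning thresholds reported by the module. The function names are hypothetical and not part of any SONiC tooling; the numbers are copied from the Ethernet0 output on dell-leaf1.

```python
import math


def dbm_to_mw(dbm: float) -> float:
    """Convert optical power in dBm to milliwatts (-inf dBm is 0 mW)."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw: float) -> float:
    """Convert milliwatts to dBm; zero power maps to -inf dBm."""
    return float("-inf") if mw <= 0 else 10 * math.log10(mw)


def rx_within_warning(rx_dbm: float, low_warn: float, high_warn: float) -> bool:
    """True if the Rx level sits between the low and high warning thresholds."""
    return low_warn <= rx_dbm <= high_warn


# Rx1Power and warning thresholds from the Finisar SFP+ in Ethernet0
print(rx_within_warning(-2.0094, low_warn=-18.0134, high_warn=2.0000))

# An unconnected breakout lane reports -inf dBm, which is zero received power
print(dbm_to_mw(float("-inf")))
```

This also shows why an unconnected lane is displayed as -inf dBm rather than 0: 0 dBm would mean 1 mW of received light.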
Functional Tests
L2 Access port
dell-leaf1# show running-configuration interface Ethernet 0
interface Ethernet0
 description test_srv1
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport access Vlan 2004

dell-leaf1# show running-configuration interface Eth1/1
interface Eth1/1
 description test_Srv2
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport access Vlan 2005
interface Vlan2004 description private1-e-codfw ip vrf forwarding Vrf_codfw ip anycast-address 10.192.64.254/22
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ip anycast-address 10.192.80.254/22
Test server 1 in vlan 2004
ppaul@srv1:~$ ip -br link show enp59s0f0 enp59s0f0 UP 40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>
The MAC is learnt on Leaf1, where test srv1 is connected to port Ethernet0, and is shown in the local Vlan table:
dell-leaf1# show mac address-table Vlan 2004
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2004     40:A8:F0:2C:83:10   DYNAMIC   Ethernet0
As well as being associated with the attached EVPN VNI:
dell-leaf1# show evpn mac vni 102004
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN  Seq #'s
00:00:00:10:10:10 local  Vlan2004              2004  0/0
40:a8:f0:2c:83:10 local  Ethernet0             2004  0/0
L2 Trunk port
Interface config:
dell-leaf1# show running-configuration interface Ethernet 0 ! interface Ethernet0 description test_srv1 mtu 9100 speed 10000 fec none no shutdown switchport trunk allowed Vlan 2004-2005
With the server configured with an 802.1q sub-interface for each Vlan, MACs were learnt correctly on the switch:
dell-leaf1# show mac address-table interface Ethernet 0
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2004     40:A8:F0:2C:83:10   DYNAMIC   Ethernet0
2005     40:A8:F0:2C:83:10   DYNAMIC   Ethernet0
Routing to end host from Vlan interface in VRF
IPv4
Server connected via L2 access/trunk port. Switch config:
interface Vlan2004 description private1-e-codfw ip vrf forwarding Vrf_codfw ip anycast-address 10.192.64.254/22 ipv6 anycast-address 2620:0000:0861:011c::254/64
10.192.64.10 is the IP address of test server 1. Pinging it from Leaf1 gives:
dell-leaf1# ping vrf Vrf_codfw 10.192.64.10 ping: Warning: source address might be selected on device other than Vrf_codfw. PING 10.192.64.10 (10.192.64.10) from 10.192.64.254 Vrf_codfw: 56(84) bytes of data. 64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.312 ms 64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.302 ms ^C --- 10.192.64.10 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1006ms
IPv6
The server is configured with IPv6 global unicast address 2620:0000:0861:011c::10, which we also get a response from:
dell-leaf1# ping vrf Vrf_codfw 2620:0000:0861:011c::10 ping6: Warning: source address might be selected on device other than Vrf_codfw. PING 2620:0000:0861:011c::10(2620:0:861:11c::10) from 2620:0:861:11c::254 Vrf_codfw: 56 data bytes 64 bytes from 2620:0:861:11c::10: icmp_seq=1 ttl=64 time=0.385 ms 64 bytes from 2620:0:861:11c::10: icmp_seq=2 ttl=64 time=0.305 ms
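The config and command line use zero-padded groups (2620:0000:0861:011c::10) while the ping output prints the compressed form (2620:0:861:11c::10). A minimal check with Python's standard `ipaddress` module, included here purely as an illustration, confirms that both spellings are the same address and that it falls inside the /64 configured on Vlan2004:

```python
import ipaddress

# Address as typed in the switch config / ping command
configured = ipaddress.ip_address("2620:0000:0861:011c::10")
# Address as printed in the ping replies
displayed = ipaddress.ip_address("2620:0:861:11c::10")

print(configured == displayed)
print(configured.compressed)

# The anycast gateway 2620:0000:0861:011c::254/64 defines this subnet
vlan2004_subnet = ipaddress.ip_network("2620:0:861:11c::/64")
print(configured in vlan2004_subnet)
```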
Routed port in VRF
Config added as follows to switch port:
interface Ethernet2 description srv1_nic2 mtu 9100 speed 10000 fec none no shutdown ip vrf forwarding Vrf_codfw ip address 192.0.2.8/31 ipv6 address 2620:0000:0861:011e::1/64
Server port IPs were set as follows:
root@srv1:~# ip -br addr show dev enp59s0f1 enp59s0f1 UP 192.0.2.9/31 2620:0:861:11e::2/64 fe80::42a8:f0ff:fe2c:8314/64
IPv4
IPv4 pingable both sides:
dell-leaf1# ping vrf Vrf_codfw -I 192.0.2.8 192.0.2.9 PING 192.0.2.9 (192.0.2.9) from 192.0.2.8 Vrf_codfw: 56(84) bytes of data. 64 bytes from 192.0.2.9: icmp_seq=1 ttl=64 time=0.265 ms 64 bytes from 192.0.2.9: icmp_seq=2 ttl=64 time=0.250 ms
root@srv1:~# ping 192.0.2.8 PING 192.0.2.8 (192.0.2.8) 56(84) bytes of data. 64 bytes from 192.0.2.8: icmp_seq=1 ttl=64 time=0.171 ms 64 bytes from 192.0.2.8: icmp_seq=2 ttl=64 time=0.185 ms
IPv6
Same with IPv6:
dell-leaf1# ping vrf Vrf_codfw -I 2620:0000:0861:011e::1 2620:0000:0861:011e::2 PING 2620:0000:0861:011e::2(2620:0:861:11e::2) from 2620:0:861:11e::1 Vrf_codfw: 56 data bytes 64 bytes from 2620:0:861:11e::2: icmp_seq=1 ttl=64 time=0.312 ms 64 bytes from 2620:0:861:11e::2: icmp_seq=2 ttl=64 time=0.247 ms
root@srv1:~# ping 2620:0000:0861:011e::1 PING 2620:0000:0861:011e::1(2620:0:861:11e::1) 56 data bytes 64 bytes from 2620:0:861:11e::1: icmp_seq=1 ttl=64 time=0.209 ms 64 bytes from 2620:0:861:11e::1: icmp_seq=2 ttl=64 time=0.214 ms
Jumbo frames across L2 Vlan
Same Switch
Remote via VXLAN
Jumbo frames L3 routing
Same Switch
The switch port had its MTU set to 9100. The Vlan interface did not have a specific MTU configured, but it defaults to 9100:
interface Ethernet0 description test_srv1 mtu 9100 speed 10000 fec none no shutdown switchport trunk allowed Vlan 2004-2005
dell-leaf1# show running-configuration interface Vlan 2004 ! interface Vlan2004 description private1-e-codfw ip vrf forwarding Vrf_codfw ip anycast-address 10.192.64.254/22 ipv6 anycast-address 2620:0000:0861:011c::254/64
dell-leaf1# show interface Vlan 2004 Vlan2004 is up, line protocol is up Description: private1-e-codfw Mode of IPV4 address assignment: not-set Mode of IPV6 address assignment: not-set Interface IPv6 oper status: Disabled IP MTU 9100 bytes
Server also had MTU set to 9100:
root@srv1:~# ip -4 -d addr show enp59s0f0.2004 6: enp59s0f0.2004@enp59s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP group default qlen 1000 link/ether 40:a8:f0:2c:83:10 brd ff:ff:ff:ff:ff:ff promiscuity 0 vlan protocol 802.1Q id 2004 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 inet 10.192.64.10/22 scope global enp59s0f0.2004 valid_lft forever preferred_lft forever
With this config we can ping the Vlan interface on the switch with a 9000 byte packet and DF bit set:
root@srv1:~# ping -s 9000 -M do 10.192.64.254 PING 10.192.64.254 (10.192.64.254) 9000(9028) bytes of data. 9008 bytes from 10.192.64.254: icmp_seq=1 ttl=64 time=0.244 ms 9008 bytes from 10.192.64.254: icmp_seq=2 ttl=64 time=0.247 ms 9008 bytes from 10.192.64.254: icmp_seq=3 ttl=64 time=0.260 ms 9008 bytes from 10.192.64.254: icmp_seq=4 ttl=64 time=0.265 ms
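The "9000(9028)" in the ping output is the ICMP payload followed by the resulting IP packet size. The sketch below reproduces that arithmetic and derives the largest payload that would still fit in the 9100-byte MTU. It assumes a 20-byte IPv4 header with no options; the function names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only, no extension headers
ICMP_HEADER = 8   # bytes, echo request/reply header (same for ICMPv6)


def ip_packet_size(payload: int, ipv6: bool = False) -> int:
    """Size of the IP packet produced by `ping -s <payload>`."""
    return payload + ICMP_HEADER + (IPV6_HEADER if ipv6 else IPV4_HEADER)


def max_ping_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest `ping -s` value that fits in one packet with DF set."""
    return mtu - ICMP_HEADER - (IPV6_HEADER if ipv6 else IPV4_HEADER)


print(ip_packet_size(9000))      # matches the 9028 shown by ping
print(max_ping_payload(9100))    # largest IPv4 payload for MTU 9100
```

A stricter version of the test would use `ping -s 9072 -M do`, which exercises the full 9100-byte MTU rather than stopping at 9028 bytes.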
Remote via VXLAN
Remote MAC Learning on Vlan/VNI
MAC 40:A8:F0:2C:83:10 is being learnt from srv1 connected to leaf1 on Vlan2004.
Logging on to leaf2 we can see it is receiving an EVPN type 2 route corresponding to this MAC:
dell-leaf2# show bgp l2vpn evpn route detail BGP routing table entry for 10.0.1.24:2004:[2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10] Paths: (2 available, best #1) Advertised to non peer-group peers: 172.16.1.9 172.16.1.13 Route [2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10] VNI 102004/404000 65030 65032 10.10.10.1 from 172.16.1.9 (10.0.1.13) Origin IGP, valid, external, best (Router ID) Extended Community: RT:65032:102004 RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 1 11:27:40 2022 Route [2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10] VNI 102004/404000 65030 65032 10.10.10.1 from 172.16.1.13 (10.0.1.14) Origin IGP, valid, external Extended Community: RT:65032:102004 RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 1 11:27:40 2022
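To make the route key easier to read, the following sketch splits a type 2 prefix into its fields. It assumes the bracketed field order is [type]:[Ethernet tag]:[MAC length]:[MAC]:[IP length]:[IP], which is how the output above appears to be laid out; `parse_type2` and `Type2Route` are hypothetical names, not part of any switch tooling.

```python
import re
from typing import NamedTuple, Optional


class Type2Route(NamedTuple):
    eth_tag: int
    mac: str
    ip: Optional[str]  # None for MAC-only routes


def parse_type2(prefix: str) -> Type2Route:
    """Split an EVPN type 2 prefix string into Ethernet tag, MAC and IP."""
    fields = re.findall(r"\[([^\]]*)\]", prefix)
    # MAC-only routes carry 4 fields, MAC+IP routes carry 6
    if len(fields) not in (4, 6) or fields[0] != "2":
        raise ValueError(f"not an EVPN type 2 prefix: {prefix!r}")
    ip = fields[5] if len(fields) == 6 else None
    return Type2Route(eth_tag=int(fields[1]), mac=fields[3], ip=ip)


route = parse_type2("[2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10]")
print(route.mac, route.ip)
```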
This is processed correctly and added to the local EVPN database:
dell-leaf2# show evpn mac vni 102004 mac 40:A8:F0:2C:83:10 MAC: 40:a8:f0:2c:83:10 Remote VTEP: 10.10.10.1 Local Seq: 0 Remote Seq: 0 Kernel Add: Success, Add ReAttempt:0 Neighbors: 10.192.64.10 Active 2620:0:861:11c::10 Active fe80::42a8:f0ff:fe2c:8310 Active
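The fe80:: neighbour listed above is consistent with a link-local address derived from the server's MAC using the modified EUI-64 scheme. The sketch below shows that derivation; it assumes the server generates its link-local address this way, which matches the addresses seen in this lab but is a host-side setting rather than something the switch controls.

```python
import ipaddress


def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the modified EUI-64 link-local address for a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the OUI and the device-specific half
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Address((0xFE80 << 112) | interface_id)


print(mac_to_link_local("40:a8:f0:2c:83:10"))
```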
The MAC is also visible in the local L2 forwarding table for Vlan 2004. The "interface" it was "learnt" on shows as VxLAN with the IP of the remote VTEP, as expected:
dell-leaf2# show mac address-table Vlan 2004
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2004     40:A8:F0:2C:83:10   DYNAMIC   VxLAN DIP: 10.10.10.1
Client to Client L2 Unicast forwarding
Same Switch (Pure L2)
Ethernet0 and Ethernet2 on Leaf1 were configured as trunks allowing the same Vlans:
interface Ethernet0 description test_srv1 mtu 9100 speed 10000 fec none no shutdown switchport trunk allowed Vlan 2004-2005
interface Ethernet2 description srv1_nic2 mtu 9100 speed 10000 fec none no shutdown switchport trunk allowed Vlan 2004-2005
Both of these ports were connected to the same server, so in order to make it work the second sub-interface was added to a different Linux network namespace:
root@srv1:~# ip -br link show enp59s0f0.2004 enp59s0f0.2004@enp59s0f0 UP 40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>
root@srv1:~# ip -br addr show enp59s0f0.2004 enp59s0f0.2004@enp59s0f0 UP 10.192.64.10/24 2620:0:861:11c::10/64 fe80::42a8:f0ff:fe2c:8310/64
root@srv1:~# ip route get 10.192.64.30 10.192.64.30 dev enp59s0f0.2004 src 10.192.64.10 uid 0
root@srv1:~# ip netns exec TESTNS ip -br link show dev enp59s0f1.2004 enp59s0f1.2004@if5 UP 40:a8:f0:2c:83:14 <BROADCAST,MULTICAST,UP,LOWER_UP>
root@srv1:~# ip netns exec TESTNS ip -br addr show dev enp59s0f1.2004 enp59s0f1.2004@if5 UP 10.192.64.30/24 fe80::42a8:f0ff:fe2c:8314/64
root@srv1:~# ip netns exec TESTNS ip route get 10.192.64.20 10.192.64.20 dev enp59s0f1.2004 src 10.192.64.30 uid 0
This use of separate namespaces isolates the two ports: the "default" namespace, whose port is connected to Ethernet0, does not know about the interface in TESTNS, which is connected to Ethernet2. As such, when we send frames from the default namespace to the IP configured on the TESTNS port, the server sends them out towards switch port Ethernet0, and the switch should forward them via Ethernet2, looping back to the same server. We only had a single server, so this simulates having two separate boxes and sending traffic between them. Results were as expected.
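The exact commands used to build the namespace were not recorded. As a hedged sketch of one way to arrive at the state shown in the outputs above, the helper below generates a plausible sequence of iproute2 commands (to be run as root). The helper name is hypothetical, and the order and choice of commands are an assumption; the namespace, interface, Vlan and address values are taken from the outputs above.

```python
from typing import List


def netns_vlan_commands(netns: str, parent: str, vlan_id: int, address: str) -> List[str]:
    """Build iproute2 commands that place a Vlan sub-interface in its own netns."""
    sub = f"{parent}.{vlan_id}"
    return [
        f"ip netns add {netns}",
        f"ip link set {parent} up",
        # Create the 802.1q sub-interface in the default namespace first
        f"ip link add link {parent} name {sub} type vlan id {vlan_id}",
        # Move only the sub-interface; the parent NIC stays in the default netns
        f"ip link set {sub} netns {netns}",
        f"ip netns exec {netns} ip addr add {address} dev {sub}",
        f"ip netns exec {netns} ip link set {sub} up",
    ]


for command in netns_vlan_commands("TESTNS", "enp59s0f1", 2004, "10.192.64.30/24"):
    print(command)
```

Moving only the sub-interface would explain why it appears as enp59s0f1.2004@if5 inside TESTNS: the parent device is referenced by index because it lives in a different namespace.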
Both MACs are properly learnt on the switch as expected:
dell-leaf1# show mac address-table Vlan 2004
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2004     40:A8:F0:2C:83:10   DYNAMIC   Ethernet0
2004     40:A8:F0:2C:83:14   DYNAMIC   Ethernet2
ARP works and pings flow successfully via the switch:
root@srv1:~# ip neigh show 10.192.64.30 10.192.64.30 dev enp59s0f0.2004 lladdr 40:a8:f0:2c:83:14 STALE
root@srv1:~# ip netns exec TESTNS ip neigh show 10.192.64.10 10.192.64.10 dev enp59s0f1.2004 lladdr 40:a8:f0:2c:83:10 STALE
root@srv1:~# ping 10.192.64.30 PING 10.192.64.30 (10.192.64.30) 56(84) bytes of data. 64 bytes from 10.192.64.30: icmp_seq=1 ttl=64 time=0.167 ms 64 bytes from 10.192.64.30: icmp_seq=2 ttl=64 time=0.136 ms root@srv1:~# ip netns exec TESTNS ping -c 2 10.192.64.10 PING 10.192.64.10 (10.192.64.10) 56(84) bytes of data. 64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.138 ms 64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.135 ms
TCPDUMP shows the traffic coming in externally with MACs as expected:
root@srv1:~# tcpdump -e -i enp59s0f0.2004 -l -p -nn icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes 13:12:24.539625 40:a8:f0:2c:83:14 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.64.30 > 10.192.64.10: ICMP echo request, id 6059, seq 1, length 64 13:12:24.539655 40:a8:f0:2c:83:10 > 40:a8:f0:2c:83:14, ethertype IPv4 (0x0800), length 98: 10.192.64.10 > 10.192.64.30: ICMP echo reply, id 6059, seq 1, length 64 13:12:25.545664 40:a8:f0:2c:83:14 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.64.30 > 10.192.64.10: ICMP echo request, id 6059, seq 2, length 64 13:12:25.545680 40:a8:f0:2c:83:10 > 40:a8:f0:2c:83:14, ethertype IPv4 (0x0800), length 98: 10.192.64.10 > 10.192.64.30: ICMP echo reply, id 6059, seq 2, length 64
Remote Switch (VXLAN tunneled)
Two servers were connected, SRV1 to Leaf1 Ethernet0, and SRV2 to Leaf2 Eth1/1. These ports were configured as follows:
SRV1 was connected to Vlan2005 on Leaf1 with 802.1q trunk encap:
interface Ethernet0 description test_srv1 mtu 9100 speed 10000 fec none no shutdown switchport trunk allowed Vlan 2004-2005
root@srv1:~# ip -br link show enp59s0f0.2005 enp59s0f0.2005@enp59s0f0 UP 40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>
The MAC is learnt locally on Leaf1 as expected:
dell-leaf1# show mac address-table interface Ethernet 0 | grep 2005 2005 40:A8:F0:2C:83:10 DYNAMIC Ethernet0
SRV2 was connected to Vlan2005 on Leaf2 as a regular access port:
interface Eth1/1 description test_Srv2 mtu 9100 speed 10000 fec none no shutdown switchport access Vlan 2005
root@srv2:~# ip -br link show dev enp59s0f0 enp59s0f0 UP 40:a8:f0:2c:31:68 <BROADCAST,MULTICAST,UP,LOWER_UP>
Again the MAC is properly learnt on Leaf2:
dell-leaf2# show mac address-table interface Eth1/1 ----------------------------------------------------------- VLAN MAC-ADDRESS TYPE INTERFACE ----------------------------------------------------------- 2005 40:A8:F0:2C:31:68 DYNAMIC Eth1/1
In terms of EVPN we can see that Leaf1 receives both a plain MAC-only EVPN route for this address and one that includes the IP configured on SRV2. Both have a next-hop of Leaf2's Loopback1 interface (VTEP IP):
BGP routing table entry for 10.0.1.25:2005:[2]:[0]:[48]:[40:a8:f0:2c:31:68] Paths: (2 available, best #2) Advertised to non peer-group peers: 172.16.1.1 172.16.1.5 Route [2]:[0]:[48]:[40:a8:f0:2c:31:68] VNI 102005 65030 65033 10.10.10.2 from 172.16.1.1 (10.0.1.13) Origin IGP, valid, external Extended Community: RT:65033:102005 ET:8 Last update: Fri Apr 8 15:05:41 2022 Route [2]:[0]:[48]:[40:a8:f0:2c:31:68] VNI 102005 65030 65033 10.10.10.2 from 172.16.1.5 (10.0.1.14) Origin IGP, valid, external, best (Older Path) Extended Community: RT:65033:102005 ET:8 Last update: Tue Apr 5 14:58:20 2022
BGP routing table entry for 10.0.1.25:2005:[2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10] Paths: (2 available, best #2) Advertised to non peer-group peers: 172.16.1.1 172.16.1.5 Route [2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10] VNI 102005/404000 65030 65033 10.10.10.2 from 172.16.1.1 (10.0.1.13) Origin IGP, valid, external Extended Community: RT:65033:102005 RT:65033:404000 ET:8 Rmac:3c:2c:30:4c:81:83 Last update: Fri Apr 8 15:05:41 2022 Route [2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10] VNI 102005/404000 65030 65033 10.10.10.2 from 172.16.1.5 (10.0.1.14) Origin IGP, valid, external, best (Older Path) Extended Community: RT:65033:102005 RT:65033:404000 ET:8 Rmac:3c:2c:30:4c:81:83 Last update: Tue Apr 5 14:58:20 2022
This is properly processed and the EVPN database lists it correctly:
dell-leaf1# show evpn mac vni 102005 mac 40:a8:f0:2c:31:68 MAC: 40:a8:f0:2c:31:68 Remote VTEP: 10.10.10.2 Local Seq: 0 Remote Seq: 0 Kernel Add: Success, Add ReAttempt:0 Neighbors: 10.192.80.10 Active 2620:0:861:cabf::10 Active fe80::42a8:f0ff:fe2c:3168 Active
When we look at the local Vlan forwarding table on Leaf1 we can see that this MAC has been added, with the remote VTEP listed against it:
dell-leaf1# show mac address-table dynamic Vlan 2005
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2005     40:A8:F0:2C:31:68   DYNAMIC   VxLAN DIP: 10.10.10.2
2005     40:A8:F0:2C:83:10   DYNAMIC   Ethernet0
The reverse is true on Leaf2 for SRV1's MAC learnt on Leaf1. The EVPN details are omitted as they look much the same in reverse:
dell-leaf2# show mac address-table dynamic Vlan 2005
-----------------------------------------------------------
VLAN     MAC-ADDRESS         TYPE      INTERFACE
-----------------------------------------------------------
2005     40:A8:F0:2C:31:68   DYNAMIC   Eth1/1
2005     40:A8:F0:2C:83:10   DYNAMIC   VxLAN DIP: 10.10.10.1
With this in place we should be able to send unicast Ethernet frames between the two servers, which the switches will send over the IP underlay using VXLAN. We test this with a simple ping from SRV1 to SRV2:
root@srv1:~# ping -c 2 10.192.80.10 PING 10.192.80.10 (10.192.80.10) 56(84) bytes of data. 64 bytes from 10.192.80.10: icmp_seq=1 ttl=64 time=0.206 ms 64 bytes from 10.192.80.10: icmp_seq=2 ttl=64 time=0.179 ms
--- 10.192.80.10 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1029ms rtt min/avg/max/mdev = 0.179/0.192/0.206/0.019 ms
A tcpdump on SRV2 shows the MAC addresses are as expected:
root@srv2:~# tcpdump -e -i enp59s0f0 -l -p -nn icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes 09:08:46.807595 40:a8:f0:2c:83:10 > 40:a8:f0:2c:31:68, ethertype IPv4 (0x0800), length 98: 10.192.80.187 > 10.192.80.10: ICMP echo request, id 23100, seq 1, length 64 09:08:46.807628 40:a8:f0:2c:31:68 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.80.10 > 10.192.80.187: ICMP echo reply, id 23100, seq 1, length 64 09:08:47.827376 40:a8:f0:2c:83:10 > 40:a8:f0:2c:31:68, ethertype IPv4 (0x0800), length 98: 10.192.80.187 > 10.192.80.10: ICMP echo request, id 23100, seq 2, length 64 09:08:47.827403 40:a8:f0:2c:31:68 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.80.10 > 10.192.80.187: ICMP echo reply, id 23100, seq 2, length 64
Client to Client broadcast forwarding / ingress replication
Same Switch (Pure L2)
Config the same as in 3.8.1
ARP cache was deleted in the TESTNS namespace / on port enp59s0f1, before initiating a ping:
root@srv1:~# ip netns exec TESTNS bash root@srv1:~# ip neigh del 10.192.64.10 dev enp59s0f1.2004 root@srv1:~# ping 10.192.64.10 PING 10.192.64.10 (10.192.64.10) 56(84) bytes of data. 64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.388 ms 64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.127 ms
TCPDUMP in the default netns shows the broadcast arriving on the port from the switch:
root@srv1:~# tcpdump -i enp59s0f0.2004 -e -l -p -nn arp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes 13:30:08.545177 40:a8:f0:2c:83:14 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.192.64.10 tell 10.192.64.30, length 46
Remote Switch (VXLAN tunneled)
Switch and server ports were configured the same as for the unicast test 3.8.2.
We do not have IP Multicast configured in the underlay network, so instead Ingress Replication is used. In this case the edge device receiving the BUM frame duplicates it and sends a copy to each remote switch that has originated an EVPN type 3 route to say it is participating in that VNI/Vlan.
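The behaviour described above can be sketched as a small model: each type 3 route adds its originating VTEP to a per-VNI flood list, and a BUM frame is copied once per entry in that list. This is only a conceptual illustration with hypothetical class and method names, not how the switch implements it; the VTEP addresses and VNI are the ones used in this lab.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class IngressReplicator:
    """Toy model of head-end (ingress) replication on one VTEP."""

    def __init__(self, local_vtep: str) -> None:
        self.local_vtep = local_vtep
        self.flood_list: Dict[int, Set[str]] = defaultdict(set)

    def learn_type3(self, vni: int, originator: str) -> None:
        """Record a remote VTEP that advertised a type 3 route for this VNI."""
        if originator != self.local_vtep:  # never flood back to ourselves
            self.flood_list[vni].add(originator)

    def withdraw_type3(self, vni: int, originator: str) -> None:
        """Remove a remote VTEP when its type 3 route is withdrawn."""
        self.flood_list[vni].discard(originator)

    def replicate(self, vni: int, frame: bytes) -> List[Tuple[str, int, bytes]]:
        """Return one (remote VTEP, VNI, frame) copy per participating VTEP."""
        return [(vtep, vni, frame) for vtep in sorted(self.flood_list.get(vni, ()))]


leaf1 = IngressReplicator(local_vtep="10.10.10.1")
leaf1.learn_type3(102005, "10.10.10.2")  # type 3 route originated by Leaf2
print(leaf1.replicate(102005, b"broadcast frame"))
```

With more leaves in the fabric the list returned by `replicate` grows by one copy per remote VTEP, which is the main scaling cost of ingress replication compared with underlay multicast.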
In our lab Leaf1 and Leaf2 both have Vlan2005 configured, bound to VXLAN VNI 102005. We thus see EVPN type 3 routes on each that have originated on the remote side:
Leaf1 output showing route from Leaf2:
dell-leaf1# show bgp l2vpn evpn route detail type multicast | find 10.0.1.25 Route Distinguisher: 10.0.1.25:2005 BGP routing table entry for 10.0.1.25:2005:[3]:[0]:[32]:[10.10.10.2] Paths: (2 available, best #2) Advertised to non peer-group peers: 172.16.1.1 172.16.1.5 65030 65033 10.10.10.2 from 172.16.1.1 (10.0.1.13) Origin IGP, valid, external Extended Community: RT:65033:102005 ET:8 Last update: Fri Apr 8 15:05:42 2022 PMSI Tunnel Type: Ingress Replication, label: 102005 65030 65033 10.10.10.2 from 172.16.1.5 (10.0.1.14) Origin IGP, valid, external, best (Older Path) Extended Community: RT:65033:102005 ET:8 Last update: Tue Apr 5 14:58:21 2022 PMSI Tunnel Type: Ingress Replication, label: 102005
This is also visible in the below command, with the loopback1 IP of Leaf2 (10.10.10.2) shown as a remote VTEP with flooding enabled:
dell-leaf1# show evpn vni 102005 VNI: 102005 Type: L2 Tenant VRF: default Client State: Up VxLAN interface: vtep1-2005 VxLAN ifIndex: 134 Local VTEP IP: 10.10.10.1 Mcast group: 0.0.0.0 Remote VTEPs for this VNI: 10.10.10.2 flood: HER Kernel Add: Success, Add ReAttempt:0 Number of MACs (local and remote) known for this VNI: 2 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 6 Advertise-gw-macip: No
Leaf2 output shows a similar route, but for Leaf1, and again the EVPN database shows the remote IP as being part of the VNI:
dell-leaf2# show bgp l2vpn evpn route detail type multicast | find 2005 Route Distinguisher: 10.0.1.24:2005 BGP routing table entry for 10.0.1.24:2005:[3]:[0]:[32]:[10.10.10.1] Paths: (2 available, best #2) Advertised to non peer-group peers: 172.16.1.9 172.16.1.13 65030 65032 10.10.10.1 from 172.16.1.9 (10.0.1.13) Origin IGP, valid, external Extended Community: RT:65032:102005 ET:8 Last update: Fri Apr 8 15:05:42 2022 PMSI Tunnel Type: Ingress Replication, label: 102005 65030 65032 10.10.10.1 from 172.16.1.13 (10.0.1.14) Origin IGP, valid, external, best (Older Path) Extended Community: RT:65032:102005 ET:8 Last update: Tue Apr 5 14:58:21 2022 PMSI Tunnel Type: Ingress Replication, label: 102005
dell-leaf2# show evpn vni 102005 VNI: 102005 Type: L2 Tenant VRF: Vrf_codfw Client State: Up VxLAN interface: vtep1-2005 VxLAN ifIndex: 77 Local VTEP IP: 10.10.10.2 Mcast group: 0.0.0.0 Remote VTEPs for this VNI: 10.10.10.1 flood: HER Kernel Add: Success, Add ReAttempt:0 Number of MACs (local and remote) known for this VNI: 3 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 10 Advertise-gw-macip: No
This should ensure that all broadcast frames transmitted by SRV1 will be received by SRV2 (possibly with the exception of ARP due to ARP suppression).
To test we generate a random broadcast frame on SRV1, connected to Leaf1:
root@srv1:~# echo "test broadcast in vlan2005" | nc -q1 -u -s 10.192.80.187 -b 255.255.255.255 12345
A TCPdump shows this is received as expected on SRV2, connected to Leaf2:
root@srv2:~# tcpdump -X -e -i enp59s0f0 -l -p -nn src 10.192.80.187 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes 09:35:31.670833 40:a8:f0:2c:83:10 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 69: 10.192.80.187.34163 > 255.255.255.255.12345: UDP, length 27 0x0000: 4500 0037 a5d8 4000 4011 3963 0ac0 50bb E..7..@.@.9c..P. 0x0010: ffff ffff 8573 3039 0023 68d1 7465 7374 .....s09.#h.test 0x0020: 2062 726f 6164 6361 7374 2069 6e20 766c .broadcast.in.vl 0x0030: 616e 3230 3035 0a an2005.
We can conclude that ingress replication is thus working as it should for broadcast traffic.
Client to Client L2 multicast forwarding / ingress replication
Same Switch (Pure L2)
Config the same as in 3.8.1.
Multicast generated in TESTNS with Netcat:
root@srv1:~# echo "test multicast in vlan2004" | nc -q1 -u -b 224.0.0.26 12345
This can be observed leaving the server in a TCPdump:
root@srv1:~# tcpdump -e -i enp59s0f1.2004 -l -p -nn listening on enp59s0f1.2004, link-type EN10MB (Ethernet), capture size 262144 bytes 17:02:40.891494 40:a8:f0:2c:83:14 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.64.129.38067 > 224.0.0.26.12345: UDP, length 27
The frame is received on the other port, having been flooded within the Vlan by the switch:
root@srv1:~# tcpdump -X -e -i enp59s0f0 -l -nn net 224.0.0.0/8 listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes 17:02:40.891570 40:a8:f0:2c:83:14 > 01:00:5e:00:00:1a, ethertype 802.1Q (0x8100), length 73: vlan 2004, p 0, ethertype IPv4, 10.192.64.129.38067 > 224.0.0.26.12345: UDP, length 27 0x0000: 4500 0037 6f76 4000 0111 dee4 0ac0 4081 E..7ov@.......@. 0x0010: e000 001a 94b3 3039 0023 73a4 7465 7374 ......09.#s.test 0x0020: 206d 756c 7469 6361 7374 2069 6e20 766c .multicast.in.vl 0x0030: 616e 3230 3034 0a an2004.
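The destination MAC 01:00:5e:00:00:1a seen in both captures follows from the standard IPv4 multicast mapping, where the low 23 bits of the group address are placed behind the fixed 01:00:5e prefix. A short sketch of that mapping (the function name is illustrative):

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet destination MAC."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF          # keep the low 23 bits of the group
    mac = (0x01005E << 24) | low23        # fixed 01:00:5e prefix, 25th bit zero
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("224.0.0.26"))
```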
Remote Switch (VXLAN tunneled)
Two servers are connected, one to each Leaf, the same as in test 3.8.2 / 3.9.2.
We generate a multicast frame on SRV1:
root@srv1:~# echo "test multicast in vlan2005" | nc -q1 -u -s 10.192.80.187 -b 224.0.0.26 12345
This can be seen on the wire going out from SRV1 as expected:
root@srv1:~# tcpdump -e -i enp59s0f0.2005 -l -p -nn net 224.0.0.0/8 listening on enp59s0f0.2005, link-type EN10MB (Ethernet), capture size 262144 bytes 18:08:46.017656 40:a8:f0:2c:83:10 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.80.187.35480 > 224.0.0.26.12345: UDP, length 27
The frame is received on SRV2, connected to the same Vlan but off Leaf2, showing the multicasts have been sent over the VXLAN fabric as required:
root@srv2:~# tcpdump -X -i enp59s0f0 -l -nn -e net 224.0.0.0/8 listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes 18:09:04.782380 40:a8:f0:2c:83:10 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.80.187.35480 > 224.0.0.26.12345: UDP, length 27 0x0000: 4500 0037 9a63 4000 0111 a3bd 0ac0 50bb E..7.c@.......P. 0x0010: e000 001a 8a98 3039 0023 6d84 7465 7374 ......09.#m.test 0x0020: 206d 756c 7469 6361 7374 2069 6e20 766c .multicast.in.vl 0x0030: 616e 3230 3035 0a an2005.
Inter-Vlan/subnet routing via IRB interfaces on same switch
Leaf1 has Vlan interfaces for Vlan2004 and Vlan2005 configured:
interface Vlan2004 description private1-e-codfw ip vrf forwarding Vrf_codfw ipv6 enable ip anycast-address 10.192.64.254/24 ipv6 anycast-address 2620:0:861:11c::254/64
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ipv6 enable ip anycast-address 10.192.80.254/22 ipv6 anycast-address 2620:0:861:cabf::254/64
IPv4
root@srv1:~# mtr --address 10.192.64.129 -n -r 10.192.80.187 Start: 2022-05-05T18:42:03+0000 HOST: srv1 Loss% Snt Last Avg Best Wrst StDev 1.|-- 10.192.64.254 0.0% 10 0.2 0.3 0.2 0.6 0.1 2.|-- 10.192.80.187 0.0% 10 0.2 0.2 0.2 0.2 0.0
IPv6
root@srv1:~# mtr --address 2620:0:861:11c::129 -n -r 2620:0:861:cabf::187 Start: 2022-05-05T18:41:03+0000 HOST: srv1 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:11c::254 0.0% 10 0.4 0.8 0.3 5.0 1.4 2.|-- 2620:0:861:cabf::187 0.0% 10 0.2 0.3 0.2 1.1 0.3
Inter-Vlan/subnet routing via IRB interfaces on separate switches
IPv4
From test server 1 in Vlan 2004 on Leaf1, we trace to test server 2 in Vlan 2005 on Leaf2:
root@srv1:~# mtr --address 10.192.64.10 -n -r -c 1 10.192.80.10 Start: 2022-04-28T10:03:53+0000 HOST: srv1 Loss% Snt Last Avg Best Wrst StDev 1.|-- 10.192.64.254 0.0% 1 0.4 0.4 0.4 0.4 0.0 2.|-- 10.192.80.254 0.0% 1 0.4 0.4 0.4 0.4 0.0 3.|-- 10.192.80.10 0.0% 1 0.3 0.3 0.3 0.3 0.0
IPv6
Routing works as expected; we can ping between subnets across switches:
root@srv1:~# ping -I 2620:0:861:11c::10 2620:0:861:cabf::10 PING 2620:0:861:cabf::10(2620:0:861:cabf::10) from 2620:0:861:11c::10 : 56 data bytes 64 bytes from 2620:0:861:cabf::10: icmp_seq=1 ttl=62 time=0.227 ms 64 bytes from 2620:0:861:cabf::10: icmp_seq=2 ttl=62 time=0.176 ms
One issue we note, however, is that when doing a traceroute the *remote* switch shows up with an IPv6 link-local address.
root@srv1:~# mtr --address 2620:0:861:11c::10 -n -r 2620:0:861:cabf::10 Start: 2022-04-28T10:14:45+0000 HOST: srv1 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:11c::254 0.0% 10 0.4 0.4 0.3 0.4 0.0 2.|-- fe80::3e2c:30ff:fe4c:8183 0.0% 10 0.3 0.8 0.3 4.9 1.4 3.|-- 2620:0:861:cabf::10 0.0% 10 0.3 0.3 0.3 0.3 0.0
The equivalent at hop 2 in the IPv4 test was the address of the Vlan2005 interface on Leaf 2 (10.192.80.254). That interface is configured with a global unicast IPv6 address, and v6 is enabled, but it doesn't use it:
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ipv6 enable ip anycast-address 10.192.80.254/22 ipv6 anycast-address 2620:0:861:cabf::254/64
Looking a little closer, it seems the address used to source the ICMPv6 Time Exceeded messages is the one assigned to Vlan4000 on the switch (this is the Vlan created to bind to our VRF for VXLAN encap):
admin@dell-leaf2:~$ ip -br addr show | grep 8183 Vlan4000@Bridge UP fe80::3e2c:30ff:fe4c:8183/64
dell-leaf2# show running-configuration interface Vlan 4000 ! interface Vlan4000 description "IRB VLAN" ip vrf forwarding Vrf_codfw ipv6 enable
We see the same behaviour when doing a trace which routes out via an external device connected to the Spine layer within the VRF:
root@srv1:~# mtr --address 2620:0:861:11c::10 -n -r 2001:67c:930:400::26 Start: 2022-04-28T10:39:04+0000 HOST: srv1 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:11c::254 0.0% 10 0.3 0.3 0.3 0.3 0.0 2.|-- fe80::e29:efff:fee1:fb01 0.0% 10 0.5 0.4 0.4 0.5 0.0 3.|-- 2001:67c:930:400::26 0.0% 10 5.5 6.1 2.0 11.1 2.4
In this case again the IP used by Spine1 (hop 2) is link-local. Looking at the device, this IP is seemingly used on both the external interface and the Vlan4000 device:
admin@dell-spine1:~$ ip -br addr show | grep "fee1:fb01" Ethernet104 UP 172.16.1.25/30 2001:67c:930:400::25/64 fe80::e29:efff:fee1:fb01/64 Vlan4000@Bridge UP fe80::e29:efff:fee1:fb01/64
It would certainly be better if the system had used the global unicast address 2001:67c:930:400::25, or another unicast address on the device, to source the ICMPv6 Time Exceeded messages. In contrast, the Juniper QFX platform does use global unicast addressing to source ICMP messages, which makes troubleshooting easier as the switches can be identified in a trace:
cmooney@elastic1089:~$ sudo traceroute -I -w 1 -s 2620:0:861:109:10:64:130:7 2620:0:861:10d:10:64:134:2 traceroute to 2620:0:861:10d:10:64:134:2 (2620:0:861:10d:10:64:134:2), 30 hops max, 80 byte packets 1 irb-1031.lsw1-e1-eqiad.eqiad.wmnet (2620:0:861:109::1) 0.610 ms 0.583 ms 0.576 ms 2 irb-1035.lsw1-f1-eqiad.eqiad.wmnet (2620:0:861:10d::1) 0.515 ms 0.509 ms 0.504 ms 3 dumpsdata1007.eqiad.wmnet (2620:0:861:10d:10:64:134:2) 0.170 ms * *
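Until the SONiC switches source these replies from a global address, a link-local hop can still be mapped back to a device, because the addresses seen in the traces are EUI-64 style and so embed the interface MAC. The sketch below recovers the MAC from such an address so it can be compared against a switch's MAC. The function name is ours, and it assumes the interface identifier was derived with the usual EUI-64 method.

```python
import ipaddress


def eui64_link_local_to_mac(address):
    """Recover the MAC address embedded in an EUI-64 based IPv6 address."""
    iid = ipaddress.IPv6Address(address).packed[8:]   # 64-bit interface identifier
    if iid[3:5] != b"\xff\xfe":
        raise ValueError(f"{address} does not look EUI-64 derived")
    # Flip the universal/local bit back and drop the ff:fe filler bytes.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{octet:02x}" for octet in mac)


if __name__ == "__main__":
    # Hop 2 addresses taken from the mtr output in this section.
    for hop in ("fe80::3e2c:30ff:fe4c:8183", "fe80::e29:efff:fee1:fb01"):
        print(hop, "->", eui64_link_local_to_mac(hop))
```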
IPv6 RA Generation from Vlan Interfaces
NOT WORKING
Vlan configuration as follows on Leaf2:
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ipv6 enable ip anycast-address 10.192.80.254/22 ipv6 anycast-address 2620:0:861:cabf::254/6
However, no IPv6 RAs are received on SRV2, which is connected on an access port in this Vlan:
root@srv2:~# tcpdump -i enp59s0f0 -l -nn icmp6 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes ^C 0 packets captured 0 packets received by filter 0 packets dropped by kernel
BGP Peering on Vlan segment to end device
IPv4 only
Srv1 is reachable in Vlan2004 on IP 10.192.64.10
dell-leaf1# show ip arp vrf Vrf_codfw Type: R - Remote Neighbor entries (EVPN) ------------------------------------------------------------------------------------------- Address Hardware address Interface Egress Interface Type ------------------------------------------------------------------------------------------- 10.192.64.10 40:a8:f0:2c:83:10 Vlan2004 Ethernet0 Dynamic
We configure a BGP session to it within the VRF as follows:
router bgp 65032 vrf Vrf_codfw router-id 10.0.1.24 log-neighbor-changes bestpath as-path multipath-relax ! address-family ipv4 unicast redistribute connected maximum-paths 16 ! neighbor 10.192.64.10 remote-as 64600 timers 5 20 advertisement-interval 0 bfd local-as 14907 no-prepend replace-as ! address-family ipv4 unicast activate route-map NO_HOST_ROUTES out remove-private-as !
The "NO_HOST_ROUTES" filter is designed to prevent /32 host routes from being sent externally. These end up in the VRF BGP RIB due to the import of EVPN type 2 MAC/IP routes to the local table. Instead we only want to send the routes that have originated locally or as type 5 EVPN routes, i.e. the subnets allocated to our Vlans and not every host IP from the Vlan.
ip prefix-list NO_HOST_ROUTES seq 5 permit 0.0.0.0/0 le 29
route-map NO_HOST_ROUTES permit 100 match ip address prefix-list NO_HOST_ROUTES
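To make the effect of the prefix-list explicit: `permit 0.0.0.0/0 le 29` matches any IPv4 prefix with a length of /29 or shorter, so /30, /31 and /32 routes fall through to the implicit deny. The following Python sketch models that matching rule. It assumes the usual prefix-list semantics (as used by FRR and similar stacks) rather than anything verified against the SONiC implementation, and the function name is illustrative.

```python
import ipaddress


def prefix_list_permits(prefix, base="0.0.0.0/0", le=29):
    """Model 'permit <base> le <le>': match on containment and prefix length."""
    net = ipaddress.ip_network(prefix)
    base_net = ipaddress.ip_network(base)
    if net.version != base_net.version:
        return False
    # The candidate must sit inside the base prefix and be no longer than 'le'.
    return net.subnet_of(base_net) and base_net.prefixlen <= net.prefixlen <= le


if __name__ == "__main__":
    for candidate in ("10.192.80.0/22", "10.192.64.10/32", "172.16.1.24/30"):
        verdict = "permit" if prefix_list_permits(candidate) else "deny"
        print(f"{candidate}: {verdict}")
```

Under this model the Vlan subnets are advertised while host routes, and also any /30 or /31 link subnets, are filtered.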
Srv1 is set up for BGP (running FRR in this case), configured to peer with the IP of Leaf1 on Vlan2004:
router bgp 64600 neighbor 10.192.64.254 remote-as 14907 neighbor 10.192.64.254 bfd ! address-family ipv4 unicast exit-address-family
With this in place the session establishes:
srv1# show bgp ipv4 unicast summary BGP router identifier 1.1.1.1, local AS number 64600 vrf-id 0 BGP table version 15 RIB entries 5, using 920 bytes of memory Peers 1, using 723 KiB of memory Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc 10.192.64.254 4 14907 482 396 0 0 0 00:21:21 2 1 N/A Total number of neighbors 1
The prefix for Vlan2004 is originated locally by Leaf1 (due to the 'redistribute connected' statement). The prefix for Vlan2005 is already in the BGP RIB on Leaf1, having been imported from EVPN. Both are learnt on the server side:
srv1# show bgp ipv4 unicast neighbors 10.192.64.254 routes BGP table version is 24, local router ID is 1.1.1.1, vrf id 0 Default local pref 100, local AS 64600 Status codes: s suppressed, d damped, h history, * valid, > best, = multipath, i internal, r RIB-failure, S Stale, R Removed Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self Origin codes: i - IGP, e - EGP, ? - incomplete RPKI validation codes: V valid, I invalid, N Not found Network Next Hop Metric LocPrf Weight Path *> 10.192.64.0/22 10.192.64.254 0 0 14907 ? *> 10.192.80.0/22 10.192.64.254 0 14907 ?
The as-path just shows the "fake" AS configured in the 'local-as' command, as desired. The Spine and Leaf2 AS numbers would appear for 10.192.80.0/22 but we have configured the sessions to "remove-private-as" to validate that works as expected.
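The AS-path seen on the server can be reasoned about with a simplified model: private AS numbers are stripped, and the 'local-as' value is prepended in place of the real AS. The sketch below is only an approximation under those assumptions. Real implementations apply extra conditions to 'remove-private-as', and the function names here are our own.

```python
def is_private_asn(asn):
    """True for ASNs in the private-use ranges (16-bit and 32-bit)."""
    return 64512 <= asn <= 65534 or 4200000000 <= asn <= 4294967294


def advertised_as_path(rib_path, local_as, remove_private=True):
    """Simplified model of 'local-as ... no-prepend replace-as' plus 'remove-private-as'."""
    kept = [asn for asn in rib_path if not (remove_private and is_private_asn(asn))]
    # With replace-as only the configured local-as is prepended, not the real AS.
    return [local_as] + kept


if __name__ == "__main__":
    # Vlan2005 prefix as held on Leaf1: learnt via the Spine (65030) from Leaf2 (65033).
    print(advertised_as_path([65030, 65033], local_as=14907))
    # Vlan2004 prefix: originated locally on Leaf1, so the path starts empty.
    print(advertised_as_path([], local_as=14907))
```

In both cases the model yields a path containing only 14907, which is what the server shows.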
IPv4 carrying IPv4 & IPv6 address families
Configuration on the switch was the same as the last test, but the IPv6 unicast address family was also activated for the server peer:
router bgp 65032 vrf Vrf_codfw neighbor 10.192.64.10 remote-as 64600 ! address-family ipv6 unicast activate remove-private-as !
With the same done server-side, the BGP session restarted and came back up, now also announcing IPv6 prefixes:
dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw summary BGP router identifier 10.0.1.24, local AS number 65032 Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State/PfxRcd 10.192.64.10 4 64600 144556 173488 0 0 00:06:06 1 Total number of neighbors 1 Total number of neighbors established 1
The prefix was received as expected:
dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes BGP routing table information for VRF default Router identifier 10.0.1.24, local AS number 65032 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> 2620:0:861:babe::1/128 2620:0:861:11c::10 0 64600 ? Total number of prefixes 1
And placed into the local VRF routing table as required:
dell-leaf1# show ipv6 route vrf Vrf_codfw 2620:0:861:babe::1 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* 2620:0:861:babe::1/128 via fe80::42a8:f0ff:fe2c:8310 Vlan2004 20/0 00:05:50
IPv4 & IPv6 each carrying their own address family
BGP Peering on Vlan segment from Anycast GW IP
IPv4 only
IPv4 carrying IPv4 & IPv6 address families
IPv4 & IPv6 each carrying their own address family
BGP Peering on Vlan segment with Anycast GW, from Unicast IP
NOT WORKING
It does not seem possible to configure an additional IP address on a Vlan interface configured with an Anycast GW.
For instance with this config on Vlan2005 on Leaf2:
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ip anycast-address 10.192.80.254/22 ipv6 anycast-address 2620:0:861:cabf::254/64
We cannot add a further IP address, unique to this switch, alongside the anycast IP (used on all switches):
dell-leaf2# configure terminal dell-leaf2(config)# interface Vlan2005 dell-leaf2(conf-if-Vlan2005)# ip address 10.192.80.253/22 secondary %Error: Primary IPv4 address is not configured for interface: Vlan2005 dell-leaf2(conf-if-Vlan2005)# ip address 10.192.80.253/22 %Error: IP overlap on same interface with IP or IP Anycast 10.192.80.254/22
This does not make much of a difference to us in general. If needed, we can likely create loopback interfaces in the overlay VRF, each with an IP unique to its switch. The only scenario where a unique address on the Vlan interface is probably needed is if we have an MC-LAG configured and need to peer with both switches over the bonded link. We are unlikely to require this at present.
BGP Re-establishment if device peered to Anycast GW moved to new switch (i.e. VM live motion)
eBGP Peering in VRF to external device
The VRF was enabled on the spine switches, each of which was connected to a Juniper device to test eBGP routing to an external element from the VRF.
Interface config on Spine1, connected to the Juniper switch was as follows:
dell-spine1# show running-configuration interface Eth1/27 ! interface Eth1/27 description link_to_lsw3-et-0/0/50 mtu 9100 speed 40000 fec none no shutdown ip vrf forwarding Vrf_codfw ip address 172.16.1.25/30 ipv6 address 2001:67c:930:400::25/64 ipv6 enable
Two BGP sessions were configured: one over IPv4 for the IPv4 address family, and one over IPv6 for the IPv6 address family:
router bgp 65030 vrf Vrf_codfw router-id 10.0.1.13 ! address-family ipv4 unicast maximum-paths 8 maximum-paths ibgp 8 ! address-family ipv6 unicast maximum-paths 8 maximum-paths ibgp 8 ! neighbor 172.16.1.26 remote-as 65034 ! address-family ipv4 unicast activate ! neighbor 2001:67c:930:400::26 remote-as 65034 ! address-family ipv4 unicast ! address-family ipv6 unicast activate
BGP established as expected for each address family:
dell-spine1# show bgp ipv4 unicast vrf Vrf_codfw summary BGP router identifier 10.0.1.13, local AS number 65030 Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State/PfxRcd 172.16.1.26 4 65034 24829 22586 0 0 01w0d20h 1
dell-spine1# show bgp ipv6 unicast vrf Vrf_codfw summary BGP router identifier 10.0.1.13, local AS number 65030 Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State/PfxRcd 2001:67c:930:400::26 4 65034 167 150 0 0 01:04:40 1
Routes are accepted:
dell-spine1# show bgp ipv4 unicast vrf Vrf_codfw neighbors 172.16.1.26 routes BGP routing table information for VRF default Router identifier 10.0.1.13, local AS number 65030 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> 0.0.0.0/0 172.16.1.26 65034 i Total number of prefixes 1
dell-spine1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 2001:67c:930:400::26 routes BGP routing table information for VRF default Router identifier 10.0.1.13, local AS number 65030 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> ::/0 2001:67c:930:400::26 65034 i Total number of prefixes 1
And make their way into the local FIB:
dell-spine1# show ip route vrf Vrf_codfw 0.0.0.0/0 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* 0.0.0.0/0 via 172.16.1.26 Eth1/27 20/0 01w0d20h
dell-spine1# show ipv6 route vrf Vrf_codfw ::/0 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* ::/0 via fe80::2e21:31ff:fefa:f1bb Eth1/27 20/0 01:04:58
BFD Support on Peering to External Device (VRF)
BFD was enabled for the peering to srv1 in Vlan2004 as follows on the switch:
router bgp 65032 vrf Vrf_codfw neighbor 10.192.64.10 remote-as 64600 bfd
Once it was configured on the server as well the session came up:
dell-leaf1# show bfd peers vrf Vrf_codfw BFD Peers: peer 10.192.64.10 vrf Vrf_codfw interface Vlan2004 ID: 624049837 Remote ID: 3910201229 Status: up Uptime: 0 day(s), 0 hour(s), 0 min(s), 46 sec(s) Diagnostics: ok Remote diagnostics: ok Peer Type: dynamic Local timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 0ms Remote timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 50ms
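The timers in this output determine how quickly a failure is detected. In asynchronous mode the local detection time is the remote detect multiplier times the larger of the local receive interval and the remote transmit interval (per the BFD specification, RFC 5880). A minimal sketch, with a function name of our own choosing:

```python
def bfd_detection_time_ms(remote_detect_mult, local_rx_ms, remote_tx_ms):
    """Detection time in asynchronous mode, following RFC 5880 section 6.8.4."""
    # The agreed interval is whichever side is slower: what we can receive
    # versus what the peer wants to transmit.
    agreed_interval_ms = max(local_rx_ms, remote_tx_ms)
    return remote_detect_mult * agreed_interval_ms


if __name__ == "__main__":
    # Values from 'show bfd peers': multiplier 3, 300ms receive and transmit.
    print(bfd_detection_time_ms(remote_detect_mult=3, local_rx_ms=300, remote_tx_ms=300))
```

With the timers shown here the session would be declared down after 900ms without packets from the peer.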
BFD Support on BGP peering in global table
BFD was set up in the underlay network / global table on the BGP peerings between Spine1 and the two Leaf devices. It simply needs the keyword 'bfd' added under the BGP neighbor definition:
router bgp 65030 neighbor 172.16.1.10 remote-as 65033 bfd
Once configured on both sides, the sessions come up:
dell-spine1# show bfd peers brief Session Count: 2 SessionId LocalAddress PeerAddress Status Vrf ========= ============ =========== ====== === 2747556315 172.16.1.9 172.16.1.10 UP default 1863681777 172.16.1.1 172.16.1.2 UP default
dell-spine1# show bfd peers BFD Peers: peer 172.16.1.10 vrf default interface Eth1/31 ID: 2747556315 Remote ID: 1404963392 Status: up Uptime: 0 day(s), 0 hour(s), 1 min(s), 27 sec(s) Diagnostics: ok Remote diagnostics: ok Peer Type: dynamic Local timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 0ms Remote timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 50ms peer 172.16.1.2 vrf default interface Eth1/32 ID: 1863681777 Remote ID: 2448635902 Status: up Uptime: 0 day(s), 0 hour(s), 5 min(s), 53 sec(s) Diagnostics: ok Remote diagnostics: ok Peer Type: dynamic Local timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 0ms Remote timers: Detect-multiplier: 3 Receive interval: 300ms Transmission interval: 300ms Echo transmission interval: 50ms
BGP Route propagation from unicast peer into EVPN and into remote VRF table
IPv4
We learn 1.1.1.1/32 from srv1 on Leaf1:
dell-leaf1# show bgp ipv4 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes BGP routing table information for VRF default Router identifier 10.0.1.24, local AS number 65032 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> 1.1.1.1/32 10.192.64.10 0 64600 ? Total number of prefixes 1
We can see this being learnt on Leaf 2 as an EVPN type-5 route, next-hop the Leaf1 VTEP IP:
dell-leaf2# show bgp l2vpn evpn route detail | find 1.1.1.1 BGP routing table entry for 10.0.1.24:5096:[5]:[0]:[32]:[1.1.1.1] Paths: (2 available, best #1) Advertised to non peer-group peers: 172.16.1.9 172.16.1.13 Route [5]:[0]:[32]:[1.1.1.1] VNI 404000 65030 65032 64600 10.10.10.1 from 172.16.1.9 (10.0.1.13) Origin incomplete, valid, external, best (Router ID) Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 15 15:14:11 2022 Route [5]:[0]:[32]:[1.1.1.1] VNI 404000 65030 65032 64600 10.10.10.1 from 172.16.1.13 (10.0.1.14) Origin incomplete, valid, external Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 15 15:14:11 2022
This is correctly added to the local VRF routing table on Leaf 2:
dell-leaf2# show ip route vrf Vrf_codfw 1.1.1.1 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* 1.1.1.1/32 via 10.10.10.1 Vlan4000 20/0 00:26:49
We can send packets from srv2 (connected to Leaf 2 on Vlan2005) and get a response from 1.1.1.1 on srv1 (connected to Leaf1):
ppaul@srv2:~$ mtr -n -r 1.1.1.1 -c 3 Start: 2022-04-15T15:30:42+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 10.192.80.254 0.0% 3 0.2 0.3 0.2 0.3 0.0 2.|-- 10.192.64.254 0.0% 3 0.3 0.3 0.3 0.3 0.0 3.|-- 1.1.1.1 0.0% 3 0.3 0.3 0.3 0.3 0.0
IPv6
We learn 2620:0:861:babe::1/128 via BGP from srv1 connected over Vlan2004 on Leaf1:
dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes BGP routing table information for VRF default Router identifier 10.0.1.24, local AS number 65032 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> 2620:0:861:babe::1/128 2620:0:861:11c::10 0 64600 ? Total number of prefixes 1
This is received on Leaf 2 as an EVPN type-5 with next-hop of Leaf1's VTEP IP:
dell-leaf2# show bgp l2vpn evpn route detail | find babe BGP routing table entry for 10.0.1.24:5096:[5]:[0]:[128]:[2620:0:861:babe::1] Paths: (2 available, best #1) Advertised to non peer-group peers: 172.16.1.9 172.16.1.13 Route [5]:[0]:[128]:[2620:0:861:babe::1] VNI 404000 65030 65032 64600 10.10.10.1 from 172.16.1.9 (10.0.1.13) Origin incomplete, valid, external, best (Router ID) Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 15 15:28:43 2022 Route [5]:[0]:[128]:[2620:0:861:babe::1] VNI 404000 65030 65032 64600 10.10.10.1 from 172.16.1.13 (10.0.1.14) Origin incomplete, valid, external Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03 Last update: Fri Apr 15 15:28:43 2022
It's added to the local VRF routing table on Leaf 2:
dell-leaf2# show ipv6 route vrf Vrf_codfw 2620:0:861:babe::1 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* 2620:0:861:babe::1/128 via ::ffff:10.10.10.1 Vlan4000 20/0 00:11:10
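The gateway shown for the IPv6 route, ::ffff:10.10.10.1, is the IPv4-mapped IPv6 form of the Leaf1 VTEP address. When comparing routes in scripts it can be handy to normalise this back to plain IPv4. A short sketch using Python's standard `ipaddress` module (the helper name is ours):

```python
import ipaddress


def vtep_from_next_hop(next_hop):
    """Return the IPv4 VTEP address for an IPv4-mapped IPv6 next hop."""
    addr = ipaddress.ip_address(next_hop)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        return addr.ipv4_mapped
    return addr  # already plain IPv4, or a native IPv6 next hop


if __name__ == "__main__":
    print(vtep_from_next_hop("::ffff:10.10.10.1"))
```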
And we can send traffic to it from srv2 (connected to Leaf2), similar to the v4 test:
root@srv2:~# mtr -n -r -c 3 2620:0:861:babe::1 Start: 2022-04-15T15:37:51+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:cabf::254 0.0% 3 0.4 0.4 0.4 0.4 0.0 2.|-- fe80::3e2c:30ff:fe4b:903 0.0% 3 0.3 0.3 0.3 0.4 0.0 3.|-- 2620:0:861:babe::1 0.0% 3 0.3 0.3 0.3 0.3 0.0
BGP Route propagation to external hosts from VRF
IPv4
BGP peering in vrf Vrf_codfw to srv1 was set up the same way as in test 3.15.
On the server we can see that we learn the routes for Vlan2004 (originated by leaf1) and Vlan2005 (originated by leaf2 and propagated as an EVPN type 5 to leaf 1):
srv1# show bgp ipv4 unicast neighbors 10.192.64.254 routes BGP table version is 15, local router ID is 1.1.1.1, vrf id 0 Default local pref 100, local AS 64600 Status codes: s suppressed, d damped, h history, * valid, > best, = multipath, i internal, r RIB-failure, S Stale, R Removed Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self Origin codes: i - IGP, e - EGP, ? - incomplete RPKI validation codes: V valid, I invalid, N Not found Network Next Hop Metric LocPrf Weight Path *> 10.192.64.0/22 10.192.64.254 0 0 14907 65032 ? *> 10.192.80.0/22 10.192.64.254 0 14907 65032 65030 65033 ?
IPv6
With IPv6 peering to srv1 we see the same thing: we learn the /64 subnets attached to Vlan2004 (local on BGP peer Leaf 1) and also Vlan 2005 (on Leaf 2, and learnt by Leaf 1 over EVPN):
srv1# show bgp ipv6 unicast neighbors 10.192.64.254 routes BGP table version is 10, local router ID is 1.1.1.1, vrf id 0 Default local pref 100, local AS 64600 Status codes: s suppressed, d damped, h history, * valid, > best, = multipath, i internal, r RIB-failure, S Stale, R Removed Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self Origin codes: i - IGP, e - EGP, ? - incomplete RPKI validation codes: V valid, I invalid, N Not found Network Next Hop Metric LocPrf Weight Path *> 2620:0:861:11c::/64 fe80::200:ff:fe10:1010 0 0 14907 ? *> 2620:0:861:cabf::/64 fe80::200:ff:fe10:1010 0 14907 65030 65033 ?
ECMP Routing in overlay network
In modern Leaf/Spine fabrics ECMP is a key requirement to make use of multiple redundant links. To test this is working correctly for routes learnt in EVPN type 5 announcements we augmented the configuration described in section 3.19 "eBGP Peering in VRF to external device".
In that section an external device was connected to Spine1 and BGP peering established within the VRF. For this test we connected that same device to Spine2 also, and configured BGP peering from Spine2 for it, announcing the same default routes. Interface and BGP configuration was similar, and BGP established as in that test:
dell-spine2# show bgp ipv4 unicast vrf Vrf_codfw summary BGP router identifier 10.0.1.14, local AS number 65030 Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State/PfxRcd 172.16.1.30 4 65034 24762 22530 0 0 00:00:46 1 dell-spine2# show bgp ipv4 unicast vrf Vrf_codfw neighbors 172.16.1.30 routes BGP routing table information for VRF default Router identifier 10.0.1.14, local AS number 65030 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> 0.0.0.0/0 172.16.1.30 65034 i dell-spine2# show bgp ipv6 unicast vrf Vrf_codfw summary BGP router identifier 10.0.1.14, local AS number 65030 Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State/PfxRcd 2001:67c:930:401::30 4 65034 24 37 0 0 00:00:23 1 Total number of neighbors 1 Total number of neighbors established 1 dell-spine2# show bgp ipv6 unicast vrf Vrf_codfw neighbors 2001:67c:930:401::30 routes BGP routing table information for VRF default Router identifier 10.0.1.14, local AS number 65030 Route status codes: * - valid, > - best Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPref Weight Path *> ::/0 2001:67c:930:401::30 65034 i
With that set up, Spine1 and Spine2 each learn a default route in each address family from an external peer. These should then be announced to the Leaf devices as separate EVPN type 5 prefixes, with the next-hop set to the VTEP IP of Spine1 and Spine2 respectively. Ultimately the Leaf devices should then ECMP traffic to both spines based on these two routes.
Looking on Leaf2 we can see this is the case: we have 2 IPv4 and 2 IPv6 default routes, one of each from each spine:
dell-leaf2# show bgp l2vpn evpn route type prefix BGP table version is 2, local router ID is 10.0.1.25 Status codes: s suppressed, d damped, h history, * valid, > best, i - internal Origin codes: i - IGP, e - EGP, ? - incomplete EVPN type-1 prefix: [1]:[ESI]:[EthTag] EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP] EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP] EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP] EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP] Network Next Hop Metric LocPrf Weight Path Extended Community Route Distinguisher: 10.0.1.13:5096 *> [5]:[0]:[0]:[0.0.0.0] 10.10.10.3 0 65030 65034 i RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fb:01 *> [5]:[0]:[0]:[::] 10.10.10.3 0 65030 65034 i RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fb:01 Route Distinguisher: 10.0.1.14:5096 *> [5]:[0]:[0]:[0.0.0.0] 10.10.10.4 0 65030 65034 i RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fc:81 *> [5]:[0]:[0]:[::] 10.10.10.4 0 65030 65034 i RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fc:81
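The type-5 entries follow the `[5]:[EthTag]:[IPlen]:[IP]` layout given in the legend of the output above. If this output needs to be checked in a script, the key can be split on its bracketed fields. This is a rough sketch only, with hypothetical names, and it handles type-5 keys but no other route type:

```python
import ipaddress
import re
from typing import NamedTuple


class Type5Route(NamedTuple):
    eth_tag: int
    prefix: str


def parse_type5(key):
    """Parse an EVPN type-5 key of the form [5]:[EthTag]:[IPlen]:[IP]."""
    # Match on brackets rather than splitting on ':' so IPv6 addresses survive.
    fields = re.findall(r"\[([^\]]*)\]", key)
    if len(fields) != 4 or fields[0] != "5":
        raise ValueError(f"not a type-5 route key: {key}")
    network = ipaddress.ip_network(f"{fields[3]}/{fields[2]}")
    return Type5Route(eth_tag=int(fields[1]), prefix=str(network))


if __name__ == "__main__":
    for key in ("[5]:[0]:[0]:[0.0.0.0]", "[5]:[0]:[0]:[::]",
                "[5]:[0]:[128]:[2620:0:861:babe::1]"):
        print(key, "->", parse_type5(key))
```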
IPv4
Sticking with Leaf2 if we look at the local VRF routing table it does indeed have both defaults, and both are in the FIB indicating it should ECMP across the two next-hops:
dell-leaf2# show ip route vrf Vrf_codfw 0.0.0.0/0 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* 0.0.0.0/0 via 10.10.10.3 Vlan4000 20/0 00:09:44 * via 10.10.10.4 Vlan4000
If we move to SRV2, which is connected on an access port to Leaf2, we can see the different link addresses of the connections from Spine1 or Spine2 in a traceroute if we change the source address (and thus the ECMP hash):
root@srv2:~# mtr --address 10.192.80.10 -n -r -c 3 10.0.1.26 Start: 2022-04-28T12:15:45+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 10.192.80.254 0.0% 3 0.3 0.3 0.3 0.3 0.0 2.|-- 172.16.1.29 0.0% 3 0.3 0.4 0.3 0.4 0.0 3.|-- 10.0.1.26 0.0% 3 21.2 18.8 13.7 21.5 4.4
^ Hop 2 is Spine2 in this case.
root@srv2:~# mtr --address 10.192.80.21 -n -r -c 3 10.0.1.26 Start: 2022-04-28T12:16:01+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 10.192.80.254 0.0% 3 0.3 0.3 0.3 0.3 0.0 2.|-- 172.16.1.25 0.0% 3 0.3 0.3 0.3 0.4 0.0 3.|-- 10.0.1.26 0.0% 3 71.6 32.4 11.4 71.6 33.9
^ Hop 2 is Spine1 in this case.
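The reason a different source address can land on a different spine is that the switch hashes header fields of each packet to pick one of the equal-cost next hops. The toy model below illustrates the idea only: it uses CRC32 over the source and destination address, which is an assumption for illustration and not the hash the Trident 3 ASIC actually uses.

```python
import zlib


def pick_next_hop(src_ip, dst_ip, next_hops):
    """Toy ECMP selection: hash the flow key, then index into the next-hop list."""
    flow_key = f"{src_ip}->{dst_ip}".encode()
    return next_hops[zlib.crc32(flow_key) % len(next_hops)]


if __name__ == "__main__":
    spines = ["10.10.10.3", "10.10.10.4"]   # VTEP next hops from the VRF route table
    for source in ("10.192.80.10", "10.192.80.21"):
        print(source, "->", pick_next_hop(source, "10.0.1.26", spines))
```

The key property is that a given flow always maps to the same next hop, while different flows may spread across both. Which spine a particular source maps to in this model need not match the real switch.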
IPv6
Again if we look on Leaf2 the two EVPN routes have been added to the local VRF table and it indicates both are being used:
dell-leaf2# show ipv6 route vrf Vrf_codfw ::/0 Codes: K - kernel route, C - connected, S - static, B - BGP, O - OSPF > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware Destination Gateway Dist/Metric Uptime -------------------------------------------------------------------------------------------------------------------- B>* ::/0 via ::ffff:10.10.10.3 Vlan4000 20/0 00:15:08 * via ::ffff:10.10.10.4 Vlan4000
Doing multiple traceroutes, changing the source IP address, we can see different IPs at hop 2, depending on if the packets are sent by Leaf2 to Spine1 or Spine2:
Source: 2620:0:861:cabf::66 Start: 2022-04-28T12:34:11+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:cabf::254 0.0% 2 0.3 0.5 0.3 0.6 0.2 2.|-- fe80::e29:efff:fee1:fb01 0.0% 2 0.4 0.4 0.4 0.5 0.1 3.|-- 2606:4700:4700::1111 0.0% 2 4.2 5.9 4.2 7.7 2.4
Source: 2620:0:861:cabf::67 Start: 2022-04-28T12:34:17+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:cabf::254 0.0% 2 0.3 0.5 0.3 0.6 0.2 2.|-- fe80::e29:efff:fee1:fb01 0.0% 2 0.4 0.4 0.4 0.4 0.1 3.|-- 2606:4700:4700::1111 0.0% 2 25.1 14.9 4.7 25.1 14.4
Source: 2620:0:861:cabf::68 Start: 2022-04-28T12:34:24+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:cabf::254 0.0% 2 0.3 0.4 0.3 0.6 0.2 2.|-- fe80::e29:efff:fee1:fc81 0.0% 2 0.4 0.4 0.4 0.4 0.0 3.|-- 2606:4700:4700::1111 0.0% 2 4.3 4.6 4.3 4.9 0.4
Source: 2620:0:861:cabf::69 Start: 2022-04-28T12:34:30+0000 HOST: srv2 Loss% Snt Last Avg Best Wrst StDev 1.|-- 2620:0:861:cabf::254 0.0% 2 0.4 0.5 0.4 0.7 0.2 2.|-- fe80::e29:efff:fee1:fc81 0.0% 2 0.4 0.4 0.4 0.5 0.1 3.|-- 2606:4700:4700::1111 0.0% 2 9.4 8.7 8.0 9.4 1.0
ARP Suppression
ND Suppression
DHCP Relay & Option 82 insertion
DHCP Relay
DHCP relay can be configured on a Vlan interface, for instance like this on Vlan 2005 on Leaf2:
interface Vlan2005 description private1-f-codfw ip vrf forwarding Vrf_codfw ip dhcp-relay 10.192.64.10 vrf Vrf_codfw ip anycast-address 10.192.80.254/22
In the above case the 10.192.64.10 IP is configured on SRV1, which is connected to Vlan 2004 on Leaf1.
On SRV2, which is connected via an access port to Leaf2 on Vlan2005, we can then issue a DHCP request, which completes as desired:
root@srv2:~# dhclient -v enp59s0f0 Internet Systems Consortium DHCP Client 4.3.5 Copyright 2004-2016 Internet Systems Consortium. All rights reserved. For info, please visit https://www.isc.org/software/dhcp/ Listening on LPF/enp59s0f0/40:a8:f0:2c:31:68 Sending on LPF/enp59s0f0/40:a8:f0:2c:31:68 Sending on Socket/fallback DHCPDISCOVER on enp59s0f0 to 255.255.255.255 port 67 interval 3 (xid=0xe029ca10) DHCPREQUEST of 10.192.80.129 on enp59s0f0 to 255.255.255.255 port 67 (xid=0x10ca29e0) DHCPOFFER of 10.192.80.129 from 10.192.80.254 DHCPACK of 10.192.80.129 from 10.192.80.254 bound to 10.192.80.129 -- renewal in 147 seconds.
Looking on SRV1 we can see that this has been relayed from the IPv4 address on Vlan 2005 of Leaf2:
root@srv1:/etc/dhcp# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
14:45:58.969588 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:45:59.971058 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
14:46:01.890686 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:46:01.890952 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
14:46:08.511262 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:46:08.511501 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
This worked in this case, where Vlan2005 was configured only on Leaf2. We found, however, that it failed when Vlan 2005 was enabled on both switches and both had the same IP address configured as Anycast GW. The reason is that the source IP Leaf2 used to send the relayed DHCP packets was also configured on Leaf1, so when SRV1 replied to the request, Leaf1 tried to process the packet itself instead of sending it to Leaf2.
The relayed DHCP packet therefore needs to come from an IP address configured only on the device that sends it, so that the replies go back to the correct device. An Anycast GW IP is not suitable, as it is shared by all the switches. Unfortunately SONiC does not allow a 'secondary', unique IP to be added alongside the Anycast GW IP on a Vlan interface. To get around this limitation we created a new loopback interface on Leaf2, assigned it an IP address, and placed it in the VRF:
dell-leaf2# show running-configuration interface Loopback 2
!
interface Loopback 2
 ip vrf forwarding Vrf_codfw
 ip address 1.2.3.4/32
We then added this config to the Vlan interface to tell it to source the DHCP relays from that IP:
interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ip anycast-address 10.192.80.254/22
 ip dhcp-relay 10.192.64.10 vrf Vrf_codfw
 ip dhcp-relay source-interface Loopback2
With this in place we retried the DHCP request from SRV2. The request now arrived from source IP 1.2.3.4, but unfortunately the reply was still sent to 10.192.80.254:
root@srv1:/etc/dhcp# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
15:15:05.556947 IP 1.2.3.4.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 302
15:15:06.558359 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
Further investigation revealed that we could also add this command to the Vlan interface configuration:
interface Vlan2005
 ip dhcp-relay 10.192.64.10 vrf Vrf_codfw
 ip dhcp-relay source-interface Loopback2
 ip dhcp-relay link-select
This command caused two additional DHCP Option 82 elements to be added to the packet:
Option 82 Suboption: (5) Link selection
    Length: 4
    Link selection: 10.192.80.254
Option 82 Suboption: (11) Server ID Override
    Length: 4
    Server ID Override: 10.192.80.254
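To illustrate what these two elements look like on the wire, here is a minimal Python sketch that encodes and parses Option 82 sub-options as type-length-value records. The helper names (`encode_suboption`, `parse_option82`) are hypothetical and not part of SONiC or ISC DHCPd; the sub-option codes are those shown in the capture (5 and 11) plus the standard Circuit ID (1) and Remote ID (2).

```python
import ipaddress

# Sub-option codes inside the Relay Agent Information option (Option 82).
SUBOPTION_NAMES = {
    1: "Agent Circuit ID",
    2: "Agent Remote ID",
    5: "Link selection",
    11: "Server ID Override",
}


def encode_suboption(code: int, value: bytes) -> bytes:
    """Encode one sub-option as: 1 byte code, 1 byte length, then the value."""
    return bytes([code, len(value)]) + value


def parse_option82(payload: bytes) -> dict:
    """Walk the sub-option TLVs in an Option 82 payload."""
    result = {}
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated sub-option header")
        code, length = payload[offset], payload[offset + 1]
        value = payload[offset + 2:offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option value")
        name = SUBOPTION_NAMES.get(code, f"Unknown ({code})")
        # Sub-options 5 and 11 carry a 4-byte IPv4 address.
        if code in (5, 11) and length == 4:
            result[name] = str(ipaddress.IPv4Address(value))
        else:
            result[name] = value
        offset += 2 + length
    return result


# The Anycast GW address seen in both sub-options of the capture.
anycast_gw = ipaddress.IPv4Address("10.192.80.254").packed
payload = encode_suboption(5, anycast_gw) + encode_suboption(11, anycast_gw)
parsed = parse_option82(payload)
print(len(payload), parsed)
```

The two sub-options add 12 bytes in total (2 bytes of header plus 4 bytes of address each), which is consistent with the relayed request growing from 302 to 314 bytes between the tcpdump captures taken without and with link-select.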
With these sub-options present, our vanilla ISC DHCPd test server correctly replied to the source IP of the packet, so the replies reached Leaf2 and ultimately the end server:
root@srv1:~# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
15:12:02.011778 IP 1.2.3.4.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 314
15:12:02.011987 IP 10.192.64.10.67 > 1.2.3.4.67: BOOTP/DHCP, Reply, length 335
So if we use Anycast GW across multiple switches, each switch needs its own Loopback interface with a unique IP address, configured as the "ip dhcp-relay source-interface", and "ip dhcp-relay link-select" must also be enabled.
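As a rough sketch of how this per-switch configuration could be templated, the Python below renders the Loopback and Vlan stanzas for each leaf, using only the CLI lines shown in the tests above. The function name `render_relay_config` and the Leaf1 loopback address are hypothetical; only 1.2.3.4/32 on Leaf2 comes from the test itself.

```python
# Per-switch loopback addresses. 1.2.3.4/32 is the address used on Leaf2 in
# the test; the Leaf1 value is an invented placeholder for illustration.
RELAY_LOOPBACKS = {
    "dell-leaf1": "1.2.3.5/32",
    "dell-leaf2": "1.2.3.4/32",
}


def check_unique(loopbacks: dict) -> None:
    """The relay source IP must be unique per switch, unlike the Anycast GW."""
    if len(set(loopbacks.values())) != len(loopbacks):
        raise ValueError("each switch needs its own loopback address")


def render_relay_config(loopback_ip: str,
                        vlan: str = "Vlan2005",
                        vrf: str = "Vrf_codfw",
                        anycast_gw: str = "10.192.80.254/22",
                        dhcp_server: str = "10.192.64.10") -> str:
    """Render the config lines from the tests above for one switch."""
    lines = [
        "interface Loopback 2",
        f" ip vrf forwarding {vrf}",
        f" ip address {loopback_ip}",
        "!",
        f"interface {vlan}",
        f" ip vrf forwarding {vrf}",
        f" ip anycast-address {anycast_gw}",
        f" ip dhcp-relay {dhcp_server} vrf {vrf}",
        " ip dhcp-relay source-interface Loopback2",
        " ip dhcp-relay link-select",
    ]
    return "\n".join(lines)


check_unique(RELAY_LOOPBACKS)
for switch, loopback_ip in RELAY_LOOPBACKS.items():
    print(f"! {switch}")
    print(render_relay_config(loopback_ip))
```

Only the loopback address differs between switches; the Anycast GW and relay target stay identical everywhere.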
Option 82
The SONiC devices do insert Option 82 into DHCP messages they relay, as can be seen in the following snippet from a packet capture:
Option: (82) Agent Information Option
    Length: 29
    Option 82 Suboption: (1) Agent Circuit ID
        Length: 8
        Agent Circuit ID: 566c616e32303035
    Option 82 Suboption: (2) Agent Remote ID
        Length: 17
        Agent Remote ID: 30303a30303a30303a31303a31303a3130
Decoding these values we see the following information:
Agent Circuit ID: Vlan2005
Agent Remote ID: 00:00:00:10:10:10
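The decoding is simply hex to ASCII, as both values are sent as text strings. A minimal Python sketch (the helper name `decode_ascii_hex` is our own):

```python
def decode_ascii_hex(hex_string: str) -> str:
    """Turn a hex dump of ASCII bytes back into the text it encodes."""
    return bytes.fromhex(hex_string).decode("ascii")


# Values copied from the packet capture above.
circuit_id = decode_ascii_hex("566c616e32303035")
remote_id = decode_ascii_hex("30303a30303a30303a31303a31303a3130")

print("Agent Circuit ID:", circuit_id)
print("Agent Remote ID:", remote_id)
```

Note the Remote ID is the MAC address written out as a 17-character string rather than as 6 raw bytes, which is why its length is 17.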
The remote ID is the MAC address of the Vlan2005 interface on Leaf2 which sourced the packet:
admin@dell-leaf2:~$ ip -br link show Vlan2005
Vlan2005@Bridge UP 00:00:00:10:10:10 <BROADCAST,MULTICAST,UP,LOWER_UP>
Unfortunately the SONiC config does not seem to provide any mechanism to customize what is included in these values. Specifically, it does not seem to allow us to include the switch name and the ID of the access port the DHCP request was received on, which our install process currently uses to identify the server and assign the correct IP.
IP Filters on Routed interface
IPv4
IPv6
IP Filters on IRB interface
IPv4
IPv6
Filter access to RE/CPU/Device Services
Evaluate if there is a mechanism like the loopback filter in JunOS, or specific daemon filters (SNMP, vty ACL, NTP acls etc) on Cisco. Basically to prevent remote users trying to SSH to switch or similar. If nothing specific it can be done in-band on the external data ports, but it's trickier to implement.
Failover Tests
Spine Switch Failure
Leaf to Spine Link Failure
Management Tests
User Account Creation
SSH Access to Management
SSH Key Auth
Management VRF
We should attempt to place the dedicated management port in a specific VRF ("mgmt" for instance).
Ideally all following tests would be configured with this, and all functions would work / have a way to configure them to work when access is in mgmt vrf.
Interface    IPv4 address/mask    Master     Admin/Oper    BGP Neighbor    Neighbor IP    Flags
-----------  -------------------  ---------  ------------  --------------  -------------  -------
Ethernet72   172.16.1.6/30                   up/up         N/A             N/A
Ethernet76   172.16.1.2/30                   up/up         N/A             N/A
Loopback0    10.0.1.24/32                    up/up         N/A             N/A
Loopback1    10.10.10.1/32                   up/up         N/A             N/A
Vlan2004     10.192.64.254/22     Vrf_codfw  up/up         N/A             N/A            A
Vlan2005     10.192.80.254/22     Vrf_codfw  down/down     N/A             N/A            A
docker0      240.127.1.1/24                  up/down       N/A             N/A
eth0         10.193.0.185/16      mgmt       up/up         N/A             N/A
SNMP RO Access
Works fine: the devices were added to our LibreNMS system, which polls them via SNMP without issue.
NTP
The switch was configured to act as an NTP client in the management VRF towards our NTP servers:
dell-leaf2# show running-configuration | grep ntp
ntp server 208.80.153.77 minpoll 6 maxpoll 10
ntp server 208.80.154.10 minpoll 6 maxpoll 10
ntp server 208.80.155.108 minpoll 6 maxpoll 10
ntp vrf mgmt
Following this, NTP sync was OK:
dell-leaf2# show ntp associations
 remote          refid           st  t  when  poll  reach  delay   offset  jitter
------------------------------------------------------------------------------------------------------
 208.80.153.77   162.159.200.1   4   u  27    64    3      0.299   4.204   0.426
*208.80.154.10   170.187.158.81  3   u  40    64    17     31.728  0.097   2.565
 208.80.155.108  104.171.113.34  3   u  31    64    3      33.109  4.581   0.419
------------------------------------------------------------------------------------------------------
* master (synced), # master (unsynced), + selected, - candidate, ~ configured
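For reading the "reach" column, here is a small Python sketch. It assumes the column has the same meaning as in standard `ntpq -p` output, namely an 8-bit shift register printed in octal with one bit per poll attempt; we have not confirmed that SONiC's "show ntp associations" follows this exactly. The function name `successful_polls` is our own.

```python
def successful_polls(reach: str) -> int:
    """Count successful polls among the last eight, given an octal reach value."""
    register = int(reach, 8) & 0xFF  # keep only the 8-bit shift register
    return bin(register).count("1")


# reach values copied from the output above
reach_by_server = {
    "208.80.153.77": "3",
    "208.80.154.10": "17",
    "208.80.155.108": "3",
}

for server, reach in reach_by_server.items():
    print(server, successful_polls(reach))
```

Under that assumption, the synced server (reach 17) had answered four polls when the output was taken and the other two had answered two each, which is what one would expect shortly after NTP was configured.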
LLDP
Server Side
The server can see the switch and learns the switch port correctly:
root@srv1:~# lldpcli show neighbors ports eno1
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eno1, via: LLDP, RID: 1, Time: 0 day, 00:00:01
  Chassis:
    ChassisID: mac 3c:2c:30:4b:09:00
    SysName: dell-leaf1
    SysDescr: SONiC Software Version: SONiC.3.4.1-Enterprise_Base - HwSku: DellEMC-S5248f-P-25G-DPB - Distribution: Debian 9.13 - Kernel: 4.9.0-11-2-amd64
    MgmtIP: 10.193.0.185
    Capability: Bridge, off
    Capability: Router, on
    Capability: Wlan, off
    Capability: Station, on
  Port:
    PortID: local Ethernet32
    PortDescr: test_srv1_1GTBase
    TTL: 120
-------------------------------------------------------------------------------
Switch Side
The switch can also see the server details:
dell-leaf1# show lldp neighbor Ethernet 32
-----------------------------------------------------------
LLDP Neighbors
-----------------------------------------------------------
Interface: Ethernet32,via: LLDP
  Chassis:
    ChassisID: d0:94:66:86:4d:6c
    SysName: srv1
    SysDescr: Ubuntu 18.04.6 LTS Linux 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64
    TTL: 120
    MgmtIP: 1.1.1.1
    MgmtIP: 2620:0:861:11c::10
  Port
    PortID: d0:94:66:86:4d:6c
    PortDescr: eno1
-----------------------------------------------------------
Per-interface LLDP control
The command "no lldp enable" was added to the switch config for Ethernet32. After a short time the switch details were no longer visible from the server:
root@srv1:~# lldpcli show neighbors ports eno1
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
root@srv1:~#
The same goes for the switch side:
dell-leaf1# show lldp neighbor Ethernet 32
-----------------------------------------------------------
LLDP Neighbors
-----------------------------------------------------------
dell-leaf1#
sFlow Export
Set up sFlow export. At a minimum I guess we could just capture the packets with tcpdump and compare to what the Junipers send. We could also try to set up pmacct somewhere or something like that to validate the flow data is ok.
Prometheus Export
Set up telegraf etc. as per their guide. We can test with curl; we don't need to actually set up our Prometheus to scrape it.
Puppet Agent
It would be interesting to test the puppet agent compatibility. We may not go down that road but good to know.
Automation Tests
RESTCONF
Basic tests to make sure we can talk to the interface and apply config.
Partial config replace
That is, validate that we can do a replace on a specific section, e.g. "bgp", without touching the entire config.