SONiC/VXLAN-EVPN Network Testing - Sonic on Dell switches

From Wikitech

This page details the individual tests carried out to validate the Dell hardware / Sonic OS platform. For further information, please see the Dell Enterprise Sonic Evaluation write-up.

Test Topology

This test was done on four Dell switches: two S5232F devices acting as 'Spines' and two S5248Fs operating as Leaf/edge switches. All are based on the Broadcom Trident 3 ASIC and ran Dell SONiC-OS-3.4.1-Enterprise_Base. One server is connected to leaf1 on port Ethernet 0 and a second server is connected to leaf2 on port Eth1/1.

Physical Tests

Transceiver Compatibility

FS.com 40GBase-PSM4 QSFP+

   dell-spine1# show interface transceiver Ethernet 4
   
   Ethernet4
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   connector-type           :  MPO
   date-code                :  2020-08-13
   form-factor              :  QSFP+
   cable-length(m)          :  0
   display-name             :  QSFP+ 40GBASE-PSM4
   max-module-power(Watts)  :  3.5
   max-port-power(Watts)    :  5
   vendor-oui               :  44-7C-7F
   present                  :  PRESENT
   serial-no                :  F1930364698
   vendor                   :  FS
   vendor-part              :  QSFP-PLR4-40G
   vendor-rev               :  1B

Finisar 10GBase-LR SFP+

   dell-leaf2# show interface transceiver Ethernet 0
   
   Ethernet0
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   connector-type           :  LC
   date-code                :  2014-07-19
   form-factor              :  SFP+
   cable-length(m)          :  0
   display-name             :  SFP+ 10GBASE-LR
   max-module-power(Watts)  :  2
   max-port-power(Watts)    :  2.5
   vendor-oui               :  00-90-65
   present                  :  PRESENT
   serial-no                :  AS32MX4
   vendor                   :  FINISAR CORP.
   vendor-part              :  FTLX1471D3BCL
   vendor-rev               :  A


FS.com 10GBase-LR SFP+

   dell-leaf2# show interface transceiver Ethernet 1
   
   Ethernet1
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   connector-type           :  LC
   date-code                :  2018-05-04
   form-factor              :  SFP+
   cable-length(m)          :  0
   display-name             :  SFP+ 10GBASE-LR
   max-module-power(Watts)  :  2
   max-port-power(Watts)    :  2.5
   vendor-oui               :  00-90-65
   present                  :  PRESENT
   serial-no                :  G1804036447
   vendor                   :  FiberStore
   vendor-part              :  SFP-10GLR-31
   vendor-rev               :  A

Juniper 10G 3m DAC 10GBASE-CR-DAC-3.0M

   dell-leaf2# show interface transceiver Ethernet 2
   
   Ethernet2
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   date-code                :  2011-11-29
   form-factor              :  SFP+
   cable-length(m)          :  3
   display-name             :  SFP+ 10GBASE-CR-DAC-3.0M
   max-module-power(Watts)  :  2
   max-port-power(Watts)    :  2.5
   vendor-oui               :  41-50-48
   present                  :  PRESENT
   serial-no                :  APF11460020W1Y
   vendor                   :  Amphenol
   vendor-part              :  584990002
   vendor-rev               :  A


FS.com 10G 3m DAC

   dell-leaf2# show interface transceiver Ethernet 4
   
   Ethernet4
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   date-code                :  2019-06-12
   form-factor              :  SFP+
   cable-length(m)          :  3
   display-name             :  SFP+ 10GBASE-CR-DAC-3.0M
   max-module-power(Watts)  :  2
   max-port-power(Watts)    :  2.5
   vendor-oui               :  00-00-00
   present                  :  PRESENT
   serial-no                :  G1906269417-1
   vendor                   :  FS
   vendor-part              :  SFPP-PC03
   vendor-rev               :  A

Juniper 40GBase-SR4 QSFP+

   dell-spine1# show interface transceiver Ethernet 12
   
   Ethernet12
   ---------------------------------------------------------------------
   Attribute                :  Value/State
   ---------------------------------------------------------------------
   connector-type           :  MPO
   date-code                :  2013-12-04
   form-factor              :  QSFP+
   cable-length(m)          :  0
   display-name             :  QSFP+ 40GBASE-SR4
   max-module-power(Watts)  :  1.5
   max-port-power(Watts)    :  5
   vendor-oui               :  00-17-6A
   present                  :  PRESENT
   serial-no                :  QD491487
   vendor                   :  AVAGO
   vendor-part              :  AFBR-79EQDZ-JU1
   vendor-rev               :  01

FS.com SFP 1000BASE-T

The switch is able to see the transceiver, but it does not allow the interface to be set to 1G directly:

Ethernet32   SFP 1000BASE-T              1.5   | 2.5     Fiberstore     G1804219349      SFP-GB-GE-T    N/A        N/A          Ready

To set the interface to 1G, you need to change the portgroup first from 25G to 10G, then change the port speed to 1G:

sudo config portgroup speed 9 10000
dell-leaf1(conf-if-Ethernet32)# speed 1000
interface Ethernet32
description test_srv1_1GTBase
mtu 9100
speed 1000
fec none
no shutdown
   dell-leaf1# show interface Ethernet 32
        
   Ethernet32 is up, line protocol is up
   Hardware is Eth
   Description: test_srv1_1GTBase
   Mode of IPV4 address assignment: not-set
   Mode of IPV6 address assignment: not-set
   Interface IPv6 oper status: Disabled
   IP MTU 9100 bytes
   LineSpeed 1GB, Auto-negotiation off
   FEC: DISABLED
   Last clearing of "show interface" counters: never
   10 seconds input rate 0 packets/sec, 0 bits/sec, 0 Bytes/sec
   10 seconds output rate 0 packets/sec, 0 bits/sec, 0 Bytes/sec
   Input statistics:
           28 packets, 2072 octets
           28 Multicasts, 0 Broadcasts, 0 Unicasts
           0 error, 28 discarded, 0 Oversize
           0 Packets (128 to 255 Octects)
   Output statistics:
           2379 packets, 620868 octets
           2379 Multicasts, 0 Broadcasts, 0 Unicasts
           0 error, 0 discarded, 0 Oversize


   dell-leaf1# show mac address-table interface Ethernet 32
   -----------------------------------------------------------
   VLAN         MAC-ADDRESS         TYPE         INTERFACE           
   -----------------------------------------------------------
   2004        D0:94:66:86:4D:6C   DYNAMIC       Ethernet32

'Breakout' Transceivers

4x10G QSFP+ in 'Breakout' Mode

The switch is able to see the breakout cable on both ends. The version used is PSM4:

Ethernet0    QSFP+ 40GBASE-PSM4          3.5   | 5.0     FS             F1930364699      QSFP-PLR4-40G  N/A        N/A          Ready
Ethernet16   SFP+ 10GBASE-LR             2.0   | 2.5     FINISAR CORP.  AP40JAN          FTLX1471D3BCL  N/A        N/A          Ready

Command for breakout:

sudo config interface breakout Ethernet0 '4x10G'

After running Logic to limit the impact
Final list of ports to be deleted :
 {
    "Ethernet0": "100000"
}
Final list of ports to be added :
 {
    "Ethernet2": "10000",
    "Ethernet3": "10000",
    "Ethernet0": "10000",
    "Ethernet1": "10000"
}

dell-spine1# show interface breakout
-----------------------------------------------
Port  Breakout Mode  Status        Interfaces
-----------------------------------------------
1/1   4x10G          Completed     Eth1/1/1
                                   Eth1/1/2
                                   Eth1/1/3
                                   Eth1/1/4

4x25G QSFP28 in 'Breakout' Mode

We are not sure we really need this, but it would be good to know if it is an option. I believe it can be done with 100GBase-PSM4 -> 4x25GBase-LR:

https://community.fs.com/blog/brief-introduction-to-qsfp-100g-psm4-optical-transceiver.html

100GBase-SR4 should also work over MMF to 4x25GBase-SR

Or otherwise a DAC/AOC:

https://www.fs.com/de-en/products/116289.html

https://www.fs.com/products/70439.html

Transceiver DOM Support

Digital Optical Monitoring is the standard that transceiver modules use to send statistics to the element they are plugged into (i.e. the switch).

We need to validate that we can see this data on the Dell switches, particularly the Rx light level which is important for longer optical links.

SFP+ / SFP28

Works ok:

   dell-leaf1# show interface transceiver dom Ethernet 0
    
   -----------------------------------------------------------------------
   Ethernet0     
   -----------------------------------------------------------------------
       Identifier: SFP
       Vendor Name: FINISAR CORP.
       Vendor Part: FTLX1471D3BCL
       ChannelMonitorValues:
           Rx1Power: -2.0094 dBm
           Tx1Bias: 41.5540 mA
           Tx1Power: -1.5009 dBm
       ChannelThresholdValues:
           RxPowerHighAlarm  : 2.5001 dBm
           RxPowerHighWarning: 2.0000 dBm
           RxPowerLowAlarm   : -20.0000 dBm
           RxPowerLowWarning : -18.0134 dBm
           TxBiasHighAlarm   : 85.0000 mA
           TxBiasHighWarning : 80.0000 mA
           TxBiasLowAlarm    : 15.0000 mA
           TxBiasLowWarning  : 20.0000 mA
           TxPowerHighAlarm  : 2.0000 dBm
           TxPowerHighWarning: 0.9999 dBm
           TxPowerLowAlarm   : -7.9997 dBm
           TxPowerLowWarning : -7.0006 dBm
       ModuleMonitorValues:
           Temperature: 36.6055 C
           Vcc: 3.3757 Volts
       ModuleThresholdValues:
           TempHighAlarm  : 78.0000 C
           TempHighWarning: 73.0000 C
           TempLowAlarm   : -13.0000 C
           TempLowWarning : -8.0000 C
           VccHighAlarm   : 3.7000 Volts
           VccHighWarning : 3.6000 Volts
           VccLowAlarm    : 2.9000 Volts
           VccLowWarning  : 3.0000 Volts
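
Since the Rx light level is the value we care about most, a reading like the one above can be compared against the module's own thresholds automatically. The following is a minimal sketch only: `parse_dom` and `rx_status` are hypothetical helper names, and it assumes the DOM output keeps the `Name: value unit` layout shown above.

```python
import re

# Sample lines copied from the Ethernet0 DOM output above.
DOM_TEXT = """\
Rx1Power: -2.0094 dBm
RxPowerHighAlarm  : 2.5001 dBm
RxPowerHighWarning: 2.0000 dBm
RxPowerLowAlarm   : -20.0000 dBm
RxPowerLowWarning : -18.0134 dBm
"""

def parse_dom(text):
    """Return {field: float} for every 'Name: value unit' line (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*:\s*(-?inf|-?\d+(?:\.\d+)?)\s+\S+", line)
        if m:
            # float() also accepts "-inf", which the switch prints for a dark lane.
            values[m.group(1)] = float(m.group(2))
    return values

def rx_status(values, channel=1):
    """Classify the Rx power of one channel against the module thresholds."""
    rx = values[f"Rx{channel}Power"]
    if rx <= values["RxPowerLowAlarm"] or rx >= values["RxPowerHighAlarm"]:
        return "alarm"
    if rx <= values["RxPowerLowWarning"] or rx >= values["RxPowerHighWarning"]:
        return "warning"
    return "ok"

dom = parse_dom(DOM_TEXT)
print(rx_status(dom))
```

With the values above the Rx level of -2.0094 dBm sits well inside the warning range, so the sketch reports it as healthy.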

QSFP+ / QSFP28

   dell-spine1# show interface transceiver dom Eth1/1/1
    
   -----------------------------------------------------------------------
   Eth1/1/1      
   -----------------------------------------------------------------------
       Identifier: QSFP+
       Vendor Name: FS
       Vendor Part: QSFP-PLR4-40G
       ChannelMonitorValues:
           Rx1Power: -1.7289 dBm
           Rx2Power: -2.6898 dBm
           Rx3Power: -2.3980 dBm
           Rx4Power: -inf dBm

(Note: the above is in breakout mode with only 3 ends connected, so it makes sense that Rx4Power shows -inf dBm, i.e. zero received optical power.)
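
The -inf dBm reading follows from how dBm is defined: it is ten times the base-10 logarithm of the power in milliwatts, so a lane receiving no light has no finite dBm value. A short illustrative sketch (the function names are ours, not part of any switch tooling):

```python
import math

def mw_to_dbm(power_mw):
    """Convert optical power in milliwatts to dBm."""
    if power_mw == 0:
        # log10(0) is undefined; zero power is reported as -inf dBm.
        return float("-inf")
    return 10 * math.log10(power_mw)

def dbm_to_mw(power_dbm):
    """Convert dBm back to milliwatts."""
    return 10 ** (power_dbm / 10)

print(mw_to_dbm(0.0))          # -inf, as shown for Rx4Power
print(dbm_to_mw(-1.7289))      # Rx1Power above, expressed in mW
```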

Functional Tests

L2 Access port

dell-leaf1# show running-configuration interface Ethernet 0
interface Ethernet0
 description test_srv1
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport access Vlan 2004

dell-leaf1# show running-configuration interface Eth1/1
interface Eth1/1
 description test_Srv2
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport access Vlan 2005
interface Vlan2004
description private1-e-codfw
ip vrf forwarding Vrf_codfw
ip anycast-address 10.192.64.254/22
interface Vlan2005
description private1-f-codfw
ip vrf forwarding Vrf_codfw
ip anycast-address 10.192.80.254/22

Test server 1 in vlan 2004

ppaul@srv1:~$ ip -br link show enp59s0f0
enp59s0f0        UP             40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>

The MAC is learnt on leaf1, where test srv1 is connected to port Ethernet0, and is shown in the local Vlan table:

   dell-leaf1# show mac address-table Vlan 2004
   -----------------------------------------------------------
   VLAN         MAC-ADDRESS         TYPE         INTERFACE           
   -----------------------------------------------------------
   2004        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0         


As well as being associated with the attached EVPN VNI:

dell-leaf1# show evpn mac vni 102004
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN  Seq #'s
00:00:00:10:10:10 local  Vlan2004              2004  0/0
40:a8:f0:2c:83:10 local  Ethernet0             2004  0/0

L2 Trunk port

Interface config:

   dell-leaf1# show running-configuration interface Ethernet 0
   !
   interface Ethernet0
    description test_srv1
    mtu 9100
    speed 10000
    fec none
    no shutdown
    switchport trunk allowed Vlan 2004-2005


With the server configured with an 802.1q sub-interface for each Vlan, MACs were learnt correctly on the switch:

   dell-leaf1# show mac address-table interface Ethernet 0
   -----------------------------------------------------------
   VLAN         MAC-ADDRESS         TYPE         INTERFACE           
   -----------------------------------------------------------
   2004        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0           
   2005        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0

Routing to end host from Vlan interface in VRF

IPv4

Server connected via L2 access/trunk port. Switch config:

   interface Vlan2004
    description private1-e-codfw
    ip vrf forwarding Vrf_codfw
    ip anycast-address 10.192.64.254/22
    ipv6 anycast-address 2620:0000:0861:011c::254/64
   

10.192.64.10 is the IP address of test server 1. Pinging it from leaf1 gives:

dell-leaf1# ping vrf Vrf_codfw 10.192.64.10
ping: Warning: source address might be selected on device other than Vrf_codfw.
PING 10.192.64.10 (10.192.64.10) from 10.192.64.254 Vrf_codfw: 56(84) bytes of data.
64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.312 ms
64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.302 ms
^C
--- 10.192.64.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1006ms

IPv6

The server is configured with IPv6 global unicast address 2620:0000:0861:011c::10, which we also get a response from:

   dell-leaf1# ping vrf Vrf_codfw 2620:0000:0861:011c::10
   ping6: Warning: source address might be selected on device other than Vrf_codfw.
   PING 2620:0000:0861:011c::10(2620:0:861:11c::10) from 2620:0:861:11c::254 Vrf_codfw: 56 data bytes
   64 bytes from 2620:0:861:11c::10: icmp_seq=1 ttl=64 time=0.385 ms
   64 bytes from 2620:0:861:11c::10: icmp_seq=2 ttl=64 time=0.305 ms


Routed port in VRF

Config added as follows to switch port:

   interface Ethernet2
    description srv1_nic2
    mtu 9100
    speed 10000
    fec none
    no shutdown
    ip vrf forwarding Vrf_codfw
    ip address 192.0.2.8/31
    ipv6 address 2620:0000:0861:011e::1/64

Server port IPs were set as follows:

   root@srv1:~# ip -br addr show dev enp59s0f1
   enp59s0f1        UP             192.0.2.9/31 2620:0:861:11e::2/64 fe80::42a8:f0ff:fe2c:8314/64 

IPv4

IPv4 pingable both sides:

   dell-leaf1# ping vrf Vrf_codfw -I 192.0.2.8 192.0.2.9
   PING 192.0.2.9 (192.0.2.9) from 192.0.2.8 Vrf_codfw: 56(84) bytes of data.
   64 bytes from 192.0.2.9: icmp_seq=1 ttl=64 time=0.265 ms
   64 bytes from 192.0.2.9: icmp_seq=2 ttl=64 time=0.250 ms
   root@srv1:~# ping 192.0.2.8
   PING 192.0.2.8 (192.0.2.8) 56(84) bytes of data.
   64 bytes from 192.0.2.8: icmp_seq=1 ttl=64 time=0.171 ms
   64 bytes from 192.0.2.8: icmp_seq=2 ttl=64 time=0.185 ms
   

IPv6

Same with IPv6:

   dell-leaf1# ping vrf Vrf_codfw -I 2620:0000:0861:011e::1  2620:0000:0861:011e::2
   PING 2620:0000:0861:011e::2(2620:0:861:11e::2) from 2620:0:861:11e::1 Vrf_codfw: 56 data bytes
   64 bytes from 2620:0:861:11e::2: icmp_seq=1 ttl=64 time=0.312 ms
   64 bytes from 2620:0:861:11e::2: icmp_seq=2 ttl=64 time=0.247 ms
   root@srv1:~# ping 2620:0000:0861:011e::1
   PING 2620:0000:0861:011e::1(2620:0:861:11e::1) 56 data bytes
   64 bytes from 2620:0:861:11e::1: icmp_seq=1 ttl=64 time=0.209 ms
   64 bytes from 2620:0:861:11e::1: icmp_seq=2 ttl=64 time=0.214 ms

Jumbo frames across L2 Vlan

Same Switch

Remote via VXLAN

Jumbo frames L3 routing

Same Switch

Switch port had MTU set to 9100, Vlan interface did not have a specific MTU configured but it defaults to 9100:

interface Ethernet0
 description test_srv1
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport trunk allowed Vlan 2004-2005
dell-leaf1# show running-configuration interface Vlan 2004
!
interface Vlan2004
 description private1-e-codfw
 ip vrf forwarding Vrf_codfw
 ip anycast-address 10.192.64.254/22
 ipv6 anycast-address 2620:0000:0861:011c::254/64
dell-leaf1# show interface Vlan 2004
     
Vlan2004 is up, line protocol is up
Description: private1-e-codfw
Mode of IPV4 address assignment: not-set
Mode of IPV6 address assignment: not-set
Interface IPv6 oper status: Disabled
IP MTU 9100 bytes


Server also had MTU set to 9100:

root@srv1:~# ip -4 -d addr show enp59s0f0.2004
6: enp59s0f0.2004@enp59s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP group default qlen 1000
    link/ether 40:a8:f0:2c:83:10 brd ff:ff:ff:ff:ff:ff promiscuity 0 
    vlan protocol 802.1Q id 2004 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    inet 10.192.64.10/22 scope global enp59s0f0.2004
       valid_lft forever preferred_lft forever

With this config we can ping the Vlan interface on the switch with a 9000-byte ICMP payload (a 9028-byte IP packet) and the DF bit set:

root@srv1:~# ping -s 9000 -M do 10.192.64.254
PING 10.192.64.254 (10.192.64.254) 9000(9028) bytes of data.
9008 bytes from 10.192.64.254: icmp_seq=1 ttl=64 time=0.244 ms
9008 bytes from 10.192.64.254: icmp_seq=2 ttl=64 time=0.247 ms
9008 bytes from 10.192.64.254: icmp_seq=3 ttl=64 time=0.260 ms
9008 bytes from 10.192.64.254: icmp_seq=4 ttl=64 time=0.265 ms
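
The `9000(9028)` figure in the ping output is the ICMP payload plus the ICMP and IPv4 headers. The arithmetic can be sketched as follows, assuming an IPv4 header with no options; the function names are illustrative only.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
ICMP_HEADER = 8    # bytes, ICMP echo header

def ip_packet_size(icmp_payload):
    """Total IPv4 packet size for a ping with the given -s payload."""
    return icmp_payload + ICMP_HEADER + IPV4_HEADER

def max_icmp_payload(mtu):
    """Largest ping -s value that fits in one packet at the given IP MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER

print(ip_packet_size(9000))    # 9028, matching the ping output above
print(max_icmp_payload(9100))  # 9072
```

So with the 9100-byte MTU used here, a payload of up to 9072 bytes should fit without fragmentation, and the 9000-byte test sits comfortably below that.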

Remote via VXLAN

Remote MAC Learning on Vlan/VNI

MAC 40:A8:F0:2C:83:10 is being learnt from srv1 connected to leaf1 on Vlan2004.

Logging on to leaf2 we can see it is receiving an EVPN type 2 route corresponding to this MAC:

   dell-leaf2# show bgp l2vpn evpn route detail 
   BGP routing table entry for 10.0.1.24:2004:[2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10]
   Paths: (2 available, best #1)
     Advertised to non peer-group peers:
     172.16.1.9 172.16.1.13
     Route [2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10] VNI 102004/404000
     65030 65032
       10.10.10.1 from 172.16.1.9 (10.0.1.13)
         Origin IGP, valid, external, best (Router ID)
         Extended Community: RT:65032:102004 RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
         Last update: Fri Apr  1 11:27:40 2022
     Route [2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10] VNI 102004/404000
     65030 65032
       10.10.10.1 from 172.16.1.13 (10.0.1.14)
         Origin IGP, valid, external
         Extended Community: RT:65032:102004 RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
         Last update: Fri Apr  1 11:27:40 2022
   

This is processed correctly and added to the local EVPN database:

   dell-leaf2# show evpn mac vni 102004 mac 40:A8:F0:2C:83:10
   MAC: 40:a8:f0:2c:83:10
    Remote VTEP: 10.10.10.1
    Local Seq: 0 Remote Seq: 0
    Kernel Add: Success, Add ReAttempt:0
    Neighbors:
       10.192.64.10 Active
       2620:0:861:11c::10 Active
       fe80::42a8:f0ff:fe2c:8310 Active

The MAC is also visible in the local L2 forwarding table for Vlan 2004. The "interface" it was "learnt" on shows as VxLAN with the IP of the remote VTEP, as expected:

   dell-leaf2# show mac address-table Vlan 2004
   -----------------------------------------------------------
   VLAN         MAC-ADDRESS         TYPE         INTERFACE           
   -----------------------------------------------------------
   2004        40:A8:F0:2C:83:10   DYNAMIC       VxLAN DIP: 10.10.10.1
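
The type 2 route key shown earlier in this section packs several fields into bracketed groups: route type, Ethernet tag, MAC length, MAC, and optionally IP length and IP. When checking many routes it can help to split these programmatically. The parser below is a rough sketch based only on the format visible in the output above; `parse_type2` is a hypothetical helper, not part of SONiC.

```python
import re

def parse_type2(prefix):
    """Split an EVPN type 2 route key into its bracketed fields."""
    fields = re.findall(r"\[([^\]]*)\]", prefix)
    if len(fields) < 4 or fields[0] != "2":
        raise ValueError("not a type 2 (MAC/IP) route: %r" % prefix)
    route = {
        "eth_tag": int(fields[1]),
        "mac_len": int(fields[2]),
        "mac": fields[3],
    }
    # MAC-only routes stop here; MAC+IP routes carry two more fields.
    if len(fields) >= 6:
        route["ip_len"] = int(fields[4])
        route["ip"] = fields[5]
    return route

key = "10.0.1.24:2004:[2]:[0]:[48]:[40:a8:f0:2c:83:10]:[32]:[10.192.64.10]"
print(parse_type2(key))
```

The route distinguisher prefix (10.0.1.24:2004 here) contains no brackets, so it is simply ignored by this sketch.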

Client to Client L2 Unicast forwarding

Same Switch (Pure L2)

Ethernet0 and Ethernet2 on Leaf1 were configured as trunks allowing the same Vlans:

   interface Ethernet0
    description test_srv1
    mtu 9100
    speed 10000
    fec none
    no shutdown
    switchport trunk allowed Vlan 2004-2005
   interface Ethernet2
    description srv1_nic2
    mtu 9100
    speed 10000
    fec none
    no shutdown
    switchport trunk allowed Vlan 2004-2005

Both of these ports were connected to the same server, so in order to make it work the second sub-interface was added to a different Linux network namespace:

   root@srv1:~# ip -br link show enp59s0f0.2004
   enp59s0f0.2004@enp59s0f0 UP             40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>
   root@srv1:~# ip -br addr show enp59s0f0.2004
   enp59s0f0.2004@enp59s0f0 UP             10.192.64.10/24 2620:0:861:11c::10/64 fe80::42a8:f0ff:fe2c:8310/64
   root@srv1:~# ip route get 10.192.64.30
   10.192.64.30 dev enp59s0f0.2004 src 10.192.64.10 uid 0


   root@srv1:~# ip netns exec TESTNS ip -br link show dev enp59s0f1.2004
   enp59s0f1.2004@if5 UP             40:a8:f0:2c:83:14 <BROADCAST,MULTICAST,UP,LOWER_UP>
   root@srv1:~# ip netns exec TESTNS ip -br addr show dev enp59s0f1.2004
   enp59s0f1.2004@if5 UP             10.192.64.30/24 fe80::42a8:f0ff:fe2c:8314/64
   root@srv1:~# ip netns exec TESTNS ip route get 10.192.64.20
   10.192.64.20 dev enp59s0f1.2004 src 10.192.64.30 uid 0


This use of separate namespaces on the system isolates the two ports, so the "default" namespace, with a port connected to Ethernet0, does not know about the interface in the other one, which is connected to Ethernet2. As such, when we send frames from the default namespace to the IP configured on the TESTNS port, they are sent out to switch port Ethernet0, and the switch should forward them via Ethernet2, looping back to the same server. We only had a single server, so this simulates having two separate boxes and sending traffic between them. Results were as expected.

Both MACs are properly learnt on the switch as expected:

   dell-leaf1# show mac address-table Vlan 2004
   -----------------------------------------------------------
   VLAN         MAC-ADDRESS         TYPE         INTERFACE           
   -----------------------------------------------------------
   2004        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0           
   2004        40:A8:F0:2C:83:14   DYNAMIC       Ethernet2

ARP works and pings flow successfully via the switch:

   root@srv1:~# ip neigh show 10.192.64.30
   10.192.64.30 dev enp59s0f0.2004 lladdr 40:a8:f0:2c:83:14 STALE
   root@srv1:~# ip netns exec TESTNS ip neigh show 10.192.64.10
   10.192.64.10 dev enp59s0f1.2004 lladdr 40:a8:f0:2c:83:10 STALE
   root@srv1:~# ping 10.192.64.30
   PING 10.192.64.30 (10.192.64.30) 56(84) bytes of data.
   64 bytes from 10.192.64.30: icmp_seq=1 ttl=64 time=0.167 ms
   64 bytes from 10.192.64.30: icmp_seq=2 ttl=64 time=0.136 ms
   
   root@srv1:~# ip netns exec TESTNS ping -c 2 10.192.64.10
   PING 10.192.64.10 (10.192.64.10) 56(84) bytes of data.
   64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.138 ms
   64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.135 ms

TCPDUMP shows the traffic coming in externally with MACs as expected:

   root@srv1:~# tcpdump -e -i enp59s0f0.2004 -l -p -nn icmp
   tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
   listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
   13:12:24.539625 40:a8:f0:2c:83:14 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.64.30 > 10.192.64.10: ICMP echo request, id 6059, seq 1, length 64
   13:12:24.539655 40:a8:f0:2c:83:10 > 40:a8:f0:2c:83:14, ethertype IPv4 (0x0800), length 98: 10.192.64.10 > 10.192.64.30: ICMP echo reply, id 6059, seq 1, length 64
   13:12:25.545664 40:a8:f0:2c:83:14 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.64.30 > 10.192.64.10: ICMP echo request, id 6059, seq 2, length 64
   13:12:25.545680 40:a8:f0:2c:83:10 > 40:a8:f0:2c:83:14, ethertype IPv4 (0x0800), length 98: 10.192.64.10 > 10.192.64.30: ICMP echo reply, id 6059, seq 2, length 64

Remote Switch (VXLAN tunneled)

Two servers were connected, SRV1 to Leaf1 Ethernet0, and SRV2 to Leaf2 Eth1/1. These ports were configured as follows:

SRV1 was connected to Vlan2005 on Leaf1 with 802.1q trunk encap:

interface Ethernet0
 description test_srv1
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport trunk allowed Vlan 2004-2005
root@srv1:~# ip -br link show enp59s0f0.2005
enp59s0f0.2005@enp59s0f0 UP             40:a8:f0:2c:83:10 <BROADCAST,MULTICAST,UP,LOWER_UP>

The MAC is learnt locally on Leaf1 as expected:

dell-leaf1# show mac address-table interface Ethernet 0 | grep 2005
2005        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0

SRV2 was connected to Vlan2005 on Leaf2 as a regular access port:

interface Eth1/1
 description test_Srv2
 mtu 9100
 speed 10000
 fec none
 no shutdown
 switchport access Vlan 2005
root@srv2:~# ip -br link show dev enp59s0f0
enp59s0f0        UP             40:a8:f0:2c:31:68 <BROADCAST,MULTICAST,UP,LOWER_UP>

Again the MAC is properly learnt on Leaf2:

dell-leaf2# show mac address-table interface Eth1/1
-----------------------------------------------------------
VLAN         MAC-ADDRESS         TYPE         INTERFACE
-----------------------------------------------------------
2005        40:A8:F0:2C:31:68   DYNAMIC       Eth1/1


In terms of EVPN, we can see that Leaf1 receives both a plain MAC-only EVPN route for this address and one that includes the IP configured on SRV2. Both have a next-hop of Leaf2's Loopback1 interface (VTEP IP):

BGP routing table entry for 10.0.1.25:2005:[2]:[0]:[48]:[40:a8:f0:2c:31:68]
Paths: (2 available, best #2)
  Advertised to non peer-group peers:
  172.16.1.1 172.16.1.5
  Route [2]:[0]:[48]:[40:a8:f0:2c:31:68] VNI 102005
  65030 65033
    10.10.10.2 from 172.16.1.1 (10.0.1.13)
      Origin IGP, valid, external
      Extended Community: RT:65033:102005 ET:8
      Last update: Fri Apr  8 15:05:41 2022
  Route [2]:[0]:[48]:[40:a8:f0:2c:31:68] VNI 102005
  65030 65033
    10.10.10.2 from 172.16.1.5 (10.0.1.14)
      Origin IGP, valid, external, best (Older Path)
      Extended Community: RT:65033:102005 ET:8
      Last update: Tue Apr  5 14:58:20 2022
BGP routing table entry for 10.0.1.25:2005:[2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10]
Paths: (2 available, best #2)
  Advertised to non peer-group peers:
  172.16.1.1 172.16.1.5
  Route [2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10] VNI 102005/404000
  65030 65033
    10.10.10.2 from 172.16.1.1 (10.0.1.13)
      Origin IGP, valid, external
      Extended Community: RT:65033:102005 RT:65033:404000 ET:8 Rmac:3c:2c:30:4c:81:83
      Last update: Fri Apr  8 15:05:41 2022
  Route [2]:[0]:[48]:[40:a8:f0:2c:31:68]:[32]:[10.192.80.10] VNI 102005/404000
  65030 65033
    10.10.10.2 from 172.16.1.5 (10.0.1.14)
      Origin IGP, valid, external, best (Older Path)
      Extended Community: RT:65033:102005 RT:65033:404000 ET:8 Rmac:3c:2c:30:4c:81:83
      Last update: Tue Apr  5 14:58:20 2022

This is properly processed and the EVPN database lists it correctly:

dell-leaf1# show evpn mac vni 102005 mac 40:a8:f0:2c:31:68
MAC: 40:a8:f0:2c:31:68
 Remote VTEP: 10.10.10.2
 Local Seq: 0 Remote Seq: 0
 Kernel Add: Success, Add ReAttempt:0
 Neighbors:
    10.192.80.10 Active
    2620:0:861:cabf::10 Active
    fe80::42a8:f0ff:fe2c:3168 Active

When we look at the local Vlan forwarding table on Leaf1 we can see that this MAC has been added, with the remote VTEP listed against it:

dell-leaf1# show mac address-table dynamic Vlan 2005
-----------------------------------------------------------
VLAN         MAC-ADDRESS         TYPE         INTERFACE
-----------------------------------------------------------
2005        40:A8:F0:2C:31:68   DYNAMIC       VxLAN DIP: 10.10.10.2
2005        40:A8:F0:2C:83:10   DYNAMIC       Ethernet0

The reverse is true on Leaf2 for SRV1's MAC learnt on Leaf1. The EVPN details are omitted as they look and work much the same in reverse:

dell-leaf2# show mac address-table dynamic Vlan 2005
-----------------------------------------------------------
VLAN         MAC-ADDRESS         TYPE         INTERFACE   
-----------------------------------------------------------
2005        40:A8:F0:2C:31:68   DYNAMIC       Eth1/1  
2005        40:A8:F0:2C:83:10   DYNAMIC       VxLAN DIP: 10.10.10.1

With this in place we should be able to send unicast Ethernet frames between the two servers, which the switches will send over the IP underlay using VXLAN. We test this with a simple ping from SRV1 to SRV2:

root@srv1:~# ping -c 2 10.192.80.10
PING 10.192.80.10 (10.192.80.10) 56(84) bytes of data.
64 bytes from 10.192.80.10: icmp_seq=1 ttl=64 time=0.206 ms
64 bytes from 10.192.80.10: icmp_seq=2 ttl=64 time=0.179 ms
--- 10.192.80.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.179/0.192/0.206/0.019 ms

Running tcpdump on SRV2, we can see the MAC addresses are as expected:

root@srv2:~# tcpdump -e -i enp59s0f0 -l -p -nn icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:08:46.807595 40:a8:f0:2c:83:10 > 40:a8:f0:2c:31:68, ethertype IPv4 (0x0800), length 98: 10.192.80.187 > 10.192.80.10: ICMP echo request, id 23100, seq 1, length 64
09:08:46.807628 40:a8:f0:2c:31:68 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.80.10 > 10.192.80.187: ICMP echo reply, id 23100, seq 1, length 64
09:08:47.827376 40:a8:f0:2c:83:10 > 40:a8:f0:2c:31:68, ethertype IPv4 (0x0800), length 98: 10.192.80.187 > 10.192.80.10: ICMP echo request, id 23100, seq 2, length 64
09:08:47.827403 40:a8:f0:2c:31:68 > 40:a8:f0:2c:83:10, ethertype IPv4 (0x0800), length 98: 10.192.80.10 > 10.192.80.187: ICMP echo reply, id 23100, seq 2, length 64

Client to Client broadcast forwarding / ingress replication

Same Switch (Pure L2)

Config is the same as in 3.8.1 (Client to Client L2 Unicast forwarding, Same Switch).

ARP cache was deleted in the TESTNS namespace / on port enp59s0f1, before initiating a ping:

   root@srv1:~# ip netns exec TESTNS bash
   root@srv1:~# ip neigh del 10.192.64.10 dev enp59s0f1.2004
   root@srv1:~# ping 10.192.64.10
   PING 10.192.64.10 (10.192.64.10) 56(84) bytes of data.
   64 bytes from 10.192.64.10: icmp_seq=1 ttl=64 time=0.388 ms
   64 bytes from 10.192.64.10: icmp_seq=2 ttl=64 time=0.127 ms

tcpdump in the default netns shows the broadcast arriving on the port from the switch:

   root@srv1:~# tcpdump -i enp59s0f0.2004 -e -l -p -nn arp
   tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
   listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
   13:30:08.545177 40:a8:f0:2c:83:14 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.192.64.10 tell 10.192.64.30, length 46

Remote Switch (VXLAN tunneled)

Switch and server ports were configured the same as for the unicast test 3.8.2 (Remote Switch, VXLAN tunneled).

We do not have IP Multicast configured in the underlay network, so instead Ingress Replication is used. In this case the edge device receiving the BUM frame duplicates it and sends a copy to each remote switch that has originated an EVPN type 3 route to say it is participating in that VNI/Vlan.
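
The ingress replication behaviour described above can be modelled in a few lines. This is a conceptual sketch only, not how the switch implements it: the class and method names are invented for illustration, and the "copies" are just tuples rather than real VXLAN packets.

```python
from collections import defaultdict

class IngressReplicator:
    """Toy model of head-end (ingress) replication on one VTEP."""

    def __init__(self, local_vtep):
        self.local_vtep = local_vtep
        self.flood_list = defaultdict(set)   # VNI -> set of remote VTEP IPs

    def learn_type3(self, vni, originator_ip):
        """A remote switch announced, via a type 3 route, that it is in this VNI."""
        if originator_ip != self.local_vtep:
            self.flood_list[vni].add(originator_ip)

    def withdraw_type3(self, vni, originator_ip):
        """The type 3 route was withdrawn, so stop flooding to that VTEP."""
        self.flood_list[vni].discard(originator_ip)

    def replicate(self, vni, frame):
        """Return one (remote VTEP, VNI, frame) unicast copy per remote VTEP."""
        return [(vtep, vni, frame) for vtep in sorted(self.flood_list[vni])]

leaf1 = IngressReplicator("10.10.10.1")
leaf1.learn_type3(102005, "10.10.10.2")          # type 3 route from Leaf2
print(leaf1.replicate(102005, b"ARP who-has"))   # one copy, sent to 10.10.10.2
```

The cost of this approach is that the ingress switch sends one unicast copy per remote VTEP in the VNI, which is acceptable in a small fabric like this lab.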

In our lab Leaf1 and Leaf2 both have Vlan2005 configured, bound to VXLAN VNI 102005. We thus see EVPN type 3 routes on each that have originated on the remote side:

Leaf1 output showing route from Leaf2:

dell-leaf1# show bgp l2vpn evpn route detail type multicast | find 10.0.1.25
Route Distinguisher: 10.0.1.25:2005
BGP routing table entry for 10.0.1.25:2005:[3]:[0]:[32]:[10.10.10.2]
Paths: (2 available, best #2)
  Advertised to non peer-group peers:
  172.16.1.1 172.16.1.5
  65030 65033
    10.10.10.2 from 172.16.1.1 (10.0.1.13)
      Origin IGP, valid, external
      Extended Community: RT:65033:102005 ET:8
      Last update: Fri Apr  8 15:05:42 2022
      PMSI Tunnel Type: Ingress Replication, label: 102005
  65030 65033
    10.10.10.2 from 172.16.1.5 (10.0.1.14)
      Origin IGP, valid, external, best (Older Path)
      Extended Community: RT:65033:102005 ET:8
      Last update: Tue Apr  5 14:58:21 2022
      PMSI Tunnel Type: Ingress Replication, label: 102005

This is also visible in the below command, with the loopback1 IP of Leaf2 (10.10.10.2) shown as a remote VTEP with flooding enabled:

dell-leaf1# show evpn vni 102005
VNI: 102005
 Type: L2
 Tenant VRF: default
 Client State: Up
 VxLAN interface: vtep1-2005
 VxLAN ifIndex: 134
 Local VTEP IP: 10.10.10.1
 Mcast group: 0.0.0.0
 Remote VTEPs for this VNI:
  10.10.10.2 flood: HER
  Kernel Add: Success, Add ReAttempt:0
 Number of MACs (local and remote) known for this VNI: 2
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 6
 Advertise-gw-macip: No


Leaf2 output shows a similar route, but for Leaf1, and again the EVPN database shows the remote IP as being part of the VNI:

dell-leaf2# show bgp l2vpn evpn route detail type multicast | find 2005
Route Distinguisher: 10.0.1.24:2005
BGP routing table entry for 10.0.1.24:2005:[3]:[0]:[32]:[10.10.10.1]
Paths: (2 available, best #2)
  Advertised to non peer-group peers:
  172.16.1.9 172.16.1.13
  65030 65032
    10.10.10.1 from 172.16.1.9 (10.0.1.13)
      Origin IGP, valid, external
      Extended Community: RT:65032:102005 ET:8
      Last update: Fri Apr  8 15:05:42 2022
      PMSI Tunnel Type: Ingress Replication, label: 102005
  65030 65032
    10.10.10.1 from 172.16.1.13 (10.0.1.14)
      Origin IGP, valid, external, best (Older Path)
      Extended Community: RT:65032:102005 ET:8
      Last update: Tue Apr  5 14:58:21 2022
      PMSI Tunnel Type: Ingress Replication, label: 102005
dell-leaf2# show evpn vni 102005
VNI: 102005
 Type: L2
 Tenant VRF: Vrf_codfw
 Client State: Up
 VxLAN interface: vtep1-2005
 VxLAN ifIndex: 77
 Local VTEP IP: 10.10.10.2
 Mcast group: 0.0.0.0
 Remote VTEPs for this VNI:
  10.10.10.1 flood: HER
  Kernel Add: Success, Add ReAttempt:0
 Number of MACs (local and remote) known for this VNI: 3
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 10
 Advertise-gw-macip: No


This should ensure that all broadcast frames transmitted by SRV1 will be received by SRV2 (possibly with the exception of ARP due to ARP suppression).

To test we generate a random broadcast frame on SRV1, connected to Leaf1:

root@srv1:~# echo "test broadcast in vlan2005" | nc -q1 -u -s 10.192.80.187 -b 255.255.255.255 12345

A TCPdump shows this is received as expected on SRV2, connected to Leaf2:

root@srv2:~# tcpdump -X -e -i enp59s0f0 -l -p -nn src 10.192.80.187
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:35:31.670833 40:a8:f0:2c:83:10 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 69: 10.192.80.187.34163 > 255.255.255.255.12345: UDP, length 27
   0x0000:  4500 0037 a5d8 4000 4011 3963 0ac0 50bb  E..7..@.@.9c..P.
   0x0010:  ffff ffff 8573 3039 0023 68d1 7465 7374  .....s09.#h.test
   0x0020:  2062 726f 6164 6361 7374 2069 6e20 766c  .broadcast.in.vl
   0x0030:  616e 3230 3035 0a                        an2005.
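As a cross-check, the hex dump above can be decoded by hand or with a short script. The sketch below is a minimal decoder written for this page (the function name is hypothetical); it assumes the dump starts at the IPv4 header, as tcpdump -X prints it, and that the payload is UDP.

```python
import ipaddress
import struct

# Hex bytes copied from the tcpdump -X output on SRV2
HEX_DUMP = """
4500 0037 a5d8 4000 4011 3963 0ac0 50bb
ffff ffff 8573 3039 0023 68d1 7465 7374
2062 726f 6164 6361 7374 2069 6e20 766c
616e 3230 3035 0a
"""


def decode_ipv4_udp(hex_text):
    raw = bytes.fromhex("".join(hex_text.split()))
    ihl = (raw[0] & 0x0F) * 4          # IPv4 header length in bytes
    ttl, proto = raw[8], raw[9]
    if proto != 17:
        raise ValueError("not a UDP packet")
    sport, dport, udp_len = struct.unpack("!HHH", raw[ihl:ihl + 6])
    return {
        "src": str(ipaddress.IPv4Address(raw[12:16])),
        "dst": str(ipaddress.IPv4Address(raw[16:20])),
        "ttl": ttl,
        "sport": sport,
        "dport": dport,
        # UDP length covers the 8-byte UDP header plus the payload
        "payload": raw[ihl + 8:ihl + udp_len],
    }


pkt = decode_ipv4_udp(HEX_DUMP)
print(pkt)
```

The decoded source, destination, port and payload match what was sent with netcat on SRV1, so the frame crossed the fabric unmodified.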

We can conclude that ingress replication is thus working as it should for broadcast traffic.

Client to Client L2 multicast forwarding / ingress replication

Same Switch (Pure L2)

Configuration was the same as in test 3.8.1.

Multicast generated in TESTNS with Netcat:

   root@srv1:~# echo "test multicast in vlan2004" | nc -q1 -u -b 224.0.0.26 12345

This can be observed leaving the server in a TCPdump:

root@srv1:~# tcpdump -e -i enp59s0f1.2004 -l -p -nn
listening on enp59s0f1.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
17:02:40.891494 40:a8:f0:2c:83:14 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.64.129.38067 > 224.0.0.26.12345: UDP, length 27
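The destination MAC in the capture (01:00:5e:00:00:1a) follows from the standard IPv4 multicast mapping, where the low 23 bits of the group address are copied into the 01:00:5e prefix. A small sketch to confirm this for the group used here (the function name is our own):

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet destination MAC."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    # Keep the low 23 bits of the group and place them under 01:00:5e
    mac = 0x01005E000000 | (int(addr) & 0x7FFFFF)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


print(multicast_mac("224.0.0.26"))
```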

The frame is received on the other port having been flooded within the Vlan by the switch:

root@srv1:~# tcpdump -X -e -i enp59s0f0 -l -nn net 224.0.0.0/8
listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:02:40.891570 40:a8:f0:2c:83:14 > 01:00:5e:00:00:1a, ethertype 802.1Q (0x8100), length 73: vlan 2004, p 0, ethertype IPv4, 10.192.64.129.38067 > 224.0.0.26.12345: UDP, length 27
 0x0000:  4500 0037 6f76 4000 0111 dee4 0ac0 4081  E..7ov@.......@.
 0x0010:  e000 001a 94b3 3039 0023 73a4 7465 7374  ......09.#s.test
 0x0020:  206d 756c 7469 6361 7374 2069 6e20 766c  .multicast.in.vl
 0x0030:  616e 3230 3034 0a                        an2004.

Remote Switch (VXLAN tunneled)

Two servers are connected, one to each Leaf, the same as in test 3.8.2 / 3.9.2.

We generate a multicast frame on SRV1:

root@srv1:~# echo "test multicast in vlan2005" | nc -q1 -u -s 10.192.80.187 -b 224.0.0.26 12345 

This can be seen on the wire going out from SRV1 as expected:

root@srv1:~# tcpdump -e -i enp59s0f0.2005 -l -p -nn net 224.0.0.0/8
listening on enp59s0f0.2005, link-type EN10MB (Ethernet), capture size 262144 bytes
18:08:46.017656 40:a8:f0:2c:83:10 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.80.187.35480 > 224.0.0.26.12345: UDP, length 27

The frame is received on SRV2, connected to the same Vlan but off Leaf2, showing the multicasts have been sent over the VXLAN fabric as required:

root@srv2:~# tcpdump -X -i enp59s0f0 -l -nn -e net 224.0.0.0/8
listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:09:04.782380 40:a8:f0:2c:83:10 > 01:00:5e:00:00:1a, ethertype IPv4 (0x0800), length 69: 10.192.80.187.35480 > 224.0.0.26.12345: UDP, length 27
 0x0000:  4500 0037 9a63 4000 0111 a3bd 0ac0 50bb  E..7.c@.......P.
 0x0010:  e000 001a 8a98 3039 0023 6d84 7465 7374  ......09.#m.test
 0x0020:  206d 756c 7469 6361 7374 2069 6e20 766c  .multicast.in.vl
 0x0030:  616e 3230 3035 0a                        an2005.

Inter-Vlan/subnet routing via IRB interfaces on same switch

Leaf1 has Vlan interfaces for Vlan2004 and Vlan2005 configured:

interface Vlan2004
 description private1-e-codfw
 ip vrf forwarding Vrf_codfw
 ipv6 enable
 ip anycast-address 10.192.64.254/24
 ipv6 anycast-address 2620:0:861:11c::254/64
interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ipv6 enable
 ip anycast-address 10.192.80.254/22
 ipv6 anycast-address 2620:0:861:cabf::254/64

IPv4

root@srv1:~# mtr --address 10.192.64.129 -n -r 10.192.80.187
Start: 2022-05-05T18:42:03+0000
HOST: srv1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.192.64.254              0.0%    10    0.2   0.3   0.2   0.6   0.1
  2.|-- 10.192.80.187              0.0%    10    0.2   0.2   0.2   0.2   0.0

IPv6

root@srv1:~# mtr --address 2620:0:861:11c::129 -n -r 2620:0:861:cabf::187
Start: 2022-05-05T18:41:03+0000
HOST: srv1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:11c::254        0.0%    10    0.4   0.8   0.3   5.0   1.4
  2.|-- 2620:0:861:cabf::187       0.0%    10    0.2   0.3   0.2   1.1   0.3

Inter-Vlan/subnet routing via IRB interfaces on separate switches

IPv4

From test server 1 in Vlan2004 on Leaf1 we trace to test server 2 in Vlan2005 on Leaf2:

root@srv1:~# mtr --address 10.192.64.10 -n -r -c 1 10.192.80.10
Start: 2022-04-28T10:03:53+0000
HOST: srv1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.192.64.254              0.0%     1    0.4   0.4   0.4   0.4   0.0
  2.|-- 10.192.80.254              0.0%     1    0.4   0.4   0.4   0.4   0.0
  3.|-- 10.192.80.10               0.0%     1    0.3   0.3   0.3   0.3   0.0

IPv6

Routing works as expected, and we can ping between subnets across switches without issue:

root@srv1:~# ping -I 2620:0:861:11c::10 2620:0:861:cabf::10
PING 2620:0:861:cabf::10(2620:0:861:cabf::10) from 2620:0:861:11c::10 : 56 data bytes
64 bytes from 2620:0:861:cabf::10: icmp_seq=1 ttl=62 time=0.227 ms
64 bytes from 2620:0:861:cabf::10: icmp_seq=2 ttl=62 time=0.176 ms

One issue we note, however, is that when doing a traceroute the *remote* switch shows up with an IPv6 link-local address.

root@srv1:~# mtr --address 2620:0:861:11c::10 -n -r 2620:0:861:cabf::10
Start: 2022-04-28T10:14:45+0000
HOST: srv1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:11c::254        0.0%    10    0.4   0.4   0.3   0.4   0.0
  2.|-- fe80::3e2c:30ff:fe4c:8183  0.0%    10    0.3   0.8   0.3   4.9   1.4
  3.|-- 2620:0:861:cabf::10        0.0%    10    0.3   0.3   0.3   0.3   0.0

The equivalent at hop 2 in the IPv4 test was the address of the Vlan2005 interface on Leaf 2 (10.192.80.254). That interface is configured with a global unicast IPv6 address, and v6 is enabled, but it doesn't use it:

interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ipv6 enable
 ip anycast-address 10.192.80.254/22
 ipv6 anycast-address 2620:0:861:cabf::254/64

Looking a little closer it seems the address used to source the ICMPv6 error messages (Time Exceeded, in the case of a trace) is the one assigned to Vlan4000 on the switch (this is the Vlan created to bind to our VRF for VXLAN encap):

admin@dell-leaf2:~$ ip -br addr show  | grep 8183
Vlan4000@Bridge  UP             fe80::3e2c:30ff:fe4c:8183/64 
dell-leaf2# show running-configuration interface Vlan 4000
!
interface Vlan4000
 description "IRB VLAN"
 ip vrf forwarding Vrf_codfw
 ipv6 enable


We see the same behaviour when doing a trace which routes out via an external device connected to the Spine layer within the VRF:

root@srv1:~# mtr --address 2620:0:861:11c::10 -n -r 2001:67c:930:400::26 
Start: 2022-04-28T10:39:04+0000
HOST: srv1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:11c::254        0.0%    10    0.3   0.3   0.3   0.3   0.0
  2.|-- fe80::e29:efff:fee1:fb01   0.0%    10    0.5   0.4   0.4   0.5   0.0
  3.|-- 2001:67c:930:400::26       0.0%    10    5.5   6.1   2.0  11.1   2.4

In this case again the IP used by Spine1 (hop 2) is link-local. Looking at the device, this IP is seemingly used on both the external interface and the Vlan4000 device:

admin@dell-spine1:~$ ip -br addr show | grep "fee1:fb01"
Ethernet104      UP             172.16.1.25/30 2001:67c:930:400::25/64 fe80::e29:efff:fee1:fb01/64 
Vlan4000@Bridge  UP             fe80::e29:efff:fee1:fb01/64 

It would certainly be better if the system had used the global unicast address 2001:67c:930:400::25, or another unicast address on the device, to source ICMPv6 error messages. In contrast the Juniper QFX platform does use global unicast addressing to source ICMP errors, which makes troubleshooting easier as the switches can be identified in a trace:

cmooney@elastic1089:~$ sudo traceroute -I -w 1 -s 2620:0:861:109:10:64:130:7 2620:0:861:10d:10:64:134:2
traceroute to 2620:0:861:10d:10:64:134:2 (2620:0:861:10d:10:64:134:2), 30 hops max, 80 byte packets
 1  irb-1031.lsw1-e1-eqiad.eqiad.wmnet (2620:0:861:109::1)  0.610 ms  0.583 ms  0.576 ms
 2  irb-1035.lsw1-f1-eqiad.eqiad.wmnet (2620:0:861:10d::1)  0.515 ms  0.509 ms  0.504 ms
 3  dumpsdata1007.eqiad.wmnet (2620:0:861:10d:10:64:134:2)  0.170 ms * *
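In the meantime a link-local hop in a trace can still be tied back to a device by recovering the MAC address it was built from. The sketch below assumes the address was derived using modified EUI-64 (the ff:fe in the middle of the addresses seen on the Dell switches is consistent with that, but it is an assumption); the function name is our own.

```python
import ipaddress


def mac_from_eui64_link_local(address):
    """Recover the MAC behind a modified EUI-64 link-local IPv6 address."""
    addr = ipaddress.IPv6Address(address)
    if not addr.is_link_local:
        raise ValueError(f"{address} is not link-local")
    iid = addr.packed[8:]              # 64-bit interface identifier
    if iid[3:5] != b"\xff\xfe":
        raise ValueError(f"{address} is not EUI-64 derived")
    # Flip the universal/local bit back and drop the inserted ff:fe
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{octet:02x}" for octet in mac)


print(mac_from_eui64_link_local("fe80::3e2c:30ff:fe4c:8183"))  # hop 2 via Leaf2
print(mac_from_eui64_link_local("fe80::e29:efff:fee1:fb01"))   # hop 2 via Spine1
```

The resulting MAC can then be matched against the interface MACs of the switches, which is more work than reading a hostname from a trace but does identify the device.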

IPv6 RA Generation from Vlan Interfaces

NOT WORKING

Vlan configuration as follows on Leaf2:

interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ipv6 enable
 ip anycast-address 10.192.80.254/22
 ipv6 anycast-address 2620:0:861:cabf::254/64

However, no IPv6 RAs are received on SRV2, which is connected to an access port in this Vlan:

root@srv2:~# tcpdump -i enp59s0f0 -l -nn icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp59s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

BGP Peering on Vlan segment to end device

IPv4 only

Srv1 is reachable in Vlan2004 on IP 10.192.64.10

dell-leaf1# show ip arp vrf Vrf_codfw 
Type: R - Remote Neighbor entries (EVPN)
-------------------------------------------------------------------------------------------
Address             Hardware address    Interface            Egress Interface    Type
-------------------------------------------------------------------------------------------
10.192.64.10        40:a8:f0:2c:83:10   Vlan2004             Ethernet0           Dynamic  

We configure a BGP session to it within the VRF as follows:

router bgp 65032 vrf Vrf_codfw
 router-id 10.0.1.24
 log-neighbor-changes
 bestpath as-path multipath-relax
 !
 address-family ipv4 unicast
  redistribute connected 
  maximum-paths 16
 !
 neighbor 10.192.64.10
  remote-as 64600
  timers 5 20
  advertisement-interval 0
  bfd
  local-as 14907 no-prepend replace-as
  !
  address-family ipv4 unicast
   activate
   route-map NO_HOST_ROUTES out
   remove-private-as
  !

The "NO_HOST_ROUTES" filter is designed to prevent /32 host routes from being sent externally. These end up in the VRF BGP RIB due to the import of EVPN type 2 MAC/IP routes to the local table. Instead we only want to send the routes that have originated locally or as type 5 EVPN routes, i.e. the subnets allocated to our Vlans and not every host IP from the Vlan.

ip prefix-list NO_HOST_ROUTES seq 5 permit 0.0.0.0/0 le 29
route-map NO_HOST_ROUTES permit 100
 match ip address prefix-list NO_HOST_ROUTES
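To illustrate what this prefix-list lets through, the matching rule can be modelled with Python's ipaddress module. This is a sketch of the usual "le" semantics (the prefix must fall within the base network and have a length between the base length and the "le" value), not the switch's implementation, and the function name is hypothetical. Anything not permitted is dropped by the implicit deny at the end of the route-map.

```python
import ipaddress


def prefix_list_permits(candidate, base="0.0.0.0/0", le=29):
    """Model 'permit <base> le <le>' for a single candidate prefix."""
    net = ipaddress.ip_network(candidate)
    base_net = ipaddress.ip_network(base)
    in_range = base_net.prefixlen <= net.prefixlen <= le
    return in_range and net.subnet_of(base_net)


for prefix in ("10.192.80.0/22", "10.192.64.10/32", "172.16.1.24/30"):
    print(prefix, prefix_list_permits(prefix))
```

Note that with "le 29" any /30 and /31 prefixes in the VRF are filtered as well as the /32 host routes.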

Srv1 is set up for BGP (running FRR in this case), configured to peer with the IP of Leaf1 on Vlan2004:

router bgp 64600
 neighbor 10.192.64.254 remote-as 14907
 neighbor 10.192.64.254 bfd
 !
 address-family ipv4 unicast
 exit-address-family

With this in place the session establishes:

srv1# show bgp ipv4 unicast summary 
BGP router identifier 1.1.1.1, local AS number 64600 vrf-id 0
BGP table version 15
RIB entries 5, using 920 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.192.64.254   4      14907       482       396        0    0    0 00:21:21            2        1 N/A

Total number of neighbors 1

The prefix for Vlan2004 is originated locally by Leaf1 (due to 'redistribute connected'), while the prefix for Vlan2005 is already in the BGP RIB on Leaf1, having been imported from EVPN. Both are learnt on the server side:

srv1# show bgp ipv4 unicast neighbors 10.192.64.254 routes 
BGP table version is 24, local router ID is 1.1.1.1, vrf id 0
Default local pref 100, local AS 64600
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.192.64.0/22   10.192.64.254            0             0 14907 ?
*> 10.192.80.0/22   10.192.64.254                          0 14907 ?

The as-path just shows the "fake" AS configured in the 'local-as' command, as desired. The Spine and Leaf2 AS numbers would appear for 10.192.80.0/22 but we have configured the sessions to "remove-private-as" to validate that works as expected.
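The effect of the two options on the advertised path can be modelled roughly as below. This is a simplified sketch, not the switch's logic: the function is hypothetical, it assumes private ASNs are only stripped when the whole path is private (which is the case here), and it does not model 'no-prepend', which relates to routes received from the peer.

```python
# 16-bit and 32-bit private-use ASN ranges (RFC 6996)
PRIVATE_ASN_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private(asn):
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def advertised_as_path(rib_path, real_as, local_as, replace_as, remove_private):
    """AS path sent to the peer for a route whose RIB path is rib_path."""
    path = list(rib_path)
    # Assumption: private ASNs are stripped only if the whole path is private.
    if remove_private and all(is_private(asn) for asn in path):
        path = []
    # 'replace-as' hides the real AS; otherwise both are prepended.
    head = [local_as] if replace_as else [local_as, real_as]
    return head + path


# Vlan2005 prefix as held on Leaf1: learnt via the Spine (65030) from Leaf2 (65033)
print(advertised_as_path([65030, 65033], 65032, 14907, True, True))
# Same route with neither option applied
print(advertised_as_path([65030, 65033], 65032, 14907, False, False))
```

Under this model the first call gives just 14907, matching what srv1 receives, while the second gives 14907 65032 65030 65033.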

IPv4 carrying IPv4 & IPv6 address families

Configuration on the switch was the same as the last test, but the IPv6 unicast address family was also activated for the server peer:

router bgp 65032 vrf Vrf_codfw
 neighbor 10.192.64.10
  remote-as 64600
  !
  address-family ipv6 unicast
   activate
   remove-private-as
!

With the same done server-side the BGP session restarted, and came back up announcing IPv6 addresses also:

dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw summary
BGP router identifier 10.0.1.24, local AS number 65032 
Neighbor         V   AS      MsgRcvd   MsgSent   InQ     OutQ    Up/Down         State/PfxRcd   
10.192.64.10     4   64600   144556    173488    0       0       00:06:06        1              
 
Total number of neighbors 1
Total number of neighbors established 1

Address was received as expected:

dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes 
BGP routing table information for VRF default
Router identifier 10.0.1.24, local AS number 65032 
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   2620:0:861:babe::1/1282620:0:861:11c::10  0                              64600 ?
Total number of prefixes 1

And placed into the local VRF routing table as required:

dell-leaf1# show ipv6 route vrf Vrf_codfw 2620:0:861:babe::1
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime      
--------------------------------------------------------------------------------------------------------------------
 B>*   2620:0:861:babe::1/128       via fe80::42a8:f0ff:fe2c:8310 Vlan2004                  20/0          00:05:50

IPv4 & IPv6 each carrying their own address family

BGP Peering on Vlan segment from Anycast GW IP

IPv4 only

IPv4 carrying IPv4 & IPv6 address families

IPv4 & IPv6 each carrying their own address family

BGP Peering on Vlan segment with Anycast GW, from Unicast IP

NOT WORKING

It does not seem possible to configure an additional IP address on a Vlan interface configured with an Anycast GW.

For instance with this config on Vlan2005 on Leaf2:

interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ip anycast-address 10.192.80.254/22
 ipv6 anycast-address 2620:0:861:cabf::254/64

We cannot add an additional (unique to this switch) IP address in addition to the anycast IP (used on all switches):

dell-leaf2# configure terminal
dell-leaf2(config)# interface Vlan2005
dell-leaf2(conf-if-Vlan2005)# ip address 10.192.80.253/22 secondary
%Error: Primary IPv4 address is not configured for interface: Vlan2005
dell-leaf2(conf-if-Vlan2005)# ip address 10.192.80.253/22
%Error: IP overlap on same interface with IP or IP Anycast 10.192.80.254/22

This does not make much of a difference to us generally. If needed we can likely create loopback interfaces in the overlay VRF on each switch, with an IP unique to that device. The only scenario where a unique IP on the Vlan itself is probably needed is if we have an MC-LAG configured and need to peer to both switches over the bonded link, which we are unlikely to require at present.

BGP Re-establishment if device peered to Anycast GW moved to new switch (i.e. VM live motion)

eBGP Peering in VRF to external device

The VRF was enabled on the spine switches, each of which was connected to a Juniper device to test eBGP routing to an external element from the VRF.

Interface config on Spine1, connected to the Juniper switch was as follows:

dell-spine1# show running-configuration interface Eth1/27
!
interface Eth1/27
 description link_to_lsw3-et-0/0/50
 mtu 9100
 speed 40000
 fec none
 no shutdown
 ip vrf forwarding Vrf_codfw
 ip address 172.16.1.25/30
 ipv6 address 2001:67c:930:400::25/64
 ipv6 enable

Two BGP sessions were configured, one over IPv4 for the IPv4 address family and one over IPv6 for the IPv6 address family:

router bgp 65030 vrf Vrf_codfw
 router-id 10.0.1.13
 !
 address-family ipv4 unicast
  maximum-paths 8
  maximum-paths ibgp 8
 !
 address-family ipv6 unicast
  maximum-paths 8
  maximum-paths ibgp 8
 !
 neighbor 172.16.1.26
  remote-as 65034
  !
  address-family ipv4 unicast
   activate
 !
 neighbor 2001:67c:930:400::26
  remote-as 65034
  !
  address-family ipv4 unicast
  !
  address-family ipv6 unicast
   activate

BGP established as expected for each address family:

dell-spine1# show bgp ipv4 unicast vrf Vrf_codfw summary
BGP router identifier 10.0.1.13, local AS number 65030
Neighbor        V   AS      MsgRcvd   MsgSent   InQ     OutQ    Up/Down         State/PfxRcd
172.16.1.26     4   65034   24829     22586     0       0       01w0d20h        1  
dell-spine1# show bgp ipv6 unicast vrf Vrf_codfw summary
BGP router identifier 10.0.1.13, local AS number 65030
Neighbor                 V   AS      MsgRcvd   MsgSent   InQ     OutQ    Up/Down         State/PfxRcd
2001:67c:930:400::26     4   65034   167       150       0       0       01:04:40        1

Routes are accepted:

dell-spine1# show bgp ipv4 unicast vrf Vrf_codfw neighbors 172.16.1.26 routes
BGP routing table information for VRF default
Router identifier 10.0.1.13, local AS number 65030
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   0.0.0.0/0           172.16.1.26                                        65034 i
Total number of prefixes 1
dell-spine1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 2001:67c:930:400::26 routes
BGP routing table information for VRF default
Router identifier 10.0.1.13, local AS number 65030
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   ::/0                2001:67c:930:400::26                               65034 i
Total number of prefixes 1

And make their way into the local FIB:

dell-spine1# show ip route vrf Vrf_codfw 0.0.0.0/0
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime
--------------------------------------------------------------------------------------------------------------------
 B>*   0.0.0.0/0                    via 172.16.1.26               Eth1/27                   20/0          01w0d20h


dell-spine1# show ipv6 route vrf Vrf_codfw ::/0
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime
--------------------------------------------------------------------------------------------------------------------
 B>*   ::/0                         via fe80::2e21:31ff:fefa:f1bb Eth1/27                   20/0          01:04:58

BFD Support on Peering to External Device (VRF)

BFD was enabled for the peering to srv1 in Vlan2004 as follows on the switch:

router bgp 65032 vrf Vrf_codfw
 neighbor 10.192.64.10
  remote-as 64600
  bfd

Once it was configured on the server as well the session came up:

dell-leaf1# show bfd peers vrf Vrf_codfw
BFD Peers:
     
    peer 10.192.64.10 vrf Vrf_codfw interface Vlan2004
        ID: 624049837
        Remote ID: 3910201229
        Status: up
        Uptime: 0 day(s), 0 hour(s), 0 min(s), 46 sec(s)
        Diagnostics: ok
        Remote diagnostics: ok
        Peer Type: dynamic
        Local timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 0ms
        Remote timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 50ms
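From the timers shown we can work out how quickly a failure of the server should be detected. In asynchronous mode the local detection time is the remote detect multiplier times the greater of the local receive interval and the remote transmit interval (RFC 5880). A minimal sketch using the values from the output above (the function name is our own):

```python
def bfd_detection_time_ms(remote_detect_mult, local_rx_ms, remote_tx_ms):
    """Asynchronous-mode BFD detection time as seen by the local system."""
    # The agreed interval is the slower of what we accept and what the peer sends
    return remote_detect_mult * max(local_rx_ms, remote_tx_ms)


print(bfd_detection_time_ms(3, 300, 300))
```

So with these defaults the BGP session should be torn down within roughly 900ms of losing the peer, well inside the 20 second BGP hold time configured with 'timers 5 20'.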

BFD Support on BGP peering in global table

BFD was set up in the underlay network / global table on the BGP peerings between Spine1 and the two Leaf devices. This simply requires the keyword 'bfd' to be added under the BGP neighbor definition:

router bgp 65030
 neighbor 172.16.1.10
  remote-as 65033
  bfd

Once configured on both sides the sessions come up:

dell-spine1# show bfd peers brief 
Session Count: 2
SessionId  LocalAddress                              PeerAddress                               Status         Vrf     
=========  ============                              ===========                               ======         ===     
2747556315 172.16.1.9                                172.16.1.10                               UP             default 
1863681777 172.16.1.1                                172.16.1.2                                UP             default 
dell-spine1# show bfd peers 
BFD Peers:
     
   
    peer 172.16.1.10 vrf default interface Eth1/31
        ID: 2747556315
        Remote ID: 1404963392
        Status: up
        Uptime: 0 day(s), 0 hour(s), 1 min(s), 27 sec(s)
        Diagnostics: ok
        Remote diagnostics: ok
        Peer Type: dynamic
        Local timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 0ms
        Remote timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 50ms
     
   
    peer 172.16.1.2 vrf default interface Eth1/32
        ID: 1863681777
        Remote ID: 2448635902
        Status: up
        Uptime: 0 day(s), 0 hour(s), 5 min(s), 53 sec(s)
        Diagnostics: ok
        Remote diagnostics: ok
        Peer Type: dynamic
        Local timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 0ms
        Remote timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: 50ms

BGP Route propagation from unicast peer into EVPN and into remote VRF table

IPv4

We learn 1.1.1.1/32 from srv1 on Leaf1:

dell-leaf1# show bgp ipv4 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes 
BGP routing table information for VRF default
Router identifier 10.0.1.24, local AS number 65032 
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   1.1.1.1/32          10.192.64.10        0                              64600 ?
Total number of prefixes 1

We can see this being learnt on Leaf 2 as an EVPN type-5 route, next-hop the Leaf1 VTEP IP:

dell-leaf2# show bgp l2vpn evpn route detail | find 1.1.1.1
BGP routing table entry for 10.0.1.24:5096:[5]:[0]:[32]:[1.1.1.1]
Paths: (2 available, best #1)
  Advertised to non peer-group peers:
  172.16.1.9 172.16.1.13
  Route [5]:[0]:[32]:[1.1.1.1] VNI 404000
  65030 65032 64600
    10.10.10.1 from 172.16.1.9 (10.0.1.13)
      Origin incomplete, valid, external, best (Router ID)
      Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
      Last update: Fri Apr 15 15:14:11 2022
  Route [5]:[0]:[32]:[1.1.1.1] VNI 404000
  65030 65032 64600
    10.10.10.1 from 172.16.1.13 (10.0.1.14)
      Origin incomplete, valid, external
      Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
      Last update: Fri Apr 15 15:14:11 2022

This is correctly added to the local VRF routing table on Leaf 2:

dell-leaf2# show ip route vrf Vrf_codfw 1.1.1.1
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime      
--------------------------------------------------------------------------------------------------------------------
 B>*   1.1.1.1/32                   via 10.10.10.1                Vlan4000                  20/0          00:26:49   

We can send packets from srv2 (connected to Leaf 2 on Vlan2005) and get a response from 1.1.1.1 on srv1 (connected to Leaf1):

ppaul@srv2:~$ mtr -n -r 1.1.1.1 -c 3
Start: 2022-04-15T15:30:42+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.192.80.254              0.0%     3    0.2   0.3   0.2   0.3   0.0
  2.|-- 10.192.64.254              0.0%     3    0.3   0.3   0.3   0.3   0.0
  3.|-- 1.1.1.1                    0.0%     3    0.3   0.3   0.3   0.3   0.0

IPv6

We learn 2620:0:861:babe::1/128 via BGP from srv1 connected over Vlan2004 on Leaf1:

dell-leaf1# show bgp ipv6 unicast vrf Vrf_codfw neighbors 10.192.64.10 routes
BGP routing table information for VRF default
Router identifier 10.0.1.24, local AS number 65032 
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   2620:0:861:babe::1/1282620:0:861:11c::10  0                              64600 ?
Total number of prefixes 1

This is received on Leaf 2 as an EVPN type-5 with next-hop of Leaf1's VTEP IP:

dell-leaf2# show bgp l2vpn evpn route detail | find babe
BGP routing table entry for 10.0.1.24:5096:[5]:[0]:[128]:[2620:0:861:babe::1]
Paths: (2 available, best #1)
  Advertised to non peer-group peers:
  172.16.1.9 172.16.1.13
  Route [5]:[0]:[128]:[2620:0:861:babe::1] VNI 404000
  65030 65032 64600
    10.10.10.1 from 172.16.1.9 (10.0.1.13)
      Origin incomplete, valid, external, best (Router ID)
      Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
      Last update: Fri Apr 15 15:28:43 2022
  Route [5]:[0]:[128]:[2620:0:861:babe::1] VNI 404000
  65030 65032 64600
    10.10.10.1 from 172.16.1.13 (10.0.1.14)
      Origin incomplete, valid, external
      Extended Community: RT:65032:404000 ET:8 Rmac:3c:2c:30:4b:09:03
      Last update: Fri Apr 15 15:28:43 2022

It's added to the local VRF routing table on Leaf 2:

dell-leaf2# show ipv6 route vrf Vrf_codfw 2620:0:861:babe::1
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime      
--------------------------------------------------------------------------------------------------------------------
 B>*   2620:0:861:babe::1/128       via ::ffff:10.10.10.1         Vlan4000                  20/0          00:11:10  


And we can send traffic to it from srv2 (connected to Leaf2), similar to the v4 test:

root@srv2:~# mtr -n -r -c 3 2620:0:861:babe::1
Start: 2022-04-15T15:37:51+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:cabf::254       0.0%     3    0.4   0.4   0.4   0.4   0.0
  2.|-- fe80::3e2c:30ff:fe4b:903   0.0%     3    0.3   0.3   0.3   0.4   0.0
  3.|-- 2620:0:861:babe::1         0.0%     3    0.3   0.3   0.3   0.3   0.0
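One detail worth noting in the IPv6 route on Leaf 2 is the gateway, ::ffff:10.10.10.1. This is an IPv4-mapped IPv6 address, i.e. the IPv4 VTEP address of Leaf1 written in IPv6 form, since the overlay IPv6 route resolves via an IPv4 underlay next-hop. Python's ipaddress module can be used to confirm which VTEP it refers to (the helper name is our own):

```python
import ipaddress


def underlying_vtep(gateway):
    """Return the IPv4 address behind an IPv4-mapped gateway, else None."""
    return ipaddress.IPv6Address(gateway).ipv4_mapped


print(underlying_vtep("::ffff:10.10.10.1"))
print(underlying_vtep("2620:0:861:babe::1"))
```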

BGP Route propagation to external hosts from VRF

IPv4

BGP peering in vrf Vrf_codfw to srv1 was set up the same way as in test 3.15.

On the server we can see that we learn the routes for Vlan2004 (originated by leaf1) and Vlan2005 (originated by leaf2 and propagated as an EVPN type 5 to leaf 1):

srv1# show bgp ipv4 unicast neighbors 10.192.64.254 routes 
BGP table version is 15, local router ID is 1.1.1.1, vrf id 0
Default local pref 100, local AS 64600
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.192.64.0/22   10.192.64.254            0             0 14907 65032 ?
*> 10.192.80.0/22   10.192.64.254                          0 14907 65032 65030 65033 ?


IPv6

With IPv6 peering to srv1 we see the same thing: we learn the /64 subnets attached to Vlan2004 (local on BGP peer Leaf 1) and also Vlan2005 (on Leaf 2, and learnt by Leaf 1 over EVPN):

srv1# show bgp ipv6 unicast neighbors 10.192.64.254 routes 
BGP table version is 10, local router ID is 1.1.1.1, vrf id 0
Default local pref 100, local AS 64600
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*> 2620:0:861:11c::/64
                    fe80::200:ff:fe10:1010
                                             0             0 14907 ?
*> 2620:0:861:cabf::/64
                    fe80::200:ff:fe10:1010
                                                           0 14907 65030 65033 ?

ECMP Routing in overlay network

In modern Leaf/Spine fabrics ECMP is a key requirement to make use of multiple redundant links. To test that this works correctly for routes learnt in EVPN type 5 announcements we augmented the configuration described in section 3.19 "eBGP Peering in VRF to external device".

In that section an external device was connected to Spine1 and BGP peering established within the VRF. For this test we connected that same device to Spine2 also, and configured BGP peering from Spine2 for it, announcing the same default routes. Interface and BGP configuration was similar, and BGP established as in that test:

dell-spine2# show bgp ipv4 unicast vrf Vrf_codfw summary
BGP router identifier 10.0.1.14, local AS number 65030 
Neighbor        V   AS      MsgRcvd   MsgSent   InQ     OutQ    Up/Down         State/PfxRcd   
172.16.1.30     4   65034   24762     22530     0       0       00:00:46        1  

dell-spine2# show bgp ipv4 unicast vrf Vrf_codfw neighbors 172.16.1.30 routes 
BGP routing table information for VRF default
Router identifier 10.0.1.14, local AS number 65030 
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   0.0.0.0/0           172.16.1.30                                        65034 i


dell-spine2# show bgp ipv6 unicast vrf Vrf_codfw summary
BGP router identifier 10.0.1.14, local AS number 65030 
Neighbor                 V   AS      MsgRcvd   MsgSent   InQ     OutQ    Up/Down         State/PfxRcd   
2001:67c:930:401::30     4   65034   24        37        0       0       00:00:23        1              
 
Total number of neighbors 1
Total number of neighbors established 1

dell-spine2# show bgp ipv6 unicast vrf Vrf_codfw neighbors 2001:67c:930:401::30 routes 
BGP routing table information for VRF default
Router identifier 10.0.1.14, local AS number 65030 
Route status codes: * - valid, > - best
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop            Metric      LocPref     Weight Path
*>   ::/0                2001:67c:930:401::30                               65034 i


With that set up, Spine1 and Spine2 are each learning a default route in each address family from an external peer. These should then be announced to the Leaf devices as separate EVPN type 5 prefixes, with next-hop set to the VTEP IP of Spine1 and Spine2 respectively. Ultimately the Leaf devices should then ECMP traffic to both spines based on these two routes.

Looking on Leaf2 we can see this is the case. We have two IPv4 and two IPv6 default routes, one of each from each spine:

dell-leaf2# show bgp l2vpn evpn route type prefix
BGP table version is 2, local router ID is 10.0.1.25
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
   Network          Next Hop            Metric LocPrf Weight Path
                    Extended Community
Route Distinguisher: 10.0.1.13:5096
*>   [5]:[0]:[0]:[0.0.0.0]
                    10.10.10.3                                     0 65030 65034 i
                    RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fb:01
*>   [5]:[0]:[0]:[::] 10.10.10.3                                     0 65030 65034 i
                    RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fb:01
Route Distinguisher: 10.0.1.14:5096
*>   [5]:[0]:[0]:[0.0.0.0]
                    10.10.10.4                                     0 65030 65034 i
                    RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fc:81
*>   [5]:[0]:[0]:[::] 10.10.10.4                                     0 65030 65034 i
                    RT:65030:404000 ET:8 Rmac:0c:29:ef:e1:fc:81


IPv4

Sticking with Leaf2, if we look at the local VRF routing table it does indeed have both defaults, and both are in the FIB, indicating it should ECMP across the two next-hops:

dell-leaf2# show ip route vrf Vrf_codfw 0.0.0.0/0
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime
--------------------------------------------------------------------------------------------------------------------
 B>*   0.0.0.0/0                    via 10.10.10.3                Vlan4000                  20/0          00:09:44
   *                                via 10.10.10.4                Vlan4000
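When repeating this check across several leaves it can be handy to extract the FIB next-hops from the command output programmatically. The following is a minimal sketch, assuming the output layout shown above (a flag column containing `*` for FIB routes, followed by `via <address>`); the function name `fib_next_hops` is made up for illustration and is not part of any SONiC tooling.

```python
import re


def fib_next_hops(route_output: str) -> list:
    """Return the next-hop addresses of all lines flagged '*' (FIB route)."""
    hops = []
    for line in route_output.splitlines():
        # The flag column (e.g. "B>*" or just "*") precedes "via <address>".
        match = re.match(r"^\s*\S*\*\s.*?\bvia\s+(\S+)", line)
        if match:
            hops.append(match.group(1))
    return hops


# Output of "show ip route vrf Vrf_codfw 0.0.0.0/0" as captured on Leaf2.
SAMPLE = """\
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
 B>*   0.0.0.0/0                    via 10.10.10.3                Vlan4000                  20/0          00:09:44
   *                                via 10.10.10.4                Vlan4000
"""

hops = fib_next_hops(SAMPLE)
print(f"{len(hops)} FIB next-hops: {hops}")
```

More than one next-hop in the result indicates the prefix is installed as an ECMP route; the same function should work on the IPv6 output, where the next-hops appear as IPv4-mapped addresses.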

If we move to SRV2, which is connected on an access port to Leaf2, we can see the different link addresses of the connections from Spine1 or Spine2 in a traceroute if we change the source address (and thus the ECMP hash):

root@srv2:~# mtr --address 10.192.80.10 -n -r -c 3 10.0.1.26
Start: 2022-04-28T12:15:45+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.192.80.254              0.0%     3    0.3   0.3   0.3   0.3   0.0
  2.|-- 172.16.1.29                0.0%     3    0.3   0.4   0.3   0.4   0.0
  3.|-- 10.0.1.26                  0.0%     3   21.2  18.8  13.7  21.5   4.4

^ Hop 2 is Spine2 in this case.

root@srv2:~# mtr --address 10.192.80.21 -n -r -c 3 10.0.1.26
Start: 2022-04-28T12:16:01+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.192.80.254              0.0%     3    0.3   0.3   0.3   0.3   0.0
  2.|-- 172.16.1.25                0.0%     3    0.3   0.3   0.3   0.4   0.0
  3.|-- 10.0.1.26                  0.0%     3   71.6  32.4  11.4  71.6  33.9


^ Hop 2 is Spine1 in this case.

IPv6

Again, if we look on Leaf2 the two EVPN routes have been added to the local VRF table, and it indicates both are being used:

dell-leaf2# show ipv6 route vrf Vrf_codfw ::/0
Codes:  K - kernel route, C - connected, S - static, B - BGP, O - OSPF
        > - selected route, * - FIB route, q - queued route, r - rejected route, # - not installed in hardware
       Destination                  Gateway                                                 Dist/Metric   Uptime      
--------------------------------------------------------------------------------------------------------------------
 B>*   ::/0                         via ::ffff:10.10.10.3         Vlan4000                  20/0          00:15:08    
   *                                via ::ffff:10.10.10.4         Vlan4000                             

Doing multiple traceroutes, changing the source IP address each time, we can see different IPs at hop 2, depending on whether Leaf2 sends the packets to Spine1 or Spine2:

Source: 2620:0:861:cabf::66
Start: 2022-04-28T12:34:11+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.3   0.5   0.3   0.6   0.2
  2.|-- fe80::e29:efff:fee1:fb01   0.0%     2    0.4   0.4   0.4   0.5   0.1
  3.|-- 2606:4700:4700::1111       0.0%     2    4.2   5.9   4.2   7.7   2.4
Source: 2620:0:861:cabf::67
Start: 2022-04-28T12:34:17+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.3   0.5   0.3   0.6   0.2
  2.|-- fe80::e29:efff:fee1:fb01   0.0%     2    0.4   0.4   0.4   0.4   0.1
  3.|-- 2606:4700:4700::1111       0.0%     2   25.1  14.9   4.7  25.1  14.4
Source: 2620:0:861:cabf::68
Start: 2022-04-28T12:34:24+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.3   0.4   0.3   0.6   0.2
  2.|-- fe80::e29:efff:fee1:fc81   0.0%     2    0.4   0.4   0.4   0.4   0.0
  3.|-- 2606:4700:4700::1111       0.0%     2    4.3   4.6   4.3   4.9   0.4
Source: 2620:0:861:cabf::69
Start: 2022-04-28T12:34:30+0000
HOST: srv2                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.4   0.5   0.4   0.7   0.2
  2.|-- fe80::e29:efff:fee1:fc81   0.0%     2    0.4   0.4   0.4   0.5   0.1
  3.|-- 2606:4700:4700::1111       0.0%     2    9.4   8.7   8.0   9.4   1.0
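The link-local addresses at hop 2 can be tied back to the spines: they are the EUI-64 link-local forms of the router MACs (Rmac) seen in the EVPN type 5 routes earlier. The sketch below groups source addresses by their hop 2 and derives the link-local address from a MAC. It assumes the report format shown above, including the "Source:" marker lines; the function names are illustrative only.

```python
import ipaddress
import re
from collections import defaultdict


def second_hops_by_source(report: str) -> dict:
    """Map each hop-2 address to the set of source addresses that used it."""
    by_hop = defaultdict(set)
    source = None
    for line in report.splitlines():
        if line.startswith("Source:"):
            source = line.split(":", 1)[1].strip()
            continue
        hop = re.match(r"\s*(\d+)\.\|--\s+(\S+)", line)
        if hop and source is not None and hop.group(1) == "2":
            by_hop[hop.group(2)].add(source)
    return dict(by_hop)


def mac_to_link_local(mac: str) -> str:
    """Build the EUI-64 based fe80::/64 address for a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return str(ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id))


# Two of the four runs from the capture above, trimmed to the first two hops.
REPORT = """\
Source: 2620:0:861:cabf::66
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.3   0.5   0.3   0.6   0.2
  2.|-- fe80::e29:efff:fee1:fb01   0.0%     2    0.4   0.4   0.4   0.5   0.1
Source: 2620:0:861:cabf::68
  1.|-- 2620:0:861:cabf::254       0.0%     2    0.3   0.4   0.3   0.6   0.2
  2.|-- fe80::e29:efff:fee1:fc81   0.0%     2    0.4   0.4   0.4   0.4   0.0
"""

for hop, sources in sorted(second_hops_by_source(REPORT).items()):
    print(hop, sorted(sources))

# Rmac values taken from the EVPN type 5 routes (next-hops 10.10.10.3 and 10.10.10.4).
print(mac_to_link_local("0c:29:ef:e1:fb:01"))
print(mac_to_link_local("0c:29:ef:e1:fc:81"))
```

Seeing both derived addresses among the hop-2 results confirms that traffic is being spread over both spines.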

ARP Suppression

ND Suppression

DHCP Relay & Option 82 insertion

DHCP Relay

DHCP relay can be configured on a Vlan interface, for instance like this on Vlan 2005 on Leaf2:

interface Vlan2005
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ip dhcp-relay 10.192.64.10 vrf Vrf_codfw
 ip anycast-address 10.192.80.254/22

In the above case the 10.192.64.10 IP is configured on SRV1, which is connected to Vlan 2004 on Leaf1.

On SRV2, which is connected via an access port to Leaf2 on Vlan2005, we can then issue a DHCP request, which completes as desired:

root@srv2:~# dhclient -v enp59s0f0
Internet Systems Consortium DHCP Client 4.3.5
Copyright 2004-2016 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on LPF/enp59s0f0/40:a8:f0:2c:31:68
Sending on   LPF/enp59s0f0/40:a8:f0:2c:31:68
Sending on   Socket/fallback
DHCPDISCOVER on enp59s0f0 to 255.255.255.255 port 67 interval 3 (xid=0xe029ca10)
DHCPREQUEST of 10.192.80.129 on enp59s0f0 to 255.255.255.255 port 67 (xid=0x10ca29e0)
DHCPOFFER of 10.192.80.129 from 10.192.80.254
DHCPACK of 10.192.80.129 from 10.192.80.254
bound to 10.192.80.129 -- renewal in 147 seconds.

Looking on SRV1 we can see that this has been relayed from the IPv4 address on Vlan 2005 of Leaf2:

root@srv1:/etc/dhcp# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
14:45:58.969588 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:45:59.971058 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
14:46:01.890686 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:46:01.890952 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323
14:46:08.511262 IP 10.192.80.254.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 300
14:46:08.511501 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323

This worked in this case, when Vlan2005 was only configured on Leaf2. We found, however, that it failed when Vlan2005 was enabled on both switches with the same IP address configured as Anycast GW. The reason is that the source IP Leaf2 used to send the DHCP packets was also configured on Leaf1, so when SRV1 replied to the request Leaf1 tried to process the packet itself instead of sending it to Leaf2.

The relayed DHCP packet needs to come from an IP address that is only configured on the device that sends it, so that the replies go back to the correct device. An Anycast GW IP is thus not suitable, as it is shared by all of them. Unfortunately SONiC does not allow a 'secondary', unique IP to be added alongside the Anycast GW IP on a Vlan interface. To get around this limitation we created a new loopback interface on Leaf2, assigned it an IP address, and placed it in the VRF:

dell-leaf2# show running-configuration interface Loopback 2
!
interface Loopback 2
 ip vrf forwarding Vrf_codfw
 ip address 1.2.3.4/32

We then added this config to the Vlan interface to tell it to source the DHCP relays from that IP:

interface Vlan2005 
 description private1-f-codfw
 ip vrf forwarding Vrf_codfw
 ip anycast-address 10.192.80.254/22
 ip dhcp-relay 10.192.64.10 vrf Vrf_codfw
 ip dhcp-relay source-interface Loopback2

With this in place we re-tried the DHCP request from SRV2. We did see the packet come in from source IP 1.2.3.4, but unfortunately the reply went to 10.192.80.254 still:

root@srv1:/etc/dhcp# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
15:15:05.556947 IP 1.2.3.4.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 302
15:15:06.558359 IP 10.192.64.10.67 > 10.192.80.254.67: BOOTP/DHCP, Reply, length 323

Further investigation revealed that we could also add this command to the Vlan interface configuration:

interface Vlan2005 
 ip dhcp-relay 10.192.64.10 vrf Vrf_codfw
 ip dhcp-relay source-interface Loopback2
 ip dhcp-relay link-select

This command caused two additional DHCP Option 82 elements to be added to the packet:

Option 82 Suboption: (5) Link selection
    Length: 4
    Link selection: 10.192.80.254
Option 82 Suboption: (11) Server ID Override
    Length: 4
    Server ID Override: 10.192.80.254
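Each of these sub-options is a simple code/length/value triple carrying an IPv4 address: Link Selection is sub-option 5 (RFC 3527) and Server Identifier Override is sub-option 11 (RFC 5107). The sketch below builds the bytes for both, purely as an illustration of the wire format; it is not how the switch generates them.

```python
import ipaddress

LINK_SELECTION = 5        # RFC 3527
SERVER_ID_OVERRIDE = 11   # RFC 5107


def suboption(code: int, address: str) -> bytes:
    """Encode one Option 82 sub-option carrying an IPv4 address."""
    payload = ipaddress.IPv4Address(address).packed
    return bytes([code, len(payload)]) + payload


# Both sub-options carry the Anycast GW address of Vlan2005.
extra = (suboption(LINK_SELECTION, "10.192.80.254")
         + suboption(SERVER_ID_OVERRIDE, "10.192.80.254"))
print(extra.hex(), len(extra))
```

The two sub-options add 12 bytes in total, which is consistent with the relayed request growing from 302 to 314 bytes between the captures taken without and with "link-select".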

Our vanilla ISC DHCPd test server replied to the source IP of the packet when these attributes were present, so the replies went back to Leaf2 and ultimately reached the end server:

root@srv1:~# tcpdump -i enp59s0f0.2004 -l -nn udp port 67 or udp port 68
listening on enp59s0f0.2004, link-type EN10MB (Ethernet), capture size 262144 bytes
15:12:02.011778 IP 1.2.3.4.67 > 10.192.64.10.67: BOOTP/DHCP, Request from 40:a8:f0:2c:31:68, length 314
15:12:02.011987 IP 10.192.64.10.67 > 1.2.3.4.67: BOOTP/DHCP, Reply, length 335

So if we use Anycast GW across multiple switches, each switch needs its own Loopback interface with a unique IP address, configured as the "ip dhcp-relay source-interface", and "ip dhcp-relay link-select" must also be enabled.
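Since this has to be repeated on every leaf with a different loopback address, the config lends itself to templating. The following sketch renders the per-switch stanzas using only the commands tested above and refuses to proceed if two switches share a relay source address. The function names and the address given for dell-leaf1 are hypothetical; only the dell-leaf2 values come from the test.

```python
def render_dhcp_relay(vlan_id, vrf, server_ip, loopback_id, loopback_ip):
    """Render the loopback and Vlan stanzas for one switch."""
    return "\n".join([
        f"interface Loopback {loopback_id}",
        f" ip vrf forwarding {vrf}",
        f" ip address {loopback_ip}/32",
        "!",
        f"interface Vlan{vlan_id}",
        f" ip dhcp-relay {server_ip} vrf {vrf}",
        f" ip dhcp-relay source-interface Loopback{loopback_id}",
        " ip dhcp-relay link-select",
    ])


def render_fabric(loopbacks, vlan_id, vrf, server_ip, loopback_id=2):
    """Render config for every switch, insisting on unique source addresses."""
    addresses = list(loopbacks.values())
    if len(set(addresses)) != len(addresses):
        raise ValueError("each switch needs its own relay source address")
    return {
        name: render_dhcp_relay(vlan_id, vrf, server_ip, loopback_id, address)
        for name, address in loopbacks.items()
    }


LOOPBACKS = {
    "dell-leaf1": "1.2.3.5",  # hypothetical address, not from the test
    "dell-leaf2": "1.2.3.4",  # address used in the test above
}

for name, config in render_fabric(LOOPBACKS, 2005, "Vrf_codfw", "10.192.64.10").items():
    print(f"### {name}\n{config}\n")
```

The uniqueness check mirrors the failure mode described above: a shared source address would send replies to whichever switch the fabric routes that address to.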

Option 82

The SONiC devices do insert Option 82 into DHCP messages they relay, as can be seen in the following snippet from a packet capture:

Option: (82) Agent Information Option
   Length: 29
   Option 82 Suboption: (1) Agent Circuit ID
       Length: 8
       Agent Circuit ID: 566c616e32303035
   Option 82 Suboption: (2) Agent Remote ID
       Length: 17
       Agent Remote ID: 30303a30303a30303a31303a31303a3130

Decoding these values we see the following information:

Agent Circuit ID: Vlan2005
Agent Remote ID:  00:00:00:10:10:10
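Both values are plain ASCII strings shown as hex in the capture, so the decoding can be reproduced with a couple of lines. This is a small sketch; the helper name is arbitrary.

```python
def decode_ascii_hex(value: str) -> str:
    """Decode a hex dump of an ASCII string, as shown in the packet capture."""
    return bytes.fromhex(value).decode("ascii")


circuit_id = decode_ascii_hex("566c616e32303035")
remote_id = decode_ascii_hex("30303a30303a30303a31303a31303a3130")
print(f"Agent Circuit ID: {circuit_id}")
print(f"Agent Remote ID:  {remote_id}")
```

The decoded lengths (8 and 17 characters) match the sub-option lengths reported in the capture.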

The remote ID is the MAC address of the Vlan2005 interface on Leaf2 which sourced the packet:

admin@dell-leaf2:~$ ip -br link show Vlan2005
Vlan2005@Bridge  UP             00:00:00:10:10:10 <BROADCAST,MULTICAST,UP,LOWER_UP> 

Unfortunately the SONiC config does not seem to provide any mechanism to customize what is included in these values. Specifically it does not seem to allow us to include the switch name, and access port ID the DHCP request was received on, which our install process currently uses to identify the server and assign the correct IP.

IP Filters on Routed interface

IPv4

IPv6

IP Filters on IRB interface

IPv4

IPv6

Filter access to RE/CPU/Device Services

Evaluate whether there is a mechanism like the loopback filter in JunOS, or per-daemon filters (SNMP, vty ACL, NTP ACLs etc.) as on Cisco, to prevent remote users from trying to SSH to the switch or similar. If nothing specific exists, it can be done in-band on the external data ports, but that is trickier to implement.

Failover Tests

Spine Switch Failure

Management Tests

User Account Creation

SSH Access to Management

SSH Key Auth

Management VRF

We should attempt to place the dedicated management port in a specific VRF ("mgmt" for instance).

Ideally all of the following tests would be run with this in place, and every function would either work as-is or have a way to be configured to work when access is via the mgmt VRF.

Interface    IPv4 address/mask    Master     Admin/Oper    BGP Neighbor    Neighbor IP    Flags
-----------  -------------------  ---------  ------------  --------------  -------------  -------
Ethernet72   172.16.1.6/30                   up/up         N/A             N/A
Ethernet76   172.16.1.2/30                   up/up         N/A             N/A
Loopback0    10.0.1.24/32                    up/up         N/A             N/A
Loopback1    10.10.10.1/32                   up/up         N/A             N/A
Vlan2004     10.192.64.254/22     Vrf_codfw  up/up         N/A             N/A            A
Vlan2005     10.192.80.254/22     Vrf_codfw  down/down     N/A             N/A            A
docker0      240.127.1.1/24                  up/down       N/A             N/A
eth0         10.193.0.185/16      mgmt       up/up         N/A             N/A

SNMP RO Access

This works: the devices were added to our LibreNMS system, which polls them via SNMP without issue.

NTP

The switch was configured to act as an NTP client in the management VRF towards our NTP servers:

   dell-leaf2# show running-configuration | grep ntp
   ntp server 208.80.153.77 minpoll 6 maxpoll 10
   ntp server 208.80.154.10 minpoll 6 maxpoll 10
   ntp server 208.80.155.108 minpoll 6 maxpoll 10
   ntp vrf mgmt

Following this, NTP sync was OK:

   dell-leaf2# show ntp associations 
   remote                      refid            st   t  when   poll   reach  delay  offset       jitter      
   ------------------------------------------------------------------------------------------------------ 
    208.80.153.77              162.159.200.1    4    u  27     64     3      0.299  4.204        0.426        
   *208.80.154.10              170.187.158.81   3    u  40     64     17     31.728 0.097        2.565        
    208.80.155.108             104.171.113.34   3    u  31     64     3      33.109 4.581        0.419       
   ------------------------------------------------------------------------------------------------------
   * master (synced), # master (unsynced), + selected, - candidate, ~ configured

LLDP

Server Side

The server can see the switch and correctly learns the switch port:

   root@srv1:~# lldpcli show neighbors ports eno1
   -------------------------------------------------------------------------------
   LLDP neighbors:
   -------------------------------------------------------------------------------
   Interface:    eno1, via: LLDP, RID: 1, Time: 0 day, 00:00:01
     Chassis:     
       ChassisID:    mac 3c:2c:30:4b:09:00
       SysName:      dell-leaf1
       SysDescr:     SONiC Software Version: SONiC.3.4.1-Enterprise_Base - HwSku: DellEMC-S5248f-P-25G-DPB - Distribution: Debian 9.13 - Kernel: 4.9.0-11-2-amd64
       MgmtIP:       10.193.0.185
       Capability:   Bridge, off
       Capability:   Router, on
       Capability:   Wlan, off
       Capability:   Station, on
     Port:        
       PortID:       local Ethernet32
       PortDescr:    test_srv1_1GTBase
       TTL:          120
   -------------------------------------------------------------------------------

Switch Side

The switch can also see the server details:

   dell-leaf1# show lldp neighbor Ethernet 32
   -----------------------------------------------------------
   LLDP Neighbors      
   -----------------------------------------------------------
   Interface:   Ethernet32,via: LLDP
     Chassis:
       ChassisID:    d0:94:66:86:4d:6c
       SysName:      srv1
       SysDescr:     Ubuntu 18.04.6 LTS Linux 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64
       TTL:          120
       MgmtIP:       1.1.1.1
       MgmtIP:       2620:0:861:11c::10
     Port
       PortID:       d0:94:66:86:4d:6c
       PortDescr:    eno1
   -----------------------------------------------------------

Per-interface LLDP control

The command "no lldp enable" was added to the switch config for Ethernet32. After a short time the switch details were no longer visible from the server:

   root@srv1:~# lldpcli show neighbors ports eno1
   -------------------------------------------------------------------------------
   LLDP neighbors:
   -------------------------------------------------------------------------------
   root@srv1:~# 

Same goes for the switch-side:

   dell-leaf1# show lldp neighbor Ethernet 32
   -----------------------------------------------------------
   LLDP Neighbors      
   -----------------------------------------------------------
   dell-leaf1#

sFlow Export

Set up sFlow export. At a minimum we could capture the packets with tcpdump and compare them to what the Junipers send. We could also try to set up pmacct or similar somewhere to validate that the flow data is OK.


Prometheus Export

Set up telegraf etc. as per their guide. We can test with curl; we don't need to actually set up our Prometheus to scrape it.


Puppet Agent

It would be interesting to test the puppet agent compatibility. We may not go down that road but good to know.

Automation Tests

RESTCONF

Basic tests to make sure we can talk to the interface and apply config.

Partial config replace

Validate that we can do a replace on a specific section, e.g. "bgp", without touching the entire config.