Migrate from VC switch stack to EVPN


Introduction

This page details the general approach that can be taken to migrate from our legacy, row-wide virtual-chassis switches to a routed spine/leaf model running EVPN. The steps are based on the work done, and lessons learned, migrating codfw rows A and B in late 2023 / early 2024.

Steps

Configure new EVPN switches

The new Spine/Leaf devices should be installed and connected to each other. They should also be in Netbox with links and IPs properly defined. The fabric should be up and ready for connectivity to external devices (core routers, legacy switches) before taking any of the next steps.

Connect spine switches to CRs and legacy ASWs

The first step is to connect the new spine switches to the CRs. Ideally these would be purely routed links; for the migration, however, we instead need to build the Spine side as an L2 trunk port and use an XLink vlan for BGP peering to the core routers. This allows us to bridge the legacy row-wide vlans to the CRs over the same link.
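
As a rough illustration, the Spine-side port facing a CR ends up looking something like the below (the interface name, vlan names, VLAN ID and IP are placeholders for illustration, not actual Homer output):

  # Trunk the legacy row-wide vlans plus the xlink vlan towards the CR
  set interfaces et-0/0/48 unit 0 family ethernet-switching interface-mode trunk
  set interfaces et-0/0/48 unit 0 family ethernet-switching vlan members [ private1-a-codfw public1-a-codfw xlink-cr1-ssw1 ]
  # IRB in the xlink vlan gives the spine an IP to BGP-peer with the CR
  set vlans xlink-cr1-ssw1 vlan-id 3001
  set vlans xlink-cr1-ssw1 l3-interface irb.3001
  set interfaces irb unit 3001 family inet address 198.51.100.0/31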

We also need to connect the Spine switches to the two VC master devices, using an ESI-LAG. This is built on the VC pair as normal (defined in Netbox). On the Spine side, create a LAG with the same ID on each of the spines, with one member each. We then add a definition in the Homer static config to tell the spines this is an ESI/MC-LAG, giving the same ID to both.
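
A minimal sketch of what this produces on the Spine side (the ae ID, ESI value, LACP system-id and member port below are invented for illustration); the key point is that both spines get the same ae ID, ESI and LACP system-id:

  # One member port per spine goes into the same ae
  set interfaces et-0/0/40 ether-options 802.3ad ae41
  # An identical ESI and LACP system-id on both spines makes this an all-active ESI-LAG
  set interfaces ae41 esi 00:00:00:00:00:00:00:0a:00:41
  set interfaces ae41 esi all-active
  set interfaces ae41 aggregated-ether-options lacp active
  set interfaces ae41 aggregated-ether-options lacp system-id 00:00:00:0a:00:41
  set interfaces ae41 unit 0 family ethernet-switching interface-mode trunk
  set interfaces ae41 unit 0 family ethernet-switching vlan members all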

In the case where the core router ports connecting to the legacy switch need to be re-used (or deactivated to fit within overall line-card bandwidth limits) the connections need to be done one at a time. Namely:

  1. Change the CR VRRP priority so that CR2 is primary for all Vlans
  2. Disable the ASW<->CR1 link on both sides by disabling the ports in Netbox & running Homer
  3. Disconnect the ASW->CR1 link
  4. Connect SSW1 to the CR1 using the newly freed port
  5. Connect the ASW port that previously went to CR1 to SSW1
  6. Set the ASW <-> SSW1 ports as LAG members and trunk across all legacy vlans
    1. Remember to set the ESI-LAG IDs in the static Homer YAML config
  7. Set the SSW1 -> CR1 port as an L2 trunk carrying all legacy vlans, plus the XLink one for BGP peering
  8. Set the CR1 -> SSW1 port as a standalone port with multiple sub-interfaces, each with the same VRRP/IP configuration as had been on the AE facing the ASW (see the sketch below)
  9. Push the config

At this point the two CRs should see each other in VRRP again, with the VRRP packets flowing CR1 -> SSW1 -> ASW -> CR2. We can then switch the VRRP master for all vlans to CR1, and repeat the process breaking the ASW->CR2 link and re-using the ports with SSW2 in the middle.
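
For reference, the CR1 config from step 8 ends up along these lines (interface name, unit/VLAN number, addresses and VRRP group numbers are placeholders), with one sub-interface per legacy vlan carrying the same IP/VRRP settings that were previously on the AE towards the ASW:

  set interfaces xe-3/0/4 vlan-tagging
  # One unit per legacy vlan, re-using the existing subnet, VIP and VRRP group
  set interfaces xe-3/0/4 unit 2017 vlan-id 2017
  set interfaces xe-3/0/4 unit 2017 family inet address 10.192.0.2/22 vrrp-group 17 virtual-address 10.192.0.1
  set interfaces xe-3/0/4 unit 2017 family inet address 10.192.0.2/22 vrrp-group 17 priority 110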

Migrate IP Gateways

At this point we have basically the same setup we had previously. Hosts are still connected to the old switches, and the CR routers are still the gateways for them all. We have just inserted the Spine switches into the layer-2 topology, so they are bridging frames between the CRs and ASWs.

The next step is to enable the BGP peering between the CRs and SSWs, with the normal policies and templates.
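
Once pushed, a quick sanity check on each CR is to confirm the sessions are established and routes are being received (the peer address below is a placeholder):

  show bgp summary
  show route receive-protocol bgp 198.51.100.1 table inet.0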

Once that is complete we can begin to look at moving the IP gateways to the switching layer. While this can be done without interrupting traffic, the operations are delicate. It's preferable to do it in a non-primary datacenter that is depooled if possible.

IPv6 Gateway

Moving the IPv6 gateway is the easier of the two, as there is no statically configured gateway address or default route on end hosts. Instead we rely on IPv6 router advertisements from the network to configure the default. This makes the move fairly easy, as we can have duplicate RAs being sent on the same vlan, then stop them on the CRs; hosts will begin using the route advertised by the switches once the one from the CRs expires. For a given vlan the steps are:

  1. Define IRB interfaces for the vlan in question on all Leafs in the row and on the Spines
    • (NOTE: If this step is being done before any hosts are moved, the IRBs are only needed on the Spines)
  2. Assign *new* IPs (v4 & v6) from the vlan's subnet and set these as the anycast IPs on all of the IRB interfaces
    1. On private vlans / where we have enough IPs, we should also add a new unicast IP to the IRB of each switch
  3. Push the configuration and confirm the IRB interfaces show 'up' on the spine devices, and the new IPs are pingable from hosts
  4. Check the hosts and confirm that they are receiving IPv6 RAs from the new IRBs, with the configured link-local address as next hop (see the example after this list)
    1. It's probably worth adding, on a test host, a static route to some destination via that link-local address and confirming a traceroute looks good
  5. Disable IPv6 RA generation for the vlan on the core routers
  6. Check connectivity on all hosts and validate that they are using the new route once the one via the CRs has expired
  7. Remove the IPv6 VRRP group for the particular vlan in Netbox, as well as the VRRP GW VIP.
  8. Push changes to core routers
    1. The core router sub-interface will just have a regular single IP now
    2. Check in the routing table on the CRs that the vlan's /64 subnet is still reachable via this interface
  9. Change the anycast IPv6 address on all the switch IRBs from the temporary one that was assigned to the one the CRs had been using in VRRP.
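
For the host-side checks in steps 4 and 6, something like the below works (the interface name is a placeholder, and rdisc6 comes from the ndisc6 package):

  # Show the default route(s) learned via router advertisements; once the CRs
  # stop sending RAs, only the switch IRB link-local should remain as gateway
  ip -6 route show default
  # Actively solicit RAs on the host's uplink interface to see who answers
  sudo rdisc6 eno1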

IPv4 Gateway

Moving the IPv4 gateway is slightly trickier, as the IP is statically configured on all end hosts, which will have a cached ARP entry for the CR VRRP MAC matching it. Deleting the VRRP IP and then adding it on the switches would cause a disruption anyway, but the ARP cache complicates things and would extend the outage beyond that.

Remember that at this stage every switch has an interface on the vlan's subnet, with both an anycast GW IP and a unicast IP. If the vlan is a public one, or for some reason we don't have a unicast IP, we will need to add a "secondary" IP from the subnet on each of the spines for the next step. In either case we should test from a selection of hosts on the vlan that the spine IPv4 IRB addresses are pingable and that we can ARP for them.

We can then use a trick, with cumin, to add additional routes to the hosts on the vlan: we add two more-specific routes (0.0.0.0/1 and 128.0.0.0/1) that together cover the entire IPv4 address space, with the next-hop set to the Spine switch IPs on the vlan. This allows us to force hosts to start sending outbound traffic via the Spines, without moving the default gateway IP or waiting on ARP timeout. Roughly the steps are:

  1. With cumin, ping the spine switch unicast IRB IPs in the vlan from every host, to prime their arp cache
    1. 'ping -c 1 <ssw1_irb_ipv4> && ping -c 1 <ssw2_irb_ipv4>'
  2. With cumin, check all devices have successfully cached the MAC addresses
    1. 'ip neigh show <ssw1_irb_ipv4> | awk "{print \$5}" && ip neigh show <ssw2_irb_ipv4> | awk "{print \$5}"'
  3. With cumin, add two static routes to each host to flip traffic to using those IPs instead of the default GW
    1. 'sudo ip route add 0.0.0.0/1 via <ssw1_irb_ipv4> && sudo ip route add 128.0.0.0/1 via <ssw2_irb_ipv4>'
    2. (it is strongly advised to test this on a single test host in advance of pushing to everything)
    3. At this point thorough checks are needed to validate things are ok
  4. With cumin, verify that all hosts are using the newly added routes to get to a test destination
    1. 'sudo ip -4 route get fibmatch 1.1.1.1 | awk "{print \$1,\$2,\$3}"'
    2. 'sudo ip -4 route get fibmatch 129.1.1.1 | awk "{print \$1,\$2,\$3}"'
  5. Delete the IPv4 VRRP group for the given vlan in Netbox, and the VRRP GW IP
  6. Change the anycast IP on all of the switches from the temporary one assigned when they were created to the one that had been in use on the CRs as VRRP GW
  7. Push the changes to CRs with Homer (this will delete the VRRP config but leave the CRs connected to the vlan)
  8. Push the changes to the Switches with Homer
  9. With cumin, clear any old ARP entry for the GW IP, and ping it to force an update
    1. 'sudo arp -d <gw_ip> && ping -c 1 <gw_ip>'
  10. With cumin, check that all devices now have the switch anycast MAC in their arp cache for the gateway IP
    1. 'ip neigh show <gw_ip> | awk "{print \$5}"'
  11. Delete the two static routes added earlier, so the hosts go back to using the default gateway IP
    1. 'sudo ip route del 0.0.0.0/1 via <ssw1_irb_ipv4> && sudo ip route del 128.0.0.0/1 via <ssw2_irb_ipv4>'
  12. With cumin, verify all hosts are back to using the default route
    1. 'sudo ip -4 route get fibmatch 1.1.1.1 | awk "{print \$1,\$2,\$3}"'
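
For reference, each of the steps above is a single cumin run from a cluster management host, along these lines (the host query is a placeholder for whatever selects the hosts on the vlan):

  sudo cumin 'A:codfw' 'ip -4 route get fibmatch 1.1.1.1 | awk "{print \$1,\$2,\$3}"'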


While this is a little convoluted, it worked perfectly for codfw rows A and B.

Migrate Hosts

Moving hosts can then be done fairly easily. We can use the Netbox migration script to configure a LEAF device for all hosts in a given rack and then push out the config. Once that is done we can move the links one by one.

Co-ordinating the move between teams can be tricky, however. When doing future rows it might make more sense to take a team/service-centric approach. I.e. rather than moving all the hosts in each rack one at a time, work with each team to come up with a migration plan that works best for them and involves the least depooling, master switching, etc. To be discussed, but while the rack-centric approach works for netops and dc-ops, it is not without its challenges for the rest of SRE.

Remove IRB ints from Spines

The principal reason to add the IRB / anycast GW interfaces to the Spine layer is to support, during the migration, hosts that are still connected to the old switches.

Once all hosts are connected to an LSW top-of-rack switch there is no need for the Spines to have an IP in the vlan at all, and it is best removed.