Network design - Eqiad WMCS Network Infra

This page details the configuration of the network devices managed by SRE Infrastructure Foundations (netops) to support cloud services in the Eqiad (Equinix, Ashburn) datacenter. Further information on the overall WMCS networking setup, including elements managed by the WMCS team themselves, is on the Portal:Cloud VPS/Admin/Network page.

Physical Network

The dedicated physical network currently consists of 4 racks of equipment: C8, D5, E4 and F4. Six Juniper QFX-series switches are deployed across the 4 racks. Additionally, rack C8 is connected to the virtual-chassis switches in row B, to provide connectivity for legacy servers installed there.

Racks C8 and D5 each have 2 switches, a main switch that connects servers and also has an uplink to one of the core routers, and a second switch which provides additional ports for servers. Most cloud hosts consume 2 switch ports, which means a single 48-port switch is not sufficient to connect all hosts in the racks, hence the second switch in each.

Racks E4 and F4 currently only have a single top-of-rack switch each, and it is hoped that in time WMCS can adjust the server configs to use 802.1Q vlan tagging, so that separate physical ports are not required to connect to two or more networks.

The network is configured in a basic Spine/Leaf structure, with the switches in C8 and D5 acting as Spines, aggregating traffic from E4 and F4, and connecting to the outside world via the CR routers. Connections between racks E4/F4 and C8/D5 are optical 40G Ethernet (40GBase-LR) connections over single-mode fiber. The topology is not a perfect Spine/Leaf, however, as there is also a direct connection between cloudsw1-c8 and cloudsw1-d5. This is required for various reasons, principally that there is only a single uplink from each cloudsw to the CR routers, and an alternate path is needed in case of a link or CR failure.


Logical Network

Several networks are configured on the switches described in the last section. At a high-level networks are divided into the "cloud" and "production" realms, which are logically isolated from each other. This isolation is used to support the agreed Cross-Realm traffic guidelines.

Isolation is achieved through the use of Vlans and VRFs (routing-instances in JunOS) on the cloudsw devices. The default routing-instance on the cloudsw devices is used for production realm traffic, and a named routing-instance, 'cloud', is used for the cloud realm.
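
As an illustrative sketch (the interface units listed here are examples rather than the live config), the JunOS routing-instance definition looks roughly like:

set routing-instances cloud instance-type virtual-router
set routing-instances cloud interface irb.1120
set routing-instances cloud interface irb.1106

Any IRB or link interface not listed under the instance stays in the default routing table, i.e. the production realm. The instance-type is shown as virtual-router for simplicity; a vrf instance-type with a route-distinguisher would achieve the same local isolation.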

Some networks exist purely at layer-2, with the switches only forwarding traffic between servers based on destination MAC address. The switches are unaware of the IP addressing used on those layer-2 segments and do not participate in routing. Those networks only carry traffic internal to the cloud realm. Specific cloud hosts, like cloudnet and cloudgw, act as the layer-3 routers for devices on these segments. They are not technically part of the cloud vrf, as there are no IP interfaces belonging to them on the switches, but they are considered to be part of the cloud realm.

Switch IP interfaces (Vlan/irb interfaces facing hosts and 'link' networks between switches) are configured to support 9,000 bytes of IP payload. Certain cloud hosts, such as the Ceph nodes, need the ability to send jumbo frames between themselves within the production network.
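
In JunOS terms this means raising both the physical interface MTU and the IRB family MTU, along the lines of the following sketch (interface and unit numbers are illustrative):

set interfaces xe-0/0/10 mtu 9216
set interfaces irb unit 1118 family inet mtu 9000
set interfaces irb unit 1118 family inet6 mtu 9000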


Production Realm

The below diagram shows an overview of the production realm routing configured on the cloud switches.

CR Uplinks

Cloudsw1-c8 and cloudsw1-d5 each have a 10G uplink to one of our core routers (CRs). 802.1Q sub-interfaces are configured on these links, and one sub-interface is used on each switch for production realm traffic. eBGP is used to exchange routes with the CRs, with separate BGP sessions for IPv4 and IPv6: IPv4 routes are exchanged over a session between the IPv4 addresses on either side, and IPv6 routes over a session between the IPv6 addresses.
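
On the cloudsw side each uplink amounts to something like the following (the interface, unit number, VLAN ID and addresses below are placeholders, not the live values):

set interfaces xe-0/0/48 vlan-tagging
set interfaces xe-0/0/48 unit 1101 vlan-id 1101
set interfaces xe-0/0/48 unit 1101 family inet address 192.0.2.1/31
set interfaces xe-0/0/48 unit 1101 family inet6 address 2001:db8:101::1/64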

The CRs only announce default routes to the switches. The switches announce all routes from the production realm to the CR routers. This includes all the connected subnets on each switch (both for end-hosts/production and link networks/infrastructure), as well as the production loopbacks configured on the switches. A maximum-prefix limit of 1,000 routes is applied to the eBGP sessions on the cloudsw side. This is a safeguard to protect the switches (which have limited TCAM space) should a full routing table somehow be announced by a CR in error.
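
A sketch of the corresponding BGP configuration with the 1,000-route limit is shown below. The group and policy names and the neighbor addresses are illustrative; the local AS is the documented 64710, and peer-as 14907 is the Wikimedia production AS used on the CRs:

set routing-options autonomous-system 64710
set protocols bgp group CR4 type external
set protocols bgp group CR4 peer-as 14907
set protocols bgp group CR4 export cloudsw-to-cr
set protocols bgp group CR4 family inet unicast prefix-limit maximum 1000
set protocols bgp group CR4 family inet unicast prefix-limit teardown
set protocols bgp group CR4 neighbor 192.0.2.0
set protocols bgp group CR6 type external
set protocols bgp group CR6 peer-as 14907
set protocols bgp group CR6 export cloudsw-to-cr
set protocols bgp group CR6 family inet6 unicast prefix-limit maximum 1000
set protocols bgp group CR6 family inet6 unicast prefix-limit teardown
set protocols bgp group CR6 neighbor 2001:db8:101::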

On the CR routers these peerings are in the Switch4 and Switch6 groups, along with the peerings to the EVPN Spine switches, which announce a similar set of routes (production subnets and device loopbacks). Filters are deployed on these peerings to ensure only correct routes are accepted by the CRs. The CR sub-interfaces have the cr-labs filters applied to them, which control what traffic is allowed in from cloud hosts. Additionally, uRPF is configured on these interfaces to ensure traffic arrives from valid source addresses.
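
On the CR side the per-unit protection amounts to something like the following sketch (the MX interface, unit number and the exact per-family filter names are assumptions for illustration):

set interfaces xe-3/2/3 unit 1102 family inet filter input cr-labs4
set interfaces xe-3/2/3 unit 1102 family inet rpf-check
set interfaces xe-3/2/3 unit 1102 family inet6 filter input cr-labs6
set interfaces xe-3/2/3 unit 1102 family inet6 rpf-check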

Routed Networks

Cloud Host Networks

Each rack has a dedicated Vlan for hosts to connect to the production realm. Cloud hosts use this Vlan for their initial provisioning, among other things. All of these Vlans have a /24 IPv4 subnet and a /64 IPv6 subnet configured. The cloudsw1 devices in each rack act as the L3 gateway for these subnets. In most cases the switches multicast IPv6 RAs to all hosts in the Vlan. RAs are not, however, enabled on the cloudsw2 devices in C8/D5, to prevent hosts in those racks from selecting the cloudsw2 as their gateway instead of the cloudsw1.
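
For a given rack this boils down to an IRB with both address families plus RAs on the cloudsw1, roughly as below (the VLAN ID, Vlan name and addresses are placeholders):

set vlans cloud-hosts1-e4-eqiad vlan-id 1121
set vlans cloud-hosts1-e4-eqiad l3-interface irb.1121
set interfaces irb unit 1121 family inet address 192.0.2.1/24
set interfaces irb unit 1121 family inet6 address 2001:db8:121::1/64
set protocols router-advertisement interface irb.1121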

IPv4 DHCP relay is enabled on all switches, to forward DHCP messages from connected hosts to the install server which processes the requests. DHCP Option 82 information is added to DHCP DISCOVER messages by the switch, which allows the install server to identify the host making the request and assign the correct IP. The cloudsw2 devices in C8 and D5 each have an IP interface on the cloud-hosts1 vlan for that rack, even though they do not act as gateway for the Vlan. They use these IPs as the source for relayed DHCP messages sent to the install server.
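
A minimal sketch of the relay configuration (the install server address and the group/server-group names are placeholders):

set forwarding-options dhcp-relay server-group install-servers 192.0.2.10
set forwarding-options dhcp-relay active-server-group install-servers
set forwarding-options dhcp-relay relay-option-82 circuit-id
set forwarding-options dhcp-relay group cloud-hosts interface irb.1121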

Vlan1118 - cloud-hosts1-eqiad

The switches in racks D5 and C8, as well as the asw2-b-eqiad virtual-chassis, also have the legacy cloud-hosts1-eqiad Vlan (1118) configured on them. This Vlan is trunked across these switches, and cloud hosts provisioned prior to the redesign are connected to it. Cloudsw1-c8 and cloudsw1-d5 run VRRP between them over this Vlan, acting as gateway for the hosts. No specific VRRP priority is configured, so master selection is non-deterministic. Over time this Vlan will be phased out, as replacement hosts will automatically get added to the new, rack-specific Vlans. In this way all hosts will eventually use a L3 gateway in the same rack.

The Neutron / cloudnet nodes send IP packets to address 224.0.0.1 over this Vlan, as part of a VXLAN-based keepalive mechanism. As the switches do not have routed multicast configured, or IGMP snooping enabled, these frames are treated as broadcast by the switches. This is not ideal, as all hosts in the Vlan receive the packets even though only a small subset need to. In the short term this adds weight to the idea of moving hosts that don't need to see them to the per-rack subnets, reducing the size of the remaining Vlan1118 broadcast domain.

Link Networks

Several Vlans are used as "link networks", with names all starting with "xlink-". Each is configured on the trunk between a pair of switches, with a /30 IPv4 subnet configured on matching IRB/Vlan interfaces on either side. These Vlans are used as the next-hop for routed traffic between switches, and to establish BGP peerings. Ideally one would use regular routed interfaces, with IPs bound directly to the interface (or sub-interfaces of it), but the inter-switch links need to be configured as L2 trunks instead, to support the stretched cloud-instances Vlan.
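
Each xlink therefore amounts to a Vlan on the inter-switch trunk plus a /30 on the IRB either side, along these lines (the VLAN ID, port and addresses are illustrative):

set vlans xlink-cloudsw1-e4 vlan-id 2001
set vlans xlink-cloudsw1-e4 l3-interface irb.2001
set interfaces et-0/0/52 unit 0 family ethernet-switching interface-mode trunk
set interfaces et-0/0/52 unit 0 family ethernet-switching vlan members xlink-cloudsw1-e4
set interfaces et-0/0/52 unit 0 family ethernet-switching vlan members cloud-instances2-b-eqiad
set interfaces irb unit 2001 family inet address 192.0.2.9/30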

L2 Vlans

The production realm has no Vlans configured that operate purely at layer-2. All production realm Vlans have routed IRB interfaces on the switches which act as L3 gateway for connected hosts.

Within racks C8 and D5 the cloud-hosts vlans are extended at layer-2 to the cloudsw2 devices in those racks, to provide additional ports for end servers (not shown on diagram).

Cloudsw BGP Routing

The cloudsw1 devices in all racks run BGP in the default routing-instance, and exchange production realm prefixes with each other. The "Spine" devices, cloudsw1-c8 and cloudsw1-d5, both use AS64710. eBGP is configured over the 'xlink' Vlans from each to cloudsw1-e4 (AS4264710003) and cloudsw1-f4 (AS4264710004). Cloudsw1-c8 and cloudsw1-d5 peer with each other using iBGP. All local networks (direct), static routes and BGP routes are enabled in the BGP export policy on all of these peerings. No finer-grained filters are used.
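
A sketch of what this looks like on cloudsw1-c8 (the neighbor addresses and the group/policy names are illustrative; the AS numbers are those above):

set routing-options autonomous-system 64710
set policy-options policy-statement cloudsw-out term direct from protocol direct
set policy-options policy-statement cloudsw-out term direct then accept
set policy-options policy-statement cloudsw-out term static from protocol static
set policy-options policy-statement cloudsw-out term static then accept
set policy-options policy-statement cloudsw-out term bgp from protocol bgp
set policy-options policy-statement cloudsw-out term bgp then accept
set protocols bgp group cloudsw-leaf type external
set protocols bgp group cloudsw-leaf export cloudsw-out
set protocols bgp group cloudsw-leaf neighbor 192.0.2.10 peer-as 4264710003
set protocols bgp group cloudsw-leaf neighbor 192.0.2.14 peer-as 4264710004
set protocols bgp group cloudsw-spine type internal
set protocols bgp group cloudsw-spine export cloudsw-out
set protocols bgp group cloudsw-spine neighbor 192.0.2.18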

BFD is configured for these sessions, with timers set to 1 second (the minimum recommended by Juniper).
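
In JunOS this is the bfd-liveness-detection knob on the relevant BGP groups, e.g. (group names as in the sketch above):

set protocols bgp group cloudsw-leaf bfd-liveness-detection minimum-interval 1000
set protocols bgp group cloudsw-spine bfd-liveness-detection minimum-interval 1000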

Static Routes

As the cloudsw2 devices in racks C8 and D5 are not licensed for BGP, static routes have been used on cloudsw1-c8 and cloudsw1-d5 to route the loopback IPs for them. The lack of BGP is of no particular concern as these devices are single-homed off the adjacent cloudsw1, so there is no failover requirement needing a dynamic protocol.
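
On each cloudsw1 this is simply a /32 static for the adjacent cloudsw2 loopback pointing at that switch's directly connected address; a sketch with placeholder addresses:

set routing-options static route 192.0.2.101/32 next-hop 192.0.2.66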

Cloud Realm / VRF

The below diagram provides an overview of routing in the Cloud VRF:

Routing Instance

All IP interfaces on the cloudsw devices in the cloud realm are placed into a dedicated VRF / routing-instance. This alternate routing instance has no entries for any of the production realm networks, and thus traffic cannot route directly between the cloud and production realms via the switches.

In general the routed topology for the cloud VRF is a mirror of the one in the default instance (production realm), just isolated. The cloud vrf is IPv4-only, as WMCS do not yet support IPv6. As with the production realm config, all IP interfaces support jumbo frames (required by Ceph in particular).

Static Routes

The following IPv4 ranges used by WMCS are statically routed to the cloudgw VIP on Vlan 1120 (cloud-instance-transport1-b-eqiad), on cloudsw1-d5 and cloudsw1-c8:

Prefix            Description
172.16.0.0/21     Cloud instance (VM) IP range; ideally should not be routable from WMF production (see T209011)
185.15.56.0/25    Cloud VPS range, eqiad
185.15.56.236/30  Link network, Vlan1107 cloudgw to cloudnet transport

In addition to these, an 'aggregate' route configuration is enabled on cloudsw1-c8-eqiad and cloudsw1-d5-eqiad, which peer with the core routers on site. The 'aggregate' config creates the 185.15.56.0/24 range whenever any contributing routes are present in the VRF routing table. This means the CR routers receive the full /24 from the cloud switches, giving them the route to announce publicly on the internet.
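
A sketch of both pieces inside the cloud instance (the cloudgw VIP shown as 185.15.56.244 is a placeholder; the prefixes are those in the table above):

set routing-instances cloud routing-options static route 172.16.0.0/21 next-hop 185.15.56.244
set routing-instances cloud routing-options static route 185.15.56.0/25 next-hop 185.15.56.244
set routing-instances cloud routing-options static route 185.15.56.236/30 next-hop 185.15.56.244
set routing-instances cloud routing-options aggregate route 185.15.56.0/24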

CR Uplinks

Dedicated sub-interfaces are configured for the cloud vrf on the same physical CR uplinks (from cloudsw1-c8 and cloudsw1-d5) as the production realm uses.

eBGP is configured over these links, again similar to the default table. The static routes described in the last section are exported to the CRs, making these ranges routable from WMF production. The CR sub-interfaces connecting the cloud vrf have the cloud-in filter applied, to control what traffic is allowed between the cloud realm and WMF production.

Routed Networks

Transit Vlan

Vlan 1120 (cloud-instance-transport1-b-eqiad) is configured on cloudsw1-c8 and cloudsw1-d5, with IP subnet 185.15.56.240/29 deployed. On each switch it only has a single host connected: cloudgw1001 is connected to cloudsw1-c8, and cloudgw1002 is connected to cloudsw1-d5. Given each is directly connected to only one upstream switch, they use the unique IP of that switch as their gateway. The static routes described earlier are configured with a next-hop of the shared cloudgw VIP in this subnet.

Storage Networks

The only other Vlans/subnets in the cloud realm with IP interfaces in the cloud vrf are the cloud-storage networks. These are used by the WMCS Ceph hosts as their 'cluster' network. The Vlan and switch IRB interface MTUs allow jumbo Ethernet frames to pass over these Vlans, and the Ceph hosts are configured for a 9,000 byte MTU. All the cloud-storage networks are within the RFC1918 192.168.0.0/16 range, and Ceph hosts need a static route for that supernet towards the cloudsw IRB IP (last in subnet) to communicate with each other. None of these ranges are announced in BGP to the CR routers; they are only used locally between the Ceph hosts, confined to the cloud vrf.

Vlan 1106 is the 'legacy' storage Vlan for Ceph hosts, and is trunked between all the cloudsw devices in racks c8/d5, as well as to the asw2-b-eqiad virtual-chassis. This is similar to Vlan1118 in the production realm. Cloudsw1-c8 and cloudsw1-d5 run VRRP between them and act as gateway for cloud hosts on this Vlan, providing VIP 192.168.4.254/24 as gateway to the other storage subnets.
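
A sketch of the VRRP config for this Vlan on one of the two switches (the physical IRB address and the VRRP group number are placeholders; the VIP is the documented 192.168.4.254, and the IRB sits in the cloud routing-instance):

set interfaces irb unit 1106 family inet mtu 9000
set interfaces irb unit 1106 family inet address 192.168.4.252/24 vrrp-group 6 virtual-address 192.168.4.254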

TODO: Create two new, per-rack storage Vlans / subnets for racks c8 and d5, so we can move Ceph hosts to always using a local gateway.

Link Networks

As in the production realm, several Vlans/subnets are configured for point-to-point routing between switches, BGP peering, etc.

L2 Vlans

CloudGW to CloudNET (Neutron) Transport

Vlan 1107 (cloud-gw-transport-eqiad) is configured at layer-2 only on the cloud switches. The cloudgw and cloudnet hosts are both connected to this Vlan, and use it to route traffic between them. A /30 IPv4 subnet is configured on the hosts connecting to this Vlan. The 2 usable IPs from the subnet are both configured as floating VIPs, one per side (cloudgw/cloudnet), assigned to whichever host is active on each side at any given moment.

Cloud Instances Vlan

Vlan 1105 (cloud-instances2-b-eqiad) is used by the actual OpenStack virtual machines themselves. It is only configured at layer 2 on the cloud switches. Unlike the other Vlans in the cloud realm this segment needs to extend to all cloudvirt hosts, which can be placed in any of the 4 WMCS racks, so it needs to be stretched at layer-2 across all the cloud switches. The cloudnet hosts (OpenStack Neutron) are also connected to this Vlan, and act as L3 gateway for the cloud instances (VMs).

The current requirement to stretch this vlan between all racks poses some problems in terms of the topology.

The principal problem with the requirement to stretch layer-2 is how to deal with loops. In an Ethernet network, broadcast and unknown-destination frames are flooded to all ports. However, as Ethernet has no time-to-live (or similar) field, if there is a loop in the topology such frames will loop forever, leading to broadcast storms. When Ethernet was initially designed (to operate on a shared physical wire) there was no expectation that separate Ethernets would be connected to each other, potentially introducing loops, so this wasn't an issue. When Radia Perlman was tasked with designing the first bridges she counseled against it for this very reason.

The traditional solution to this problem is Radia Perlman's Spanning Tree Protocol. This is not run elsewhere in the WMF, however, and presents several challenges, with some going as far as declaring that Spanning Tree is Evil. As such SRE Netops are reluctant to begin running this protocol in one small corner of the network, for one particular use case. The issue is solved differently in the production network (separate subnets per row/rack).

Indeed, even when using techniques to extend layer-2 segments redundantly between devices, there is a natural limit to the size of an Ethernet segment, as it represents a single broadcast/failure domain. So "solving" the issue on the network side is not a long term solution. Ultimately the way to address the problem is to re-work the OpenStack networking so hosts in different racks can use different subnets. Obviously challenges such as VM mobility make this tricky, but there are various approaches as it's a common problem.

Manual Intervention

For now the Vlan has been configured without any loop in the topology. It is trunked to cloudsw1-e4 and cloudsw1-f4 only from cloudsw1-c8. This means that, for instance, if the link from cloudsw1-e4 to cloudsw1-c8 goes down, cloudsw1-e4 is disconnected from the rest of the Vlan. Should this happen, manual intervention is required to add the Vlan to the allowed list on the links from cloudsw1-d5. To detail the operations needed:

NOTE: As these links are internal to our infra, the most likely way a link will go down (without a switch failing) is manual error by someone working in the datacenter. In that case the quickest remedy will usually be to contact whoever from DC-Ops is working in the datacenter and have them reverse whatever was done.

Cloudsw1-c8 to cloudsw1-e4 link break

ADD Vlan 1105 to the 'tagged vlans' on these two ports in Netbox: (instructions)

cloudsw1-e4-eqiad - et-0/0/54

cloudsw1-d5-eqiad - et-0/0/52

REMOVE Vlan 1105 from the 'tagged vlans' on these two ports in Netbox: (instructions)

cloudsw1-e4-eqiad - et-0/0/55

cloudsw1-c8-eqiad - et-0/0/52

Then run Homer from a cumin host:

homer cloudsw1* commit "Move Vlan 1105 cloud-instances from cloudsw-e4 uplink to d5"


Cloudsw1-c8 to cloudsw1-f4 link break

ADD Vlan 1105 to the 'tagged vlans' on these two ports in Netbox: (instructions)

cloudsw1-f4-eqiad - et-0/0/54

cloudsw1-d5-eqiad - et-0/0/53

REMOVE Vlan 1105 from the 'tagged vlans' on these two ports in Netbox: (instructions)

cloudsw1-f4-eqiad - et-0/0/55

cloudsw1-c8-eqiad - et-0/0/53

Then run Homer from a cumin host:

homer cloudsw1* commit "Move Vlan 1105 cloud-instances from cloudsw-f4 uplink to d5"


NOTE: Both devices in rows E/F have this Vlan enabled on their trunks to cloudsw1-c8, as it is connected to asw2-b-eqiad, where the CloudNET hosts (gateway for the subnet) reside. As such this represents the best option currently. If/when new CloudNET hosts are deployed, in racks c8 and d5, this can be changed so E4 uses C8 and F4 uses D5. The above instructions and diagram should be updated when this happens.

Cloudsw BGP Routing

BGP is configured in exactly the same way as for the default routing instance / production realm. All AS numbers are the same, with cloudsw1-c8 and cloudsw1-d5 again both using AS64710 and establishing eBGP sessions with the devices in racks E4 and F4. Cloudsw1-c8 and cloudsw1-d5 again have an iBGP session between them. The static routes for cloud networks that point to the cloudgw VIP are redistributed into BGP.
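
The same building blocks are simply nested under the routing-instance, roughly as below (the group/policy names and neighbor addresses are illustrative):

set routing-instances cloud routing-options autonomous-system 64710
set policy-options policy-statement cloud-out term static from protocol static
set policy-options policy-statement cloud-out term static then accept
set policy-options policy-statement cloud-out term bgp from protocol bgp
set policy-options policy-statement cloud-out term bgp then accept
set routing-instances cloud protocols bgp group cloud-leaf type external
set routing-instances cloud protocols bgp group cloud-leaf export cloud-out
set routing-instances cloud protocols bgp group cloud-leaf neighbor 192.0.2.26 peer-as 4264710003
set routing-instances cloud protocols bgp group cloud-leaf bfd-liveness-detection minimum-interval 1000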

BFD is configured on these sessions with 1-second timers, as in the production realm.

Notes

This topology can probably be extended to further racks if needed, with additional racks hanging off C8/D5 as E4 and F4 do. A limiting factor may be the available QSFP ports on the switches in C8/D5.

Ultimately, however, this does not represent a long-term, scalable design for cloud services. The lack of automatic failover on the cloud-instances Vlan is far from ideal. It also represents significant effort for SRE Netops to continue to design and maintain a separate physical network purely for Cloud services. If WMCS continues to expand and run from the WMF Eqiad datacenter, then in the longer term the options should probably be reviewed to see if an alternate design is possible which simplifies operations for both teams.