Wikimedia Cloud Services team/EnhancementProposals/Iteration on network isolation

This page contains a proposal for iterating on Cloud network isolation.

The changes described here are largely a grouping of at least 3 previous projects. The goals of this iteration are:

GOAL #1 defining and agreeing on the Cross-Realm_traffic_guidelines#Case_4:_using_isolation_mechanisms
GOAL #2 introducing a shared service-abstraction / load-balancer layer for some core openstack services (like the APIs, swift, and others)
GOAL #3 better separation of network workloads at L2 / switching layer.

It is worth noting that the changes are so interwoven that it doesn't make a lot of sense to consider them as separate projects and that's why we're grouping them in a single iteration.

Cross-realm traffic guidelines case 4

Changes now live at Cross-Realm_traffic_guidelines#Case_4:_cloud-dedicated_hardware.

New cloud-realm network: cloud-private

This is a cloud-realm VLAN/subnet defined in cloud dedicated hardware switches (known as cloudsw).

Key points:

all cloud hardware servers gain a new network leg in this new VLAN.
all cloud hardware servers are still connected to the production VLAN cloud-hosts-xyz for ssh/install/monitoring/management purposes.

Implementation details:

We can call the vlans for the cloud realm control-plane traffic "cloud-private-<rack>"
- This matches the use of "private1-<rack>" in production and is hopefully fairly intuitive

As a rule of thumb all cloud servers have a leg in their local //cloud-private// subnet
- We should choose a 'supernet' for the cloud-private ranges, and allocate all the per-rack subnets from it
  - SUGGESTED: 172.20.0.0/16
- Each cloudsw will use the first IP in each dedicated subnet (i.e. .1)
- Cloud hosts will need a static route for the 'supernet' towards this IP
- This IP is also used to BGP peer from switch to CloudLB, CloudGW, CloudService nodes etc.
  - Matching what we did with the //cloud-storage// vlans
- The reason for using a static route is that we can't have two default routes, and the existing default is via the prod realm 10.x gateway
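
The allocation scheme above can be sketched as follows. This is only an illustration: the supernet is the suggested 172.20.0.0/16, but the rack names and the choice of a /24 per rack are assumptions made for the example, not agreed values.

```python
import ipaddress

# Suggested supernet from this proposal; the rack list is hypothetical.
SUPERNET = ipaddress.ip_network("172.20.0.0/16")
RACKS = ["c8", "d5", "e4", "f4"]  # illustrative only


def allocate(supernet, racks, prefixlen=24):
    """Give each rack the next free subnet of the given size from the supernet."""
    subnets = supernet.subnets(new_prefix=prefixlen)
    plan = {}
    for rack, subnet in zip(racks, subnets):
        gateway = subnet.network_address + 1  # cloudsw takes the first IP (.1)
        plan[rack] = {
            "vlan": f"cloud-private-{rack}",
            "subnet": subnet,
            "gateway": gateway,
            # static route every cloud host in this rack would carry
            "static_route": f"{supernet} via {gateway}",
        }
    return plan


plan = allocate(SUPERNET, RACKS)
for entry in plan.values():
    print(entry["vlan"], entry["subnet"], entry["static_route"])
```

The static route always covers the whole supernet, so a host needs no route changes when a new rack subnet is allocated later.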

We can allocate a /24 from the 'supernet' to use for 'internal' /32 service IPs used within the cloud realm
A public IP block can also be assigned for 'external' /32 service IPs for services exposed to the internet
- Cloud hosts should also have a static route for the block public VIPs are taken from, if they need to connect to those, which will keep the traffic routing optimally within the cloud realm.
  - The 'labs-in' filter on the CR routers filtering traffic from the cloud prod realm should block traffic to these public VIPs from cloud host 10.x IPs, just in case in some circumstances the static route for the public VIP range is not present on a cloud host.

In Codfw the setup is largely the same with some minor differences, such as smaller network assignments. We only have a single cloudsw there which simplifies the deployment/POC.

LB layer

The reasoning for this service abstraction / LB layer is:

to abstract several services behind a single VIP, both for internet and internal-only consumption.
to avoid using too many public IPv4 addresses to expose services over the internet.
to introduce shared redundancy and HA capabilities by means of software like HAproxy or similar.

Implementation details:

New CloudLBs will use BGP to announce service IPs / VIPs to their directly connected cloudsw
- Active/Passive is probably the easiest way to operate this on day 1; the backup LB should announce service IPs with the AS path prepended.
- If active host dies then backup routes get used instead
- HAproxy, or other software on the box, can also manipulate the BGP attributes, withdraw routes etc. to affect which LB is used
- Full active/active Anycast is also an option, but we can probably consider that additional complexity later
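
To illustrate why AS-path prepending gives active/passive behaviour, here is a simplified model of the decision the cloudsw would make. It only compares AS-path length, which is one step of BGP best-path selection (real routers evaluate other attributes first). The ASN, host names and VIP are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # announced /32 service IP
    next_hop: str    # which CloudLB announced it
    as_path: tuple   # AS numbers; prepending repeats the origin AS


def best_route(routes):
    """Prefer the shortest AS path (a single step of BGP best-path selection)."""
    return min(routes, key=lambda r: len(r.as_path)) if routes else None


LB_AS = 64710              # hypothetical private ASN
VIP = "185.15.56.10/32"    # hypothetical service IP

active = Route(VIP, "cloudlb-a", (LB_AS,))
backup = Route(VIP, "cloudlb-b", (LB_AS, LB_AS, LB_AS))  # prepended twice

rib = [active, backup]
print(best_route(rib).next_hop)   # active LB wins while it is announcing

rib.remove(active)                # active LB dies or withdraws the route
print(best_route(rib).next_hop)   # backup route is now the only candidate
```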

/32 Service IPs should be from the cloud-private supernet if the service only needs to be reachable within the cloud realm
/32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud
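
A small sketch of how the two rules above could be checked, for example in a CI validation step. The public /24 is the one named above; the internal /24 is a hypothetical slice of the suggested supernet, since no range has been chosen yet.

```python
import ipaddress

PUBLIC_POOL = ipaddress.ip_network("185.15.56.0/24")    # named in this proposal
INTERNAL_POOL = ipaddress.ip_network("172.20.255.0/24")  # hypothetical choice


def pool_for(reachable_from_outside: bool):
    """Pick the pool a service IP must come from."""
    return PUBLIC_POOL if reachable_from_outside else INTERNAL_POOL


def validate_vip(vip: str, reachable_from_outside: bool) -> bool:
    """True if `vip` is a /32 taken from the pool matching its exposure."""
    net = ipaddress.ip_network(vip)
    return net.prefixlen == 32 and net.subnet_of(pool_for(reachable_from_outside))


print(validate_vip("185.15.56.10/32", reachable_from_outside=True))
print(validate_vip("185.15.56.10/32", reachable_from_outside=False))
```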

Cloud-private ranges are **not** announced to CR routers by the cloudsw's.

CloudLB forwards traffic back out via the cloud-private vlan to "real servers" running the various services
- HAproxy controls this
Real servers can do "direct return" (via cloudsw IP on cloud-private) for return traffic
- i.e. no need for the return traffic to route via the CloudLB in that direction

predicted usage of the new LB abstraction layer

Consider this the desired end state, in which pretty much everything is already connected to cloud-private and available to become a backend for the new LB abstraction layer.

Service	Access from Internet (ingress)	Access for VMs	Access for neighbors in cloud-private	Access for wikiland prod*	Access to internet (egress)	Comment
cloudcontrol	Yes	Yes	Yes	Yes, LDAP, Wikitech (2fa verification and project page management)	No, but perhaps to download base images (HTTP)	core openstack APIs. There may be some software (striker, wikitech, etc) that need access to some REST endpoints.
cloudswift	Yes	Yes	Yes, ceph	No	No	AWS/s3-like storage endpoint, abstraction for ceph storage. Needs connectivity with ceph.
cloudservices	Yes	Yes	Maybe yes?	No	Yes, recDNS	This includes UDP services (DNS). May need a later iteration on haproxy for this.
cloudrabbit	No	Yes (trove)	Yes	No	No	All openstack components need to contact rabbitmq. As long as they are all in cloud-private, this should be fine.
cloudweb	Yes	Yes	Yes	Yes, LDAP	Maybe no?	This currently includes: horizon, wikitech, striker

* wikiland prod in the above refers to resources in the wikimedia production networks that are not reachable from the public internet. Things like Wikimedia auth dns, or Wikipedia, are not considered wikiland prod as per this definition, as they are publicly accessible resources.

Open questions

As of this writing there are a couple of open questions in this project.

How to connect cross-DC

We have identified the need to carry cross-DC traffic between the cloud-private VLANs. Example use case: cinder backups.

How to instrument this connection is an open question. Some options we are evaluating:

creating GRE tunnels on each cloudgw to the cloudgw in the other DC. To support failover between cloudgw boxes, we will run iBGP between them across the GRE tunnels using https://bird.network.cz/ or similar.

creating GRE tunnels on cloudsw devices to the cloudsw in the other DC. The problem with this is what to do with the cloudgw egress NAT: all servers would be behind the NAT.
- netops would probably prefer not to create GRE tunnels on the switch devices. There is very limited support for it on our current devices, but it is not a feature typically found on datacenter switches, and adding a requirement for it would limit device selection in future. If the cloudgw is not deemed suitable to do the tunneling probably another external device (server or dedicated routing hardware), logically in the same place/vlans, would be best.
- NAT shouldn't be a worry. The existing NAT rule is written such that it only operates on packets from the instances subnet: ip saddr != $virtual_subnet_cidr counter accept. So we have fairly easy control over what is NAT'ed and what is not.
- The use of the GREs on the cloudgw would only involve it announcing the cloud-private subnets to the network at each site. It does not require the cloudgw's to announce a default route (which would make all traffic pass through them). Only traffic for the remote datacenter cloud-private subnets would have to flow that way.
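
The effect of the quoted rule can be modelled as below. The subnet value is a placeholder standing in for $virtual_subnet_cidr, and the function only captures the "does this packet fall through to the NAT rules" decision, not the rest of the ruleset.

```python
import ipaddress

# Hypothetical value standing in for $virtual_subnet_cidr (the instances subnet).
VIRTUAL_SUBNET = ipaddress.ip_network("172.16.0.0/21")


def reaches_nat_rules(src: str) -> bool:
    """Model of `ip saddr != $virtual_subnet_cidr counter accept`.

    Packets whose source is outside the instances subnet are accepted
    untouched; only instance traffic falls through to the NAT rules
    that follow in the chain.
    """
    return ipaddress.ip_address(src) in VIRTUAL_SUBNET


print(reaches_nat_rules("172.16.1.20"))  # instance address: continues to NAT
print(reaches_nat_rules("172.20.3.7"))   # cloud-private address: left alone
```

Under this model, cloud-private traffic arriving at the cloudgw over a GRE tunnel would keep its original source address.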

Potential Design

The diagram below illustrates how a potential design might work.

NOTES:

Not shown is another GRE tunnel from cloudgw1002 to codfw
- This would be set up exactly the same way and provide a backup path between sites

The GRE tunnels are established between WMF prod realm IPs on the cloudgw's either side.
- This allows cloud-private traffic to tunnel over the WMF Prod WAN network between sites.
- The WMF WAN does not, however, need to know about cloud-private subnets or run separate VRFs.

The cloud-private-XX vlans can replace the cloud-instance-transport1-b-eqiad vlan.
- This is used to route public and instance traffic between cloudsw and cloudgw currently, in the cloud vrf.
- Once we add the per-rack cloud-private vlans, this traffic can go over it instead.

- The cloud-private-XX interface should belong to vrf-cloud on the cloudgw's.
  - This will allow the cloudgw to route between the cloud-private ranges and cloud-instance subnet if required.
  - The cloudgw can have FW/NFT rules to control what traffic is allowed between these.
- The GRE tunnel interfaces also belong to vrf-cloud.

Hosts on the cloud-private Vlans have a static route for the cloud-private 'supernet' pointing to the cloudsw IP (.1) on the local cloud-private subnet.
- For instance if the cloud-private supernet is 172.16.0.0/18, and we have 172.16.27.0/24 allocated for Eqiad E4, then hosts in Eqiad E4 have a route for 172.16.0.0/18 pointing at 172.16.27.1.
- On cloudnet's this is part of the main netns.
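
Taking the example figures above (172.16.27.0/24 for Eqiad E4 inside an /18 supernet), the resulting host routing table can be sketched with a longest-prefix-match lookup. The 10.x gateway address is hypothetical.

```python
import ipaddress

rack_subnet = ipaddress.ip_network("172.16.27.0/24")  # Eqiad E4 in the example
supernet = rack_subnet.supernet(new_prefix=18)         # the /18 that contains it
cloudsw_ip = rack_subnet.network_address + 1           # first IP in the subnet

# (destination, next hop) as seen by a host in Eqiad E4.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "10.64.20.1 (prod realm gateway)"),
    (supernet, f"{cloudsw_ip} (cloudsw)"),
    (rack_subnet, "on-link"),
]


def lookup(dst: str) -> str:
    """Return the next hop of the most specific route matching `dst`."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(supernet)                   # network address of the /18
print(lookup("172.16.27.50"))     # same rack
print(lookup("172.16.40.9"))      # another rack, via the cloudsw
print(lookup("208.80.154.224"))   # everything else, via the prod default
```

Because the supernet route is more specific than the default, no second default route is needed on the host.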

Assuming no tunnelling protocol is used between CloudLB and back-end 'realservers' on the cloud-private subnets, then this setup requires CloudLB to do source and destination NAT.
- This is because the realservers may sit on other subnets in other racks, only reachable by routing across the cloudsw
- Destination NAT is needed so the cloudsw can route the packet to the correct realserver wherever it is.
- Source NAT is needed so the return traffic goes back via the CloudLB, and the NATs can be reversed correctly.
- It's understood the current cloud HA-proxy setup is already functioning in "proxy mode" so this shouldn't be a problem.
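
The address rewriting described above can be modelled as follows. This is an address-level sketch only: HAproxy in proxy mode achieves the same effect by terminating the client connection and opening a new one to the realserver. All IPs are hypothetical, and the state table is reduced to one flow per realserver for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class CloudLB:
    def __init__(self, vip, own_ip, realserver):
        self.vip, self.own_ip, self.realserver = vip, own_ip, realserver
        self.state = {}  # realserver -> original client (simplified tracking)

    def forward(self, pkt):
        """Client -> VIP becomes CloudLB -> realserver (destination + source NAT)."""
        assert pkt.dst == self.vip
        self.state[self.realserver] = pkt.src
        return Packet(src=self.own_ip, dst=self.realserver)

    def reverse(self, pkt):
        """Realserver -> CloudLB becomes VIP -> client (both NATs undone)."""
        client = self.state[pkt.src]
        return Packet(src=self.vip, dst=client)


lb = CloudLB(vip="185.15.56.10", own_ip="172.20.1.10", realserver="172.20.2.20")
request = Packet(src="198.51.100.7", dst="185.15.56.10")
to_backend = lb.forward(request)
reply = Packet(src=to_backend.dst, dst=to_backend.src)  # realserver answers the LB
to_client = lb.reverse(reply)
print(to_backend, to_client)
```

Without the source NAT step the realserver would reply straight to the client address, and the CloudLB would never see the packet it needs to translate back.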

CloudDNS / authdns can use /32 service IPs in an exact mirror of the CloudLB's
- They announce a public IP over the local cloud-private vlan using BGP
- They source outbound traffic directly from this public IP (for instance configured on lo0)
- The mechanism/puppetization for Bird can be identical to CloudLBs

Dedicated cloud-public Vlan

Some services, primarily rec/authdns on cloudservice nodes, need non-HTTP internet access. One approach to this is to add "cloud-public" subnets to the cloud-vrf, each with a dedicated public subnet. The overall design would stay the same as described above, but with these new vlans added.

Doing this would mean DNS servers would not need to run a BGP daemon to announce their public IPs over the cloud-private vlans.

Pros:

Matches exactly the current setup for auth/rec DNS on cloudservice

Cons:

Requires adding two new vlans to the setup
Uses a minimum of 8 public IPv4 addresses in each rack where it's enabled.
- Minimum 2 racks, for redundancy between cloudservice nodes, thus minimum 16 IPs (compared to only 4 if using BGP VIPs)
  - There would be 3 free in the subnets for each rack however
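
The address arithmetic behind these numbers can be written out as below. The breakdown of reserved addresses (network, broadcast and gateway) is our assumption about how the 8 addresses are consumed, and the /29 shown is a documentation range used as a placeholder.

```python
import ipaddress

RACKS = 2                 # minimum for redundancy between cloudservice nodes
SERVICE_IPS_PER_HOST = 2  # one for auth DNS, one for rec DNS

# Option A: dedicated cloud-public subnet per rack. A /29 is the smallest
# block holding 8 addresses; 198.51.100.0/29 is only a placeholder.
per_rack = ipaddress.ip_network("198.51.100.0/29")
vlan_total = per_rack.num_addresses * RACKS

# Assumed per-subnet overhead: network + broadcast + gateway addresses.
reserved = 3
free_per_rack = per_rack.num_addresses - reserved - SERVICE_IPS_PER_HOST

# Option B: /32 VIPs announced over BGP, with no per-subnet overhead.
bgp_total = SERVICE_IPS_PER_HOST * RACKS

print(vlan_total, free_per_rack, bgp_total)
```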

Default route for servers connected to cloud-private

The default route is the route that will be used to reach the internet. The question is: how will physical servers in cloud-private connect to the internet?

Options we're evaluating:

having the default route on the cloud-realm. We'd need to NAT them, so that forces the cloud-private subnet to be behind cloudgw.
having the default route on the wikiland-prod-realm. And therefore following the rules they have there. This includes using egress HTTP proxy for things like cURL. So far we're comfortable with this approach.
- This is the status quo. But physical servers on private production subnets (i.e. cloud-hosts1-eqiad) have no direct internet access. Hosts requiring direct internet access have instead been put on production public Vlans. But we are trying to minimize use of those vlans in general within SRE, and specifically for cloud it adds a lot of cross-realm traffic we can deal with in a better way.
- This means all cloud physical hosts would use the default route to connect to all wikiland services, including:
  - shared stuff like LDAP or wikitech endpoints
  - production wikis endpoints
  - auth DNS

what happens with REC DNS servers? Those need constant internet connectivity. Is it desirable to have this on the wikiland-prod realm? or better use the cloud-realm?
- we could have a cloud-public vlan for them, similar to the per-rack public/private layout that wikiland-prod has for other vlans.
  - Ideally this should route via the cloud-realm, as it's not strictly management traffic
  - A 'cloud-public' vlan is definitely a possibility, as described above
  - Alternately the cloudservice nodes, which host both auth and rec DNS services, could sit on the cloud-private subnet
    - LBs could proxy traffic to them over the cloud-private subnet, same as any other load-balanced service
    - Cloudservice hosts could also announce 2 public IPv4 address to the switches with BGP
    - Mirrors the way they have 2 IPs on prod-public right now, for separate auth and rec dns services
    - As the routes are announced to the cloud-vrf, they are reachable from the cloud-private subnet

Potential Design

If the cloudgw or other box is going to NAT traffic from cloud-private IPs, some changes to the routing on the switches would be needed.

Specifically we need to ensure that traffic from the cloud-private subnets is routed to the NAT box, and also provide the NAT box with a gateway it can use to send external traffic to the internet.

The current cloud vrf on those switches has a default route that comes from the CR routers. If cloud-private subnets are connected to that then their outbound traffic won't route via the NAT device. If we change the default to point to the NAT box, then how does the NAT box itself send traffic out to the internet?

Ultimately two separate routing instances would be required to support it. One which carries the traffic from Cloud hosts to the NAT box, and one which carries traffic from the NAT box to the CR routers, after it's been NAT'd. The only way to overcome that requirement would be to use L2 segments that cloud switches do not participate in. But we do not wish to have Vlans trunked across multiple switches and potentially create loops.
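
The reasoning above can be illustrated with a toy model that follows default routes hop by hop. The node and VRF names are hypothetical, and each VRF is reduced to a single default next hop.

```python
def egress_path(start, tables, attachments, max_hops=5):
    """Follow default routes from `start` until the CR routers or a loop.

    tables:      VRF name -> default next hop in that VRF
    attachments: node -> VRF the node sends its outbound traffic into
    Returns (hops, reached_cr).
    """
    hops, node = [start], start
    while node != "cr" and len(hops) <= max_hops:
        node = tables[attachments[node]]
        if node in hops:          # next hop already visited: routing loop
            hops.append(node)
            return hops, False
        hops.append(node)
    return hops, node == "cr"


# Single VRF whose default points at the NAT box: the NAT box's own
# post-NAT traffic is routed straight back to itself.
single = egress_path(
    "host",
    tables={"cloud": "natbox"},
    attachments={"host": "cloud", "natbox": "cloud"},
)

# Two routing instances: hosts default to the NAT box, NAT box defaults to the CRs.
double = egress_path(
    "host",
    tables={"inside": "natbox", "outside": "cr"},
    attachments={"host": "inside", "natbox": "outside"},
)

print(single)
print(double)
```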

NOTE: Netops are of the opinion that if only a small number of systems (i.e. authdns) are identified that need direct outbound internet access (non-HTTP), this complexity is not worth adding. In that case, connectivity for those systems is better provided using the BGP mechanism (similar to cloud-lb) or a direct cloud-public vlan, while keeping a single VRF as before.

The 2 VRF design could be configured as follows if we decide we do need it.


NOTES:

CloudLB, CloudDNS and other devices originating /32 service IPs should also connect to both Vlans/VRFs
- One connects them to the internet externally, where they have their default route
- One connects them to the cloud-private subnets on the inside

Proof of concept build-out

We will create a POC to validate the whole setup.

Potentially available servers

As of this writing the available hardware servers are:

codfw:
- cloudgw2001-dev is currently spare awaiting decomm. We can repurpose it to be cloudlb. https://netbox.wikimedia.org/dcim/devices/1774/
eqiad:
- cloudswift1001/1002 are currently spare. They can be repurposed as cloudlb. https://netbox.wikimedia.org/dcim/devices/3524/ https://netbox.wikimedia.org/dcim/devices/3525/
  - moreover, Andrew thinks we don't need dedicated cloudswift servers anymore, as we can run radosgw on cloudcontrols

Previous work

Some references to related work that has been done previously:

for GOAL #1:

for GOAL #2

for GOAL #3
- T314847 - Separate WMCS control and management plane traffic
- Network_design_-_Eqiad_WMCS_Network_Infra