Wikimedia Cloud Services team/EnhancementProposals/Iteration on network isolation

From Wikitech

This page contains a proposal for iterating on Cloud network isolation.

The changes described here are largely a grouping of at least 3 previous projects. The goals of this iteration are:

It is worth noting that the changes are so interwoven that it doesn't make a lot of sense to consider them as separate projects and that's why we're grouping them in a single iteration.

Cross-realm traffic guidelines case 4

Cross-realm-traffic-guidelines-case 4 cloud-dedicated physical network.drawio.png
Cross-realm-traffic-guidelines-case 4 wikiland.drawio.png

The relevant section in the guidelines document will be renamed to case 4: cloud-dedicated hardware and rewritten as follows:

--- case4.orig.txt	2022-10-19 14:18:40.609290127 +0200
+++ case4.new.txt	2022-10-19 14:38:01.587933327 +0200
@@ -1,6 +1,4 @@
-This case covers an architecture in which a service is offered to/from the cloud realm using hardware servers instead of software defined (virtual) components. The hardware components are part of the cloud realm.
-
-The most prominent case of this situation are the openstack services and hosts themselves. We create one or more logical/physical (software or hardware) isolation layers to prevent unwanted escalation from one realm to the other.
+This case covers an architecture in which a service is offered to/from the cloud realm using hardware servers instead of software defined (virtual) components. The hardware components are part of the cloud realm even though they are connected to wikiland networks for ssh and management purposes.
 
 Most of the other models in this document cover either north-south native services or extremely simple east-west native services that can easily run inside the cloud. However, we often find ourselves in a dead end when trying to adapt east-west native services that cannot run inside the cloud to a north-south approach covered by other models in this document.
 In general, we find it challenging to adopt others models when a given service involves a combination of:
@@ -10,7 +8,7 @@
 * chicken-egg problems (running a mission-critical cloud component inside the cloud itself)
 * other complex architectures that require dedicated hardware.
 
-So, basically, this case covers a way to have logical/physical components that are dedicated to CloudVPS, are considered to be part of the cloud realm, and therefore aren't natively part of the production realm.
+So, basically, this case covers a way to have logical/physical components that are dedicated to CloudVPS, are considered to be part of the cloud realm, and therefore aren't natively part of the wikiland production realm except for ssh and management purposes.
 
 The upstream openstack project refers to this kind of use cases with a variety of names, depending on the actual implementation details:
 * provider hardware, hardware in service of openstack software-defined virtual machine clients.
@@ -21,21 +19,16 @@
 
 This use case must meet the following:
 * there is no reasonable way the service can be covered by the other models described in this document.
-* the hardware server must be double-homed, meaning they have 2 NICs ports, one for control plane (ssh, install, monitoring, etc) and other for data plane (where the actual cloud traffic is flowing).
-* the control/data plane is separated by an isolation barrier that has been identified as valid, secure and strong enough to meet the security demands.
-* if the situation requires it, there is a dedicated physical VLAN/subnet defined in the switch devices to host the affected services.
-* if there is a dedicated VLAN/subnet, it has L3 addressing that is also dedicated to WMCS, and is clearly separated from other production realm networks, this is for example, using '''172.16.x.x''' addressing.
-* if there is a dedicated VLAN/subnet, all the network flows are subject to firewalling and network policing for granular access control.
-* CloudVPS VMs clients accessing the services may or may not use NAT to access them.
+* the hardware server must be connected to the <code>cloud-private</code> subnet/VLAN, meaning their NIC should be trunked (native for ssh management in <code>cloud-hosts-xyz</code> and tagged for <code>cloud-private-xyz</code>).
+* all network traffic for cloud-realm-related services circulates using the <code>cloud-private-xyz</code> subnet and not <code>cloud-hosts-xyz</code>.
+* the management traffic that can circulate using <code>cloud-hosts-xyz</code> includes all usual wikiland production facilities such as puppetmasters, monitoring, LDAP, etc. Formal wikis endpoints are explicitly excluded from using this subnet.
 * in case the service needs some kind of backend data connection to a production service, such connection will use the normal network border crossing, with egress NAT on the cloud realm side and standard service mechanisms (like LVS) on production realm side.
-
-Example of isolation layers:
-* a linux network namespace
-* a docker container
-* a KVM virtual machine
+* CloudVPS VMs clients accessing the services may or may not use NAT to access them. The NAT is optional, and can be evaluated on a case by case basis.
+* if the service needs to be accessed from the internet or from wikiland production services, such exposure will use a public IPv4 address from the cloud-realm pool with associated DNS entries in the <code>wikimediacloud.org</code> domain.
+* moreover, to avoid wasting public IPv4 addresses, the new service should be behind an abstraction/load-balancing layer that is dedicated to the cloud-realm.
 
 === example: openstack native services ===
 
-Openstack native services use several layering mechanisms to isolate the 2 realms, for example:
-* neutron, linux network namespace + vlans. For a VM to cross realm it would need to escalate the vlan and/or the linux network namespace in which the neutron virtual router lives.
-* nova, kvm + vlans. For a VM to cross realm it would need to escalate vlan isolation and/or the kvm hypervisor.
+Openstack native services, such as the REST APIs or rabbitmq, run on dedicated hardware servers with this setup, and in particular:
+* openstack REST APIs are behind HAproxy in <code>cloudlb</code> servers, which host the public IPv4 address for them
+* rabbitmq is contacted by both the openstack REST APIs and VMs (Openstack Trove).

The diagrams on the right will be included in the document as well.

New cloud-realm network: cloud-private

This is a cloud-realm VLAN/subnet defined in cloud dedicated hardware switches (known as cloudsw).

Key points:

  • all cloud hardware servers gain a new network leg in this new VLAN.
  • all cloud hardware servers are still connected to the production VLAN cloud-hosts-xyz for ssh/install/monitoring/management purposes.

Implementation details:

  • We can call the vlans for the cloud realm control-plane traffic "cloud-private-<rack>"
    • This matches the use of "private1-<rack>" in production and is hopefully fairly intuitive
  • As a rule of thumb, all cloud servers have a leg in their local cloud-private subnet
    • We should choose a 'supernet' for the cloud-private ranges, and allocate all the per-rack subnets from it
    • Each cloudsw will be the default gateway for the local vlan/subnet (using the .1 address, configured within the cloud vrf)
    • Cloud hosts will need a static route for the 'supernet' towards this IP
      • Matching what we did with the cloud-storage vlans
    • The route has to be static because a host can't have two default routes, and the existing default is via the prod realm 10.x gateway
  • It probably makes sense to choose a /16 from 172.16.0.0/12 for the supernet, and allocate per-rack /24s from this.
  • We should probably dedicate a separate /24 from it for service IPs/VIPs
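
The allocation scheme above can be sketched in a few lines of Python. This is only an illustration: the supernet `172.20.0.0/16`, the rack names, and the choice of reserving the first /24 for service IPs are assumptions made for the example, not decisions recorded in this proposal.

```python
import ipaddress

# Assumption: placeholder supernet. The proposal only says "a /16 from 172.16.0.0/12".
SUPERNET = ipaddress.ip_network("172.20.0.0/16")
# Assumption: hypothetical rack names, used only for illustration.
RACKS = ["c8", "d5", "e4", "f4"]


def allocate(supernet, racks):
    """Carve a service /24 and one /24 per rack out of the supernet."""
    subnets = supernet.subnets(new_prefix=24)
    # Assumption: the first /24 is the one set aside for service IPs / VIPs.
    service_net = next(subnets)
    plan = {}
    for rack in racks:
        net = next(subnets)
        # The cloudsw takes the .1 address of each per-rack subnet.
        plan[rack] = (net, net.network_address + 1)
    return service_net, plan


service_net, plan = allocate(SUPERNET, RACKS)
for rack, (net, gateway) in plan.items():
    print(f"cloud-private-{rack}: {net} gateway {gateway}")
```

A /16 yields 256 /24s, so this layout leaves ample room for additional racks.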

LB layer

The reasoning for this service abstraction / LB layer is:

  • to abstract several services behind a single VIP, both for internet and internal-only consumption.
  • to avoid using too many public IPv4 addresses to expose services over the internet
  • introduce shared redundancy and HA capabilities by means of software like HAproxy or similar.

Implementation details:

  • New CloudLBs will use BGP to announce service IPs / VIPs to their directly connected cloudsw
    • Active/Passive is probably the easiest way to operate this on day 1: the backup LB should announce the service IPs with a prepended AS path.
    • If active host dies then backup routes get used instead
    • HAproxy, or other software on the box, can also manipulate the BGP attributes, withdraw routes etc. to affect which LB is used
    • Full active/active Anycast is also an option, but we can probably consider that additional complexity later
  • /32 Service IPs should be from the cloud-private supernet if the service only needs to be reachable within the cloud realm
  • /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from internet, WMF prod or codfw cloud
  • Cloud-private ranges are not announced to CR routers by the cloudsw's.
  • CloudLB forwards traffic back out via the cloud-private vlan to "real servers" running the various services
    • HAproxy controls this
  • Real servers can do "direct return" (via cloudsw IP on cloud-private) for return traffic
    • i.e. no need for the return traffic to route via the CloudLB in that direction
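
As a rough illustration of the active/passive behaviour and of the service IP pool choice described above, the following Python sketch models only the AS-path-length step of BGP best-path selection (real BGP evaluates other attributes first). The ASN, the VIP, and the LB names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: tuple


def best_route(routes):
    """Prefer the shortest AS path (a single step of BGP best-path selection)."""
    if not routes:
        return None
    return min(routes, key=lambda r: len(r.as_path))


def service_ip_pool(reachable_outside_cloud_realm):
    """Pick the pool a /32 service IP should come from."""
    if reachable_outside_cloud_realm:
        return "185.15.56.0/24"  # cloud realm public range
    return "cloud-private supernet"


VIP = "185.15.56.10/32"  # hypothetical service IP
ASN = 64512              # hypothetical private ASN
active = Route(VIP, "cloudlb-active", (ASN,))
backup = Route(VIP, "cloudlb-backup", (ASN, ASN, ASN))  # prepended twice

print(best_route([active, backup]).next_hop)  # active LB wins while it announces
print(best_route([backup]).next_hop)          # backup is used once active withdraws
```

When the active host dies its announcement is withdrawn, and the only remaining route is the prepended one from the backup LB.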

predicted usage of the new LB abstraction layer

Consider this the desired end state, in which pretty much everything is already connected to cloud-private and available to become a backend for the new LB abstraction layer.

| Service | Access from Internet (ingress) | Access for VMs | Access for neighbors in cloud-private | Access for wikiland prod* | Access to internet (egress) | Comment |
|---|---|---|---|---|---|---|
| cloudcontrol | Yes | Yes | Yes | Yes, LDAP, Wikitech (2fa verification and project page management) | No, but perhaps to download base images (HTTP) | core openstack APIs. There may be some software (striker, wikitech, etc) that need access to some REST endpoints. |
| cloudswift | Yes | Yes | Yes, ceph | No | No | AWS/s3-like storage endpoint, abstraction for ceph storage. Needs connectivity with ceph. |
| cloudservices | Yes | Yes | Maybe yes? | No | Yes, recDNS | This includes UDP services (DNS). May need a later iteration on haproxy for this. |
| cloudrabbit | No | Yes (trove) | Yes | No | No | All openstack components need to contact rabbitmq. As long as they are all in cloud-private, this should be fine. |
| cloudweb | Yes | Yes | Yes | Yes, LDAP | Maybe no? | This currently includes: horizon, wikitech, striker |

* wikiland prod in the above refers to resources in the wikimedia production networks that are not reachable from the public internet. Things like Wikimedia auth dns, or Wikipedia, are not considered wikiland prod as per this definition, as they are publicly accessible resources.

Open questions

As of this writing there are a couple of open questions in this project.

How to connect cross-DC

We have identified the need to carry cross-DC traffic between the cloud-private VLANs. Example use case: cinder backups.

How to implement this connection is an open question. Some options we are evaluating:

  • creating GRE tunnels on each cloudgw to the other cloudgw on the other DC. To support failover between cloudgw boxes, we will run iBGP between them across the GRE tunnels using https://bird.network.cz/ or similar.
  • creating GRE tunnels on cloudsw devices to the other cloudsw on the other DC. The problem with this is what to do with the cloudgw egress NAT: all servers would be behind the NAT.
    • netops would probably prefer not to create GRE tunnels on the switch devices. There is very limited support for it on our current devices, but it is not a feature typically found on datacenter switches, and adding a requirement for it would limit device selection in future. If the cloudgw is not deemed suitable to do the tunneling, another external device (server or dedicated routing hardware), logically in the same place/vlans, would probably be best.
    • NAT shouldn't be a worry. The existing NAT rule is written such that it only operates on packets from the instances subnet: ip saddr != $virtual_subnet_cidr counter accept. So we have fairly easy control over what is NAT'ed and what is not.
    • The use of the GREs on the cloudgw would only involve it announcing the cloud-private subnets to the network at each site. It doesn't require the cloudgws to announce a default route, which would make all traffic pass through them. Only traffic for the remote datacenter cloud-private subnets would have to flow that way.
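
The effect of the NAT rule quoted above (`ip saddr != $virtual_subnet_cidr counter accept`) can be sketched as a simple membership test: packets whose source is outside the instances subnet are accepted before any NAT is applied. The subnet and the sample addresses below are placeholders chosen for the example, not the actual values of `$virtual_subnet_cidr`.

```python
import ipaddress

# Assumption: placeholder for $virtual_subnet_cidr (the instances subnet).
VIRTUAL_SUBNET = ipaddress.ip_network("172.16.0.0/21")


def is_natted(src):
    """Return True if a packet from src would continue on to the NAT rules.

    Mirrors the rule: sources outside the instances subnet are accepted
    early and therefore never NAT'ed.
    """
    return ipaddress.ip_address(src) in VIRTUAL_SUBNET


print(is_natted("172.16.1.5"))   # a VM inside the instances subnet
print(is_natted("172.20.1.10"))  # a hypothetical cloud-private host
```

Under these assumptions, traffic from cloud-private hosts is left untouched as long as the cloud-private ranges do not overlap with the instances subnet.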

Potential Design

The diagram below illustrates how a potential design might work.

Singel vrf.png


NOTES:

  • Not shown is another GRE tunnel from cloudgw1002 to codfw
    • This would be set up exactly the same way and provide a backup path between sites


  • The GRE tunnels are established between WMF prod realm IPs on the cloudgw's either side.
    • This allows cloud-private traffic to tunnel over the WMF Prod WAN network between sites.
    • The WMF WAN does not, however, need to know about cloud-private subnets or run separate VRFs.


  • The cloud-private-XX vlans can replace the cloud-instance-transport1-b-eqiad vlan.
    • This is used to route public and instance traffic between cloudsw and cloudgw currently, in the cloud vrf.
    • Once we add the per-rack cloud-private vlans, this traffic can go over it instead.


    • The cloud-private-XX interface should belong to vrf-cloud on the cloudgw's.
      • This will allow the cloudgw to route between the cloud-private ranges and cloud-instance subnet if required.
      • The cloudgw can have FW/NFT rules to control what traffic is allowed between these.
    • The GRE tunnel interfaces also belong to vrf-cloud.


  • Hosts on the cloud-private Vlans have a static route for the cloud-private 'supernet' pointing to the cloudsw IP (.1) on the local cloud-private subnet.
    • For instance, if the cloud-private supernet is 172.16.0.0/18, and we have 172.16.27.0/24 allocated for Eqiad E4, then hosts in Eqiad E4 have a route for 172.16.0.0/18 pointing at 172.16.27.1.
    • On cloudnets this is part of the main netns.
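
To show why a static route for the supernet is enough, here is a minimal longest-prefix-match sketch for a host in Eqiad E4. The supernet is written in its canonical form, 172.16.0.0/18, which is the /18 block containing 172.16.27.0/24. The prod realm gateway address is a hypothetical placeholder.

```python
import ipaddress

ROUTES = [
    # Existing default route via the prod realm gateway (hypothetical address).
    (ipaddress.ip_network("0.0.0.0/0"), "10.64.20.1"),
    # Static route for the cloud-private supernet via the local cloudsw (.1).
    (ipaddress.ip_network("172.16.0.0/18"), "172.16.27.1"),
]


def next_hop(dst, routes=ROUTES):
    """Return the gateway of the most specific route matching dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, gw) for net, gw in routes if addr in net]
    _, gateway = max(matches, key=lambda m: m[0].prefixlen)
    return gateway


print(next_hop("172.16.30.7"))  # another rack's cloud-private subnet
print(next_hop("10.2.2.1"))     # anything else follows the default route
```

The more specific /18 wins over the default route, so only cloud-private traffic is steered to the cloudsw.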


  • Assuming no tunnelling protocol is used between CloudLB and back-end 'realservers' on the cloud-private subnets, then this setup requires CloudLB to do source and destination NAT.
    • This is because the realservers may sit on other subnets in other racks, only reachable by routing across the cloudsw
    • Destination NAT is needed so the cloudsw can route the packet to the correct realserver wherever it is.
    • Source NAT is needed so the return traffic goes back via the CloudLB, and the NATs can be reversed correctly.
    • It's understood the current cloud HAproxy setup is already functioning in "proxy mode", so this shouldn't be a problem.
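
The source and destination NAT described above can be sketched as a pair of address rewrites. This is a simplified model that ignores ports and connection tracking details; the VIP, the CloudLB address and the realserver address are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


VIP = "185.15.56.10"        # hypothetical service IP announced by the CloudLB
LB_PRIVATE = "172.20.1.10"  # hypothetical CloudLB address on cloud-private
REALSERVER = "172.20.2.20"  # hypothetical backend in another rack


def forward(pkt):
    """Client -> VIP becomes CloudLB -> realserver (destination + source NAT)."""
    state = {"client": pkt.src, "vip": pkt.dst}
    return Packet(src=LB_PRIVATE, dst=REALSERVER), state


def reverse(reply, state):
    """Undo both NATs on the reply so the client sees the VIP as the source."""
    return Packet(src=state["vip"], dst=state["client"])


request = Packet(src="203.0.113.7", dst=VIP)
natted, state = forward(request)
reply = Packet(src=natted.dst, dst=natted.src)  # realserver answers the CloudLB
print(natted)
print(reverse(reply, state))
```

Because the realserver only ever sees the CloudLB address as the source, its reply naturally routes back through the CloudLB, where both translations are reversed.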


  • CloudDNS / authdns can use /32 service IPs, exactly mirroring the CloudLBs
    • They announce a public IP over the local cloud-private vlan using BGP
    • They source outbound traffic directly from this public IP (for instance configured on lo0)
    • The mechanism/puppetization for Bird can be identical to CloudLBs


Dedicated cloud-public Vlan

Some services, primarily rec/authdns on cloudservice nodes, need non-HTTP internet access. One approach to this is to add "cloud-public" subnets to the cloud-vrf, each with a dedicated public subnet. The overall design would stay the same as described above, but with these new vlans added.

Doing this would mean DNS servers would not need to run a BGP daemon to announce their public IPs over the cloud-private vlans.

Cloud-public.png

Pros:

  • Matches exactly the current setup for auth/rec DNS on cloudservice

Cons:

  • Requires adding two new vlans to the setup
  • Uses a minimum of 8 public IPv4 addresses in each rack where it's enabled.
    • A minimum of 2 racks is needed for redundancy between cloudservice nodes, thus a minimum of 16 IPs (compared to only 4 if using BGP VIPs)
      • There would, however, be 3 free addresses in the subnet for each rack
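
The address arithmetic behind these numbers can be checked with a short sketch. It assumes that 8 addresses per rack means a /29, that one address goes to the gateway, and that each cloudservice node uses 2 public IPs (one for auth and one for rec DNS). These are readings of the figures above, not statements from the proposal, and the prefix used is from the documentation range.

```python
import ipaddress

RACKS = 2         # minimum for redundancy between cloudservice nodes
IPS_PER_NODE = 2  # assumption: one auth DNS IP and one rec DNS IP per node

# Assumption: a /29 per rack; 192.0.2.0/29 is a documentation placeholder.
subnet = ipaddress.ip_network("192.0.2.0/29")

per_rack = subnet.num_addresses        # 8 addresses consumed per rack
usable = per_rack - 2 - 1              # minus network, broadcast and gateway
free_per_rack = usable - IPS_PER_NODE  # addresses left over in each rack

vlan_total = RACKS * per_rack          # cloud-public vlan approach
bgp_total = RACKS * IPS_PER_NODE       # BGP-announced /32 VIPs approach

print(per_rack, free_per_rack, vlan_total, bgp_total)
```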

Default route for servers connected to cloud-private

The default route is the route that will be used to reach the internet. The question is: how will physical servers in cloud-private connect to the internet?

Options we're evaluating:

  • having the default route on the cloud-realm. We'd need to NAT them, so that forces the cloud-private subnet to be behind cloudgw.
  • having the default route on the wikiland-prod-realm. And therefore following the rules they have there. This includes using egress HTTP proxy for things like cURL. So far we're comfortable with this approach.
    • This is the status quo. But physical servers on private production subnets (i.e. cloud-hosts1-eqiad) have no direct internet access. Hosts requiring direct internet access have instead been put on production public Vlans. But we are trying to minimize use of those vlans in general within SRE, and specifically for cloud it adds a lot of cross-realm traffic that we can deal with in a better way.
    • This means all cloud physical hosts would use the default route to connect to all wikiland services, including:
      • shared stuff like LDAP or wikitech endpoints
      • production wikis endpoints
      • auth DNS
  • what happens with REC DNS servers? Those need constant internet connectivity. Is it desirable to have this on the wikiland-prod realm? or better use the cloud-realm?
    • we could have a cloud-public vlan for them, similar to the per-rack public/private layout that wikiland-prod has for other vlans.
      • Ideally this should route via the cloud-realm, as it's not strictly management traffic
      • A 'cloud-public' vlan is definitely a possibility, as described above
      • Alternately the cloudservice nodes, which host both auth and rec DNS services, could sit on the cloud-private subnet
        • LBs could proxy traffic to them over the cloud-private subnet, same as any other load-balanced service
        • Cloudservice hosts could also announce 2 public IPv4 addresses to the switches with BGP
        • Mirrors the way they have 2 IPs on prod-public right now, for separate auth and rec dns services
        • As the routes are announced to the cloud-vrf, they are reachable from the cloud-private subnet

Potential Design

If the cloudgw or other box is going to NAT traffic from cloud-private IPs, some changes to the routing on the switches would be needed.

Specifically we need to ensure that traffic from the cloud-private subnets is routed to the NAT box, and also provide the NAT box with a gateway it can use to send external traffic to the internet.

The current cloud vrf on those switches has a default route that comes from the CR routers. If cloud-private subnets are connected to that then their outbound traffic won't route via the NAT device. If we change the default to point to the NAT box, then how does the NAT box itself send traffic out to the internet?

Ultimately two separate routing instances would be required to support it. One which carries the traffic from Cloud hosts to the NAT box, and one which carries traffic from the NAT box to the CR routers, after it's been NAT'd. The only way to overcome that requirement would be to use L2 segments that cloud switches do not participate in. But we do not wish to have Vlans trunked across multiple switches and potentially create loops.
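
The routing problem described above can be illustrated with a toy model in which each routing instance has a single default route and the NAT box hands traffic over to another instance. The VRF and device names are hypothetical, and the model ignores everything except default routes.

```python
def egress_path(vrfs, handoff, start_vrf):
    """Follow default routes from VRF to VRF, flagging a loop if one appears."""
    path, vrf, seen = [], start_vrf, set()
    while vrf is not None:
        if vrf in seen:
            path.append("LOOP")
            break
        seen.add(vrf)
        hop = vrfs[vrf]["default"]
        path.append(hop)
        # A device such as the NAT box re-injects traffic into another VRF.
        vrf = handoff.get(hop)
    return path


# Single VRF whose default points at the NAT box: the NAT box has no way out.
single = {"vrf-cloud": {"default": "nat-box"}}
print(egress_path(single, {"nat-box": "vrf-cloud"}, "vrf-cloud"))

# Two VRFs: hosts reach the NAT box in one, the NAT box reaches the CRs in the other.
dual = {
    "vrf-inside": {"default": "nat-box"},
    "vrf-outside": {"default": "cr-router"},
}
print(egress_path(dual, {"nat-box": "vrf-outside"}, "vrf-inside"))
```

With one routing instance, the NAT box's own outbound traffic follows the same default route back to itself; with two, the post-NAT traffic has a distinct default towards the CR routers.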

NOTE: Netops are of the opinion that if only a small number of systems (i.e. authdns) are identified as needing direct outbound (non-HTTP) internet access, this complexity is not worth adding. In that case, connectivity for those systems is better provided using the BGP mechanism (similar to cloud-lb) or a direct cloud-public vlan, keeping a single VRF as before.

The 2 VRF design could be configured as follows if we decide that we do need it.

Dual vrf.png


NOTES:

  • CloudLB, CloudDNS and other devices originating /32 service IPs should also connect to both Vlans/VRFs
    • One connects them to the internet externally, where they have their default route
    • One connects them to the cloud-private subnets on the inside

Proof of concept build-out

We will create a POC to validate the whole setup.

Potentially available servers

As of this writing, the available hardware servers are:


Previous work

Some references to related work that has been done previously: