User:Nskaggs/draft-cloudgw


This page describes the cloudgw project, an effort to refresh the CloudVPS edge network by offloading functionality from the Neutron virtual router to a dedicated L3 Linux box, and by offloading some functionality from the prod core routers to the cloudsw physical network switches.

Executive Summary

<insert narrative here>

Reference the following tickets or similar

  • https://phabricator.wikimedia.org/T122406 (Consider renumbering Labs to separate address spaces)
  • https://phabricator.wikimedia.org/T174596 (dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour)
  • https://phabricator.wikimedia.org/T209011 (Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis)
  • https://phabricator.wikimedia.org/T207536 (Move various support services for Cloud VPS currently in prod into their own instances)

Goals

  • We want a new architecture that supports future growth and expansion of WMCS without introducing further technical debt.
  • We want to stop using Neutron as the CloudVPS edge router, which goes against upstream recommendations.
  • We want to eliminate our custom Neutron code.
  • We want to simplify the Neutron setup by offloading functionality into cloudgw.
  • We want to manage our own perimeter firewalling policies in cloudgw instead of in the core routers.
  • We want better isolation between the prod and WMCS realms, specifically in the management of L2 VLANs and L3 routing.
  • We want to remove the physical and logical dependency on single DC rows (for example eqiad row B, asw2-b-eqiad).
  • We want high availability of the WMCS L2 network layer by using individual control planes.

Background

starting point: current edge network setup

From a network point of view, we can understand the WMF prod network as the upstream (or ISP) connection for the CloudVPS service. Our Neutron virtual router acts as the gateway between the CloudVPS virtual network and this upstream connection. There is no other L3 device between Neutron and the prod core router. Given we don't have any device with proper firewalling capabilities in the network, the core router also acts as the firewall for the CloudVPS virtual network.

The prod core routers use static routing to support this setup; past BGP experiments showed the limits of Neutron as a true edge router, so we are limited to static routing only.

The virtual machines inside the CloudVPS virtual network use private addressing in the 172.16.0.0/21 range. When virtual machines contact the outside internet, a NAT mechanism in the Neutron virtual router SNATs the traffic using a public IPv4 address (for example nat.openstack.eqiad1.wikimediacloud.org). In our setup, we refer to this address as the routing_source_ip. We also have a feature called floating IP, which associates a VM instance with a public IPv4 address; that floating IP is then used for all ingress/egress traffic of the VM instance.
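To make the NAT behaviour concrete, here is a simplified sketch (not the literal rules Neutron generates) of what the virtual router effectively does, expressed as iptables commands. The public addresses are RFC 5737 documentation placeholders and the VM address is hypothetical:

  # A floating IP is a 1:1 NAT between a public address and a single VM instance
  iptables -t nat -A PREROUTING  -d 192.0.2.10  -j DNAT --to-destination 172.16.0.42
  iptables -t nat -A POSTROUTING -s 172.16.0.42 -j SNAT --to-source 192.0.2.10

  # Everything else leaving the virtual network is SNATed to the routing_source_ip
  iptables -t nat -A POSTROUTING -s 172.16.0.0/21 -j SNAT --to-source 192.0.2.1

Note that the more specific floating IP rules must come before the catch-all SNAT rule, otherwise the catch-all would match first.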

Traditionally, we have needed VMs to contact some WMF prod internal services directly, without NAT being involved, so that the WMF services can identify the particular VM instance using them. We implement this NAT exclusion by means of a mechanism called dmz_cidr. Currently, some of our services, like NFS, rely on this setup to have proper control over VM usage of the service.
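Conceptually, dmz_cidr is just a NAT exemption evaluated before the SNAT rule sketched above. In the same illustrative iptables notation (198.51.100.0/24 stands in for one of the configured prod ranges, it is not a real value):

  # Traffic towards a dmz_cidr range skips NAT, so prod services see the real VM address
  iptables -t nat -I POSTROUTING -s 172.16.0.0/21 -d 198.51.100.0/24 -j ACCEPT

In the nat table, ACCEPT simply means the connection is not translated.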

The Neutron virtual router is implemented by means of a Linux network namespace on the cloudnet servers. The different netns are dynamically managed by Neutron, which in turn is configured and operated using the OpenStack networking API and CLI utilities. All the routing, NAT and firewalling done by Neutron uses standard Linux components: the network stack, the netfilter engine, keepalived for high availability, etc.
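For example (illustrative commands only, with <uuid> and <router> as placeholders), the same router can be inspected both at the Linux level on a cloudnet host and through the OpenStack abstraction:

  # On the cloudnet host: the netns that implements the virtual router
  sudo ip netns list
  sudo ip netns exec qrouter-<uuid> ip route

  # Through the OpenStack API/CLI: the supported way to operate it
  openstack router list
  openstack router show <router>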

The current CloudVPS network setup is extensively described in the Neutron page. This includes documentation about both the edge/backbone network and the internal software defined network.

technical debt

We are currently using the Neutron virtual router as the edge router for the CloudVPS internal virtual network. This is against upstream OpenStack design recommendations, as can be seen in some docs. Moreover, it has proven challenging for proper CloudVPS network administration, given we don't have enough configuration flexibility in Neutron to manage the virtual router as a general-purpose network gateway.

Additionally, for the current setup to work, we carry custom code in Neutron. This customization was introduced to make Neutron behave the way the old nova-network OpenStack component behaved. That was a requirement during the nova-network to Neutron migration done years ago, but it is no longer required. The custom code is a pain point when upgrading OpenStack versions, given we have to rebase all the patches and test that everything works as expected, adding unnecessary complexity to our operations.

In an ideal model, Neutron would just do what it was designed for, which is enabling software-defined networking (SDN) inside the cloud. Neutron wasn't designed to act as the edge router for an OpenStack-based public cloud. So, if we offload/decouple some of the current Neutron virtual router responsibilities to an external server, we would effectively reduce technical debt from the CloudVPS service point of view.

further separate CloudVPS/prod networks

The current setup has some flaws when it comes to CloudVPS/prod separation.

Currently, CloudVPS internal IP addresses reach the production network without NAT being involved, by means of a mechanism called dmz_cidr which is part of our custom Neutron code. There are some long overdue revisions to this, as can be seen for example in phabricator T209011 - Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis. Neutron controls this NAT, and as described above, it cannot be easily changed with enough flexibility. Therefore, the first step would be to address this lack of flexibility. Again, offloading/decoupling the edge NAT/firewalling setup from Neutron feels like the right move.

Neutron is not designed to work as an arbitrary edge network router or firewall. Currently, all edge network firewalling for the CloudVPS service network is implemented by means of the prod core routers, which aren't designed to be this kind of firewall either. Managing the firewalling policy for CloudVPS in the prod core routers has traditionally been a challenge for us. In an ideal separation from the prod realm, CloudVPS would have its own edge network firewalling.

Enabling more network separation between prod and the CloudVPS network could help us eventually relocate some important services, like storage (NFS, Ceph) and others, into CloudVPS's own network.

prepare for the future

We feel that investing proper engineering time in the network architecture of the cloud realm is long overdue. We need to invest that time to prepare the ground for the future of our public cloud. This includes introducing technologies we don't currently use, like IPv6 and BGP, and brand new cloud features like Neutron tenant networks.

It is widely accepted that the future of the CloudVPS service is complete separation from the prod architecture, and therefore any engineering time spent moving in that direction is more than welcome.

other considerations

In the current architecture, with Neutron acting as the edge router for the CloudVPS virtual network, we are forced to use the Neutron API and other OpenStack abstractions (the very OpenStack design can be seen as an abstraction itself) in order to manage what would otherwise be a very simple setup. We would rather use standard Linux utilities to directly manage certain components, like routing, addressing, NAT/firewalling, etc.

Since all the Neutron configuration lives in a MySQL database with no external RO/RW access, and the Neutron API itself is strictly restricted, external contributors have limited opportunities to learn about and contribute to our setup. This is something we would like to improve. Using git operations, as in our current Puppet model or similar, is a much friendlier way of welcoming and engaging technical contributors.

We also identified the need to follow more closely what upstream Linux communities and projects are doing, instead of cooking up our own solutions. Our custom Neutron code is just one clear example of not following upstream patterns.

In FY19/20 Q4, new dedicated switches were procured for the WMCS team. These switches, called cloudsw1-c8-eqiad and cloudsw1-d5-eqiad, are already racked and connected; we refer to them generically as cloudsw. We can leverage these dedicated switches to improve our edge routing by introducing advanced L3 setups based on OSPF and BGP, and to improve our general L2 VLAN management and setup. This, again, is the right move on the long road toward an eventual full separation from the prod network.

Objectives

A. Remove physical and logical dependency on eqiad row B (asw2-b-eqiad)

Because of historical reasons and technical limitations in OpenStack, WMCS only grew in eqiad row B.

Our current eqiad HA design is done per row, which means production (core) services aim to be equally balanced across the 4 rows we have in the datacenter.

Since its initial deployment, WMCS has grown significantly, competing for rack space and, more importantly, 10G switch ports in that row.

In addition, bandwidth requirements and traffic flows are different between WMCS and the production infrastructure, which brings a risk of WMCS saturating the underlying switch and router infrastructure, impacting the production environment.

Providing dedicated L2 switches and progressively moving WMCS servers to them (as they get refreshed) will eliminate those issues.

B. Standardize production<->WMCS physical and logical interconnect

The way WMCS grew within the production realm is a snowflake compared to the industry best practice of having a PE/CE interconnect between functionally different infrastructure. This "snowflakiness" and lack of a clear boundary bring a few issues:

  • It increases the complexity of managing the core network, introducing technical debt
  • It increases the security risk as VMs in an untrusted domain could gain access to critical infrastructure
  • It prevents having good traffic flow visibility for traffic engineering and analysis

Configuring the above-mentioned switches to additionally act as L3 gateways will help solve this issue, while providing the tools (e.g. flow visibility) to fully address it in the long term.

C. High availability of the WMCS network layer

As mentioned in (A), all the WMCS servers are hosted on the same virtual switch, which means any maintenance or outage takes all WMCS hosts offline.

Using multiple dedicated WMCS switches sharing the same L2 domain but using individual control planes will ease maintenance and limit the blast radius of an outage.

D. Provide groundwork for WMCS growth and infrastructure independence

Due to (A) and (B), all changes to the WMCS foundations (network or other) have been challenging as they either require tight synchronization between several teams, or could cause unplanned issues due to unexpected dependencies.

Clearly defining the WMCS realm (with dedicated L2 and L3 domains) will significantly ease future changes (e.g. new VLANs, ACLs, experimentation) without risking impact to the production realm.

Proposed solutions

Use two L3 devices

There are 2 new devices in this architecture: cloudgw and cloudsw

  • cloudsw is managed by the SRE team. This device manages edge BGP and OSPF.
  • cloudgw is managed by the WMCS team. This device implements firewalling and NAT.

The proposed solution comes in 2 independent parts:

  • One is to introduce 2 Linux boxes acting as L3 gateways for the CloudVPS network. We refer to these servers as cloudgw. We relocate the edge NAT/firewalling functionality into these new servers.

For the cloudgw servers we will use standard Puppet management, netfilter for NAT/firewalling, Prometheus metrics, Icinga monitoring, and redundancy (HA) using standard mechanisms like keepalived, corosync/pacemaker, or similar.
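As a rough illustration of the HA piece, assuming keepalived ends up being the chosen mechanism, the two cloudgw nodes could share the gateway address via VRRP. All values below (interface name, router id, VIP) are placeholders, not the real configuration:

  # /etc/keepalived/keepalived.conf (illustrative sketch only)
  vrrp_instance cloudgw {
      state BACKUP             # both nodes start as BACKUP; priority elects the master
      interface eno1           # placeholder: interface facing the cloud network
      virtual_router_id 51
      priority 100
      advert_int 1
      virtual_ipaddress {
          192.0.2.254/24       # placeholder gateway VIP used by CloudVPS traffic
      }
  }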

  • The other is to introduce two switches, named cloudsw, dedicated to the WMCS realm at both L2 (all servers will be connected to them) and L3 (all server traffic will be routed through those switches).

Why use a Linux gateway and not dedicated network hardware?

  • The cloudgw setup would be very simple, especially compared with what Neutron does.
  • We contribute/integrate a bit more with the Linux upstream communities and projects.
  • Standard puppet workflow is a plus.
  • Leverage the already racked cloudsw devices for advanced dynamic routing.
  • The community can "see" the configuration and help maintain it.
  • The flexibility of a full linux shell is interesting: scripts, debugging, tooling, etc.
  • The price of the hardware is not too high and is not a limiting factor: small misc commodity boxes with 10G NICs.
  • Better integration with many other external stuff, like prometheus, backups, etc.
  • Having a Linux box act as a gateway is not very complex in general.
  • Introducing HA and redundancy support is not complex either (plenty of options: keepalived, corosync/pacemaker, etc).
  • The Linux networking and NAT engines are industry standards.
  • Basically, we will be offloading some of the functions that Neutron (Linux) already does to a dedicated box. The need for specific network hardware is literally zero; a Linux box will suffice.
  • This all may better engage external contributors.
  • It is pretty common in corporate realms to separate routing/switching/firewalling into different components.
  • The cloudgw is a brand new piece of infra. This should not scare us. We do that all the time, even with more complex technologies.

Use a single L3 device

There is 1 new device in this architecture: cloudsw

  • cloudsw is managed by the SRE team, with WMCS also being responsible for the firewall and NAT implementation. This device is a Juniper appliance.


<insert picture here>

Discuss high level narrative and pros / cons

Proposed timeline

These two independent parts will move forward in parallel and eventually be integrated together:

  • cloudgw in codfw
  • cloudsw in eqiad

This is due to not having the required equipment for staging cloudsw in codfw, as well as the urgency of solving some of the mentioned issues in eqiad, where they are present.

In any case, a rough timeline of the changes in eqiad is the following:

  • stage 0: Done. All new WMCS Ceph servers are connected to dedicated WMCS switches.
  • stage 1: Done. Route the cloud-hosts1-b-eqiad vlan through cloudsw.
  • stage 2: changes to L3 edge routing.
  • cloudgw
  • stage 2B: introduce the cloudgw L3 nodes. They don't have any NAT/firewalling enabled yet, but we introduce the required L3 routing changes to have traffic flowing through them.
  • stage 3: introduce basic NAT / firewalling capabilities into cloudgw servers. Relocate prod core router cloud firewalling to cloudgw.
  • stage 4: offload Neutron NAT / firewalling functions to cloudgw, specifically the dmz_cidr and routing_source_ip mechanisms (see the sketch after this list).
  • stage 5: review / rework the dmz_cidr and routing_source_ip mechanisms. Evaluate entirely dropping or narrowing down the NAT exclusion mechanism for contacting the prod network.
  • stage 6: evaluate reworking the L2/L3 setup for storage (NFS/Ceph).
  • stage 7: evaluate reworking the L2/L3 setup for the wiki replicas.
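For stages 3 and 4, the NAT/firewalling on cloudgw would be plain netfilter configuration. A minimal nftables sketch, assuming nftables is the chosen frontend and using placeholder addresses for the routing_source_ip (192.0.2.1) and a NAT-excluded prod range (198.51.100.0/24):

  table ip cloudgw_nat {
      chain postrouting {
          type nat hook postrouting priority 100;
          # dmz_cidr-like exclusion: traffic to listed prod ranges is not translated
          ip saddr 172.16.0.0/21 ip daddr 198.51.100.0/24 return
          # everything else leaving the cloud network is SNATed to the routing_source_ip
          ip saddr 172.16.0.0/21 snat to 192.0.2.1
      }
  }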

Implementation details

See implementation details

Out of Scope

  • While the proposed changes create greater network separation and autonomy, they do not address the larger issue of the division of responsibility among WMCS, SRE, and other teams.

Future Improvements

These improvements are intended as future work, but are not explicitly in scope for this proposal:

  • move in the right direction to eventually stop CloudVPS internal IPs from reaching prod networks (wiki APIs, wiki-replicas, etc)
  • eventual rework/relocation of WMCS supporting services "closer" to the cloud (NFS, Ceph, wiki-replicas, etc)
  • introduce tenant networks in CloudVPS (user-defined routers and networks)

Additional notes

TODO:

  • cloudgw needs a leg in the cloud-host subnet for puppet etc.
  • shall we consider racking space issues when planning the different stages?
  • collect intel on why/what uses the dmz_cidr NAT exclusion mechanism.
  • budgeting for hardware?

See also

Other useful information: