Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh

This proposal was accepted. See Portal:Cloud_VPS/Admin/Neutron.

This page describes a joint project between WMCS and SRE to reduce technical debt as well as improve reliability and security of both the CloudVPS and production realms.

Summary

The existing CloudVPS network setup relies on a heavily customized version of the Neutron OpenStack component that emulates a flat network topology. The complexity of the customization makes it difficult to introduce better security and isolation practices. This negatively affects some of the core use cases of CloudVPS, such as how Toolforge users interact with production services (wikis, APIs, dumps, wiki-replicas, etc). Lastly, the way it grew made it intertwined with the production realm, putting production security at risk while introducing technical debt.

This project contains a proposal for a new architecture that will reduce this technical debt, improve high availability, address networking separations concerns, and put us in a better position to move forward and evolve cloud services further.

Specifically, it is intended to address concerns raised in several Phabricator tasks.

Goals

  • A new architecture that supports future growth and expansion of WMCS without introducing further technical debt
  • Stop using Neutron as the CloudVPS edge router
  • Simplify the Neutron setup by offloading functionality onto cloudgw and eliminating existing custom code
  • Manage perimeter firewalling policies outside of the production core routers
  • Create better isolation between the production and WMCS realms (and remove related technical debt)
  • Remove the physical and logical dependency on single DC rows (for example eqiad row B, asw2-b-eqiad)
  • High availability of the WMCS L2 network layer by using individual control planes
  • Unblock L3 networking across production and CloudVPS realms (and related high-availability: BGP routing)

Background

This section explains the rationale and context for the proposal.

Starting point: current edge network setup

From a network point of view, the WMF production network can be viewed as the upstream (or ISP) connection for the CloudVPS service. The Neutron virtual router is defined as the gateway between the virtual network (the CloudVPS virtual network) and the upstream connection. Historically, there has been no other L3 device between Neutron and the production core router. Lacking other options, the core router also acts as a firewall for the CloudVPS virtual network.

The production core routers use static routing to support this setup, and past BGP experiments have shown the limits of Neutron as an edge router; the setup therefore remains limited to static routing only.

The virtual machines inside the CloudVPS virtual network use private addressing in the 172.16.0.0/21 range. When virtual machines contact the outside internet, a NAT in the Neutron virtual router SNATs the traffic using a public IPv4 address (for example nat.openstack.eqiad1.wikimediacloud.org). This address is referred to as the routing_source_ip.

A feature called floating IP associates a VM instance with a public IPv4 address. This floating IP address is then used for all ingress and egress traffic of that VM instance.

VMs require direct access to some WMF production internal services without NAT being involved, which allows those services to identify the particular VM instance accessing them. This NAT exclusion is defined as dmz_cidr. Currently some services, such as NFS, rely on this setup to have proper control over how VMs use the service.
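
To illustrate how routing_source_ip and dmz_cidr interact for a VM without a floating IP, here is a minimal sketch of the egress NAT decision. The dmz_cidr range and the routing_source_ip address below are placeholders, not the actual production values (floating IPs are covered later on this page):

  import ipaddress

  # Placeholder values for illustration only; the real dmz_cidr list and
  # routing_source_ip live in the (customized) Neutron configuration.
  CLOUD_INSTANCE_RANGE = ipaddress.ip_network("172.16.0.0/21")
  DMZ_CIDR = [ipaddress.ip_network("10.0.0.0/8")]            # placeholder prod range excluded from NAT
  ROUTING_SOURCE_IP = ipaddress.ip_address("192.0.2.1")      # placeholder public IPv4

  def egress_source(vm_ip: str, dst_ip: str) -> str:
      """Source address a production service currently sees for a flow from a VM."""
      src = ipaddress.ip_address(vm_ip)
      dst = ipaddress.ip_address(dst_ip)
      if src not in CLOUD_INSTANCE_RANGE:
          return str(src)                                    # not a CloudVPS instance address
      if any(dst in net for net in DMZ_CIDR):
          return str(src)                                    # dmz_cidr exclusion: no NAT applied
      return str(ROUTING_SOURCE_IP)                          # default: SNAT to routing_source_ip

  print(egress_source("172.16.1.10", "10.64.37.20"))   # dmz_cidr match  -> 172.16.1.10
  print(egress_source("172.16.1.10", "198.51.100.7"))  # internet egress -> 192.0.2.1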

The Neutron virtual router is implemented as a Linux network namespace on the cloudnet servers. The different network namespaces are dynamically managed by Neutron, which is configured and operated using the OpenStack networking API and CLI utilities. All the routing, NAT, and firewalling done by Neutron uses standard Linux components: the network stack, the Netfilter engine, keepalived for high availability, etc.
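
For example, the router namespace can be inspected directly on a cloudnet server with standard iproute2 tooling. This is a minimal sketch, assuming root shell access on the host; Neutron names its router namespaces qrouter-<router-uuid>:

  import subprocess

  # List network namespaces and keep the Neutron router ones (qrouter-<uuid>).
  out = subprocess.run(["ip", "netns", "list"], capture_output=True, text=True, check=True)
  qrouters = [line.split()[0] for line in out.stdout.splitlines() if line.startswith("qrouter-")]
  print("router namespaces:", qrouters)

  # Show the routing table and the Netfilter NAT rules inside the first router namespace.
  if qrouters:
      ns = qrouters[0]
      subprocess.run(["ip", "netns", "exec", ns, "ip", "route"], check=True)
      subprocess.run(["ip", "netns", "exec", ns, "iptables", "-t", "nat", "-S"], check=True)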

The current CloudVPS network setup is extensively described in the Neutron page. This includes documentation about both the edge/backbone network and the internal software defined network.

Technical debt

The Neutron virtual router is the edge router for the CloudVPS service internal virtual network. This is against upstream OpenStack design recommendations. Moreover, this has proven challenging for proper CloudVPS network administration, as Neutron lacks configuration flexibility as a general purpose network gateway.

In order to support the current setup, Neutron has been customized to maintain compatibility with the old nova-network OpenStack component. That was necessary during the nova-network to Neutron migration done years ago, but it is no longer required. Code customization is a pain point when upgrading OpenStack versions: every upgrade requires rebasing all patches and re-testing that everything works as expected, adding unnecessary complexity to our operations.

In an ideal model, Neutron would be used for its expected use case: enabling software defined networking (SDN) inside the cloud. Neutron wasn't designed to act as the edge router for an OpenStack-based public cloud. Offloading some of the current Neutron virtual router responsibilities to an external server reduces this technical debt and eases the maintenance burden for CloudVPS.

Further separate CloudVPS/prod networks

Currently, CloudVPS internal IP addresses reach the production network without NAT being involved, which is a security hazard for the production realm, goes against best practices, and creates technical debt. This relies on the dmz_cidr mechanism explained above. Some revisions are already needed (e.g. Phabricator T209011 - Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis). Neutron controls this NAT and, as described above, this cannot be easily changed. Decoupling the edge NAT/firewalling setup from Neutron would grant this flexibility.

Neutron is not designed to work as an arbitrary edge network router or firewall. Currently, all edge network firewalling for the CloudVPS service network is performed by the production core routers. The production core routers also aren't designed to be this kind of firewall. For these reasons, managing the firewalling policy for CloudVPS has been a challenge. CloudVPS should have distinct edge network firewalling.

Prepare for the future

While out of scope for this proposal, these changes intend to set the foundation for future work. See Out of Scope and Future Work.

Other considerations

  • In the current architecture, with Neutron acting as the edge router for the CloudVPS virtual network, the Neutron API and other OpenStack abstractions must be used to manage what would otherwise be a very simple setup. This can be simplified and managed directly without any OpenStack abstractions (including Neutron).
  • Since the Neutron configuration lives in a MySQL database with no external RO/RW access, and access to the Neutron API itself is strictly restricted, external contributors are unable to learn about or contribute to the networking setup. This is an exception to an otherwise transparent and public infrastructure. Moving the edge routing configuration out of Neutron allows the use of git-based workflows, akin to the current Puppet model, which is a much friendlier way of welcoming and engaging technical contributors.
  • Closely tracking and following upstream communities and projects reduces overall maintenance burden. Removing the customized networking setup is a step in this direction.
  • In FY19/20 Q4, new dedicated switches were procured for the WMCS team. These switches, called cloudsw1-c8-eqiad and cloudsw1-d5-eqiad, are already racked and connected. We refer to them generically as cloudsw. They can be leveraged to improve CloudVPS edge routing by introducing advanced L3 setups based on OSPF and BGP, and to improve our general L2 VLAN management and setup.

Out of Scope

The following is explicitly out of scope for this proposal.

  • Reworking storage (NFS, Ceph, etc), databases (Wiki replicas, etc) or other supporting services (such as LDAP).
  • Changing the consumption of production services by cloud users (like wikis, APIs, WDQS, dumps, etc).
  • While the proposed changes create greater network separation and autonomy, they do not address the larger issues of division of responsibility among WMCS, SRE, and other teams.
  • This project doesn't try to address datacenter design concerns for a public cloud such as CloudVPS.

The following items are intended as future work and are therefore also out of scope:

  • Stop CloudVPS internal IPs from reaching prod networks (wiki APIs, wiki-replicas, etc)
  • Eventual rework/relocation of WMCS supporting services "closer" to the cloud (NFS, Ceph, wiki-replicas, etc)
  • Introduce tenant networks in CloudVPS (user defined routers and networks)
  • Introduce IPv6

Objectives

A. Remove physical and logical dependency on eqiad row B (asw2-b-eqiad)

Because of historical and technical limitations in OpenStack, CloudVPS hosts are racked explicitly in eqiad row B.

The current eqiad HA design is done per row, meaning production services aim to be equally balanced between the rows in the datacenter.

Since its initial deployment, WMCS has grown significantly, competing with production services for rack space and 10G switch ports in eqiad row B.

In addition, bandwidth requirements and traffic flows are different between WMCS and the production infrastructure. This brings a risk of WMCS saturating the underlying switch and router infrastructure, impacting the production environment.

Providing dedicated L2 switches and progressively moving WMCS servers to them (as they get refreshed) will eliminate those issues.

B. Standardize production<->WMCS physical and logical interconnect

The way WMCS grew within the production realm is a snowflake setup compared to industry best practices for separating logically distinct networks. This snowflakiness and the lack of a clear boundary bring a few issues:

  • It increases the complexity of managing the core network, introducing technical debt
  • It increases the security risk as VMs in an untrusted domain could gain access to critical infrastructure
  • It prevents having good traffic flow visibility for traffic engineering and analysis

Configuring the above-mentioned switches to additionally act as L3 gateways will help solve this issue in the short term, while providing the tools (e.g. flow visibility) to fully address it in the long term.

C. High availability of the WMCS network layer

As mentioned in (A), all the WMCS servers are hosted on the same virtual switch, which means a maintenance or an outage takes all WMCS hosts offline.

Using multiple dedicated WMCS switches sharing the same L2 domain but using individual control planes will ease maintenance and limit the blast radius of an outage.

D. Provide groundwork for WMCS growth and infrastructure independence

Due to (A) and (B), all changes to the WMCS foundations (network or other) have been challenging as they either require tight synchronization between several teams, or could cause unplanned issues due to unexpected dependencies.

Clearly defining the WMCS realm (with dedicated L2 and L3 domains) will significantly ease future changes (e.g. new VLANs, ACLs, experimentation) without risking an impact on the production realm.

E. Reduce Neutron technical debt

Drop the Neutron code customization, offloading the patched functions to a different device. This should allow for easier management and future improvements, including rethinking and reworking the NAT entirely (how CloudVPS users contact prod wikis, APIs, storage, etc).

The non-standard NAT/firewalling capabilities currently implemented by Neutron will be relocated to a new device.

Future Work

Building upon the work outlined in this proposal, the diagram below shows an estimation of future stages and the expected steps to take. Note that the timeframe and stages are a rough estimation of potential next steps.

The first phase (1) is this proposal to rework the edge network. Completing it unlocks phase (2), in which managing and upgrading OpenStack becomes easier for the WMCS team, freeing up WMCS resources for future work.

Looking further ahead to FY21/22, phase (3) involves two major projects: reworking how VMs and Toolforge tools contact the wikis (and other production services), and rethinking how we do storage networking.

Phase (4) involves evaluating the introduction of Neutron tenant networks and IPv6. This was previously evaluated in FY19/20 Q3; see Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh.

The final phase (5) allows for full isolation of CloudVPS and production network traffic. At this point, no CloudVPS private traffic should cross the production network.

This end result should address all known concerns, including the Phabricator tasks mentioned above.

Implementation: Proposed solutions

Three options have been considered; option 3 is the preferred one.

Option 1: use only cloudgw

This option was discarded and is only kept here for reference.

Option 2: use only cloudsw

This option was discarded and is only kept here for reference.

Option 3: use both cloudgw and cloudsw

The proposed solution comes in two independent parts:

  • One is to introduce cloudgw servers as L3 gateways for the CloudVPS network, relocating the edge NAT/firewalling functionality into these new servers.

For the cloudgw servers we will use standard Puppet management, Netfilter for NAT/firewalling, Prometheus metrics, Icinga monitoring, and redundancy (HA) using standard mechanisms such as keepalived, corosync/pacemaker, or similar.

  • The other is to introduce cloudsw at both L2 (all servers will be connected to them) and L3 (all server traffic will be routed through those switches).

This proposal includes a timeline with a detailed plan covering the different operational stages.

option 3 pros & cons

Pros:

  • This option combines all the pros from option 1 and option 2.
  • Each team has expertise managing their own devices: the WMCS team a Linux box and the SRE team a Juniper network appliance.
  • A dedicated device for each set of functions: NAT/firewalling in cloudgw and BGP/OSPF in cloudsw devices.
  • Faster partial turnaround (for the prod/WMCS separation)

Cons:

  • Likewise, this option combines some of the cons from option 1 and option 2. For example, we don't have all the hardware in all the datacenters.
  • More devices to operate and maintain: a dedicated device for each set of functions (NAT/firewalling in cloudgw, BGP/OSPF in cloudsw).

option 3 timeline

The projected timeline is as follows:

  • stage 0: starting point, current network setup.
  • stage 1: validate cloudgw changes in codfw.
  • stage 2: enable L3 routing on cloudsw nodes. BGP between cloudsw and core routers. There is static routing between cloudsw and neutron (cloudnet servers).
  • stage 3: introduce the cloudgw L3 nodes, doing NAT/firewalling. There is static routing between cloudsw <-> cloudgw and between cloudgw <-> neutron (cloudnet servers). See the path sketch after this list.
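
Purely as an illustration, the sketch below summarizes how the L3 egress path for VM traffic changes across these stages (stage 1 is omitted because it only covers validation in codfw):

  # Illustrative hop-by-hop L3 egress path for VM traffic at each stage.
  STAGES = {
      "stage 0": ["VM", "neutron virtual router (cloudnet)", "core router"],
      "stage 2": ["VM", "neutron virtual router (cloudnet)", "cloudsw (L3)", "core router"],
      "stage 3": ["VM", "neutron virtual router (cloudnet)", "cloudgw (NAT/firewall)", "cloudsw (L3)", "core router"],
  }
  for stage, hops in STAGES.items():
      print(f"{stage}: " + " -> ".join(hops))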

Once all stages are completed, we can move on to evaluate future work related to how CloudVPS users contact prod services (wikis, APIs, dumps, LDAP, etc) and storage (NFS, Ceph, etc).

option 3 understanding the NAT


From the standpoint of a virtual machine instance running inside CloudVPS, there are two potential NATs that can be applied:

The Neutron floating IP is used to assign a public IPv4 address to a particular VM instance. Floating IPs are software defined, and Neutron implements them by creating a DNAT/SNAT nftables rule. All traffic between the VM and the outside of the virtual network uses the assigned public IPv4 address.

The primary use case for the floating IP is for VM instances to be directly reachable from the Internet to offer services other than standard HTTP/HTTPS, or any other use case not covered by our shared services (like the central proxies). Each CloudVPS project requires quota to use them. As of September 2020, there are 53 floating IPs in use out of 756 running VM instances.
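
For reference, figures like the ones above can be gathered with the OpenStack SDK. This is a minimal sketch under the assumption that suitable admin credentials are configured in clouds.yaml; the cloud name used here is a placeholder:

  import openstack

  # "eqiad1" is a placeholder clouds.yaml entry name, not necessarily the real one.
  conn = openstack.connect(cloud="eqiad1")

  floating_ips = list(conn.network.ips())                       # floating IPs visible to this account
  in_use = [ip for ip in floating_ips if ip.fixed_ip_address]   # currently associated with a VM port
  servers = list(conn.compute.servers(all_projects=True))       # VM instances across projects (admin only)

  print(f"floating IPs allocated: {len(floating_ips)}, in use: {len(in_use)}")
  print(f"VM instances: {len(servers)}")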

The routing_source_ip and the dmz_cidr are the two components relocating to the cloudgw servers. Functionally they will remain the same: routing_source_ip will SNAT every connection originating from a VM instance inside CloudVPS that is not excluded by dmz_cidr. However, they will run in the new location and be controlled by Puppet instead of being hardcoded in the Neutron source code.

Currently, the dmz_cidr exclusion overrides the floating IP NAT: a connection between a VM and a dmz_cidr destination is excluded from NAT even if the VM has a floating IP. After this project is implemented, traffic for VMs using a floating IP (already NATed by Neutron) will no longer be affected by either routing_source_ip or dmz_cidr. This means a VM using a floating IP will be seen outside the virtual network exclusively via its public IPv4 address, including by services in the production realm (wikis, storage, etc).
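
To make the new precedence concrete, here is a minimal sketch of the intended post-change behaviour, as seen from a production service. Addresses and ranges are placeholders; the actual policy will be expressed in the Puppet-managed Netfilter rules on cloudgw:

  import ipaddress
  from typing import Optional

  DMZ_CIDR = [ipaddress.ip_network("10.0.0.0/8")]          # placeholder prod ranges excluded from NAT
  ROUTING_SOURCE_IP = ipaddress.ip_address("192.0.2.1")    # placeholder routing_source_ip

  def source_seen_by_prod(vm_ip: str, dst_ip: str, floating_ip: Optional[str]) -> str:
      """Source address a production service sees after the cloudgw NAT relocation."""
      # Floating IP NAT is applied by Neutron before traffic reaches cloudgw, so
      # cloudgw sees the public address and applies no further NAT to it.
      if floating_ip is not None:
          return floating_ip
      # Otherwise cloudgw applies the relocated routing_source_ip/dmz_cidr logic.
      if any(ipaddress.ip_address(dst_ip) in net for net in DMZ_CIDR):
          return vm_ip                       # NAT exclusion: prod sees the private VM address
      return str(ROUTING_SOURCE_IP)          # default SNAT

  # Before the change a dmz_cidr match would override the floating IP and expose the
  # private address; after the change the floating IP always wins:
  print(source_seen_by_prod("172.16.1.10", "10.64.37.20", floating_ip="192.0.2.77"))  # -> 192.0.2.77
  print(source_seen_by_prod("172.16.1.10", "10.64.37.20", floating_ip=None))          # -> 172.16.1.10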

option 3 implementation details

See subpage: implementation details

Meeting Notes

Additional notes

Some things that can be further evaluated.

See also

Other useful information: