This proposal never got out of draft state. See Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh for it's follow up.

This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

This page contains an enhacement proposal for the CloudVPS service, specifically the network. Several goals are involved:

getting rid of some technical debt (neutron source code customizations for dmz_cidr)
enabling IPv6
enabling additional features to improve robustness and scalability of the service (multi-row networking, BGP, etc)
enable additional use cases inside CloudVPS (per-project self-services networks)

Working with openstack neutron is complex for several reasons. Mainly our current usage of openstack/neutron is apparently not very well defined upstream. And anyway openstack documentation is sometimes not clear which bits are for what model/use cases.

Constraints

Some context on the constraints we have.

VM instance identification vs NAT

One of the main constraints of our current model is that we want to preserve source address of VM instances running in CloudVPS when they contact WMF prod services (such as wikis and APIs) and also Cloud supporting services (such as NFS, LDAP, etc).

The general case for VMs running in CloudVPS is that all network traffic leaving the deployment heading to the internet will be translated to a single IPv4 address (called routing_source_ip). Traditionally, we've had many benefits of knowing exactly which VM instance is communicating with physical prod services, so there is a way to exclude VM traffic from this NAT. This is currently implemented by means of the dmz_cidr mechanism. This mechanism instruct neutron to don't do NAT for connections between VM instances and physical internal prod networks. Services running in such physical internal prod networks see the original source IP address of the VM instance (typically in the 172.16.0.0/16 range).

This mechanism, however, is not native to Openstack Neutron, and is something we added to the source code of neutron by means of patches. With every new openstack release we upgrade, the patches should be backported, which is a manual, tedious, hard to test and error-prone process. We decided this is considered technical debt and we want to get rid of this.

But we want a way to identify VM instances in the physical internal prod network.

Statements:

we would like to get rid of the NAT exclusion by means of our internal neutron customizations (dmz_cidr).
we would like to be able to uniquely identify VM instances in the physical internal prod network.

Datacenter row confinement

Traditionally, all of our CloudVPS hardware servers have been deployed in a single datacenter row (row B). This served mainly 2 purposes:

isolation: all cloud things are in a single row, a single physical prod network, with limited 'blast radious'.
topology: the CloudVPS neutron setup we use benefits from this single row, single physical prod network model. It made it relatively simple for us to implement and manage all the networking bits for CloudVPS.

There is a physical VLAN defined in row B (cloud-instances2-b-eqiad or vlan 1105) which is what our Neutron setup uses to provide network connectivity to VM instances. All VM instances have addressing from this subnet and have direct access to this VLAN.

However, we identified this confinement into a single row, single physical prod network has some consequences. First and foremost, racking space and physical capacity in a single row. Everyday we have less racking space left in row B and less capacity left in physical network switches (and maybe other physical facilities).

Also, in the past we had problems with availability/reliability, such as loosing a key component of the physical setup (a router) meaning severe downtime for the whole CloudVPS service.

Statements:

we would like a network model that allows us to cross datacenter row boundaries, meaning we could rack our CloudVPS servers in at least 2 different datacenter rows

Clear separation and no special trust

Traffic that reach services in the physical internal prod network from CloudVPS VM instances should be seen as coming from the internet, in the sense that no special trust is given to it. Network flows should reach perimetral core routers, and firewalling should be applied to them, among other measures.

Statement:

CloudVPS traffic is untrusted in the physical internal prod network.

Proposals

Some proposals.

Eliminate routing_source_ip address

This change was accepted by the WMCS team before we were reminded of the CIDR separation concern which we had lost memory of within the team. Then it was decided that the change was not desirable from the prod network point of view

Tracked in Phabricator
Task T247505

We are currently using a routing_source_ip address from the floating IP CIDR, usually the first address in the subnet.

All egress traffic of VMs not having a floating IP or not being affected by the dmz_cidr mechanism would use this address (known as nat.openstack.<deployment>.wikimediacloud.org). On the other hand, neutron would natively perform the NAT using the main router address (known as cloudinstances2b-gw.openstack.<deployment>.wikimediacloud.org).

I couldn't find a reason why we use the current setting rather than the native setting, so this proposal involves eliminating that setup.

Benefits

This change has some benefits:

Doing this change along with introducing BGP in the transport network (see below) could completely eliminate the need for any static routing setup in the core routers.
Eliminating an address from the setup means simplifying the setup a bit, which is a benefit itself.
Doing this change alone means we no longer need at least part of our custom neutron hacks (the routing_source_ip bits, we would still need the dmz_cidr mechanism though).
We free an IPv4 address that could be now used as a floating IP address by our users.
Could mean we have a smoother integration with the address scope mechanism. TODO: untested.

Drawbacks

Some drawbacks:

This change means instance traffic is seen as originating from a wikimedia production network address space, which can impacts our reputation levels on the internet.

Required changes

Some details on what would be changing:

ensure that different ACL, policies and routing bits are ready to see traffic as coming from a different address.
drop our custom neutron hacks for routing_source_ip. Restart neutron server and the l3-agents to ensure the new configuration is clean.
verify that the new setup actually works and be ready to rollback if required.
Cleanup, increase the floating IP allocating pool.
Cleanup, drop the nat.openstack.<deployment>.wikimediacloud.org FQDN

Introduce IPv6 as a replacement of dmz_cidr

The proposal is to introduce dual stack IPv4/IPv6 networking inside CloudVPS.

This would require several things to be done as well:

design and introduce an IPv6 addressing plan for CloudVPS.
introduce backbone/transport IPv6 support in the link between the prod core routers and our neutron virtual routers.
update the DNS setup (generate AAAA and PTR records for the new IPv6 addresses).
review for any other changes inside openstack and/or neutron required to support IPv6
review and update our cloud supporting services (NFS, LDAP, etc) to promote IPv6 as the preferred networking mechanism to communicate with VM instances.

Introducing IPv6 could have several benefits at the same time:

native VM instance identification mechanism, with no dependency on our NAT-based setup.
if promoting IPv6 to the prefered networking protocol inside CloudVPS, this would allow us to re-think our IPv4 NAT-based setup and reduce it to the bare minimum.
modern networking technology for our users, a step in the rigth direction from the technological point of view.

If we have a way to identify VM instances in the physical internal prod network by means of the IPv6 address, we no longer need our internal neutron customizations (dmz_cidr). Basically, we could leverage IPv6 to address the 2 constraint statements defined above.

Openstack

The upstream Openstack Rocky documents contain valuable information on how to handle IPv6. There are basically 3 ways of implementing IPv6 support: SLAAC, DHCPv6-stateless and DHCPv6-stateful, with probably SLAAC being the option to go with.

Toolforge kubernetes

One of the main challenges of this proposal is to get kubernetes to work with IPv6, specifically the Toolforge Kubernetes cluster.

According to the kubernetes upstream documentation we need at least v1.16 to run kubernetes in dual stack IPv4/IPv6 mode (by the time of this writting we are using v1.15.6).

Some additional arguments are required for a bunch of kubernetes components, like the apiserver and the controller manager. Some of them can be specified when bootstrapping the cluster with kubeadm.

Another change required is to get kube-proxy running in ipvs mode (by the time of this writting we are using the iptables mode).

The documentation also notes that Service objects can be either IPv4 or IPv6 but not both at the same time. This means the webservice mechanism would need to create 2 services per tool, setting the .spec.ipFamily field accordingly in the definition (but IPv6 can be set as default).

Per the upstream docs, no special configuration is required to get nginx-ingress working in IPv6. No special changes (other than enable IPv6) should be required in either the tools front proxy or the kubernetes haproxy.

The calico docs for IPv6 contain detailed information on how to enable IPv6 for calico, which seems straigth forward. No special changes seems to be required to get coredns working on IPv6.

Summary:

kubernetes v1.16 is required.
kube-proxy ipv6 mode is required.
activate kubeadm, apiserver, controller manager, etc IPv6 support.
activate calico IPv6 support.
enable IPv6 in webservice-created Service objects.

Timeline

Proposed timeline:

2020-xx-xx: design the IPv6 addressing plan for CloudVPS.
2020-xx-xx: introduce backbone/transport IPv6 support in transport networs. Configure Neutron with basic IPv6 support. Early testing of the basic setup.
2020-xx-xx: give the DNS setup support for IPv6.
2020-xx-xx: additional review of networking policies, firewalling ACLs and other security aspects of the IPv6 setup.
2020-xx-xx: work out IPv6 support for Toolforge Kubernetes and Toolforge in general.
2020-xx-xx: work out IPv6 support in cloud supporting services, like NFS, LDAP, Wiki replicas, etc.
2020-xx-xx: planificate and introduce IPv6 general availability.
2020-xx-xx: planificate removal of IPv4 NAT dmz_cidr mechanism.

BGP in the transport network

Tracked in Phabricator
Task T245606

Establishing a BGP session between the physical core routers and the neutron virtual routers has many benefits:

allows us to have cloudnet servers in different physical DC rows (and thus, connecting using different VLANS to the core routers)
more robust failover mechanism, given the core router would be aware of the failover too.
the right thing to do if we want to support per-project internal networks in the future

There are currently 2 important static routes defined in the physical core router for each openstack deployment we have:

the routing_source_ip and floating IP address CIDR. This is 185.15.56.0/25 in eqiad1 (nat.openstack.eqiad1.wikimediacloud.org and 185.15.57.0/29 in codfw1dev (nat.openstack.codfw1dev.wikimediacloud.org).
the internal flat network CIDR. This is 172.16.0.0/21 in eqiad1 and 172.16.128.0/24 in codfw1dev.

The static routes in the physical core router point to the external address of the neutron virtual router, i.e, 208.80.155.92/29 in eqiad1 (cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org) and 208.80.153.190/29 in codfw1dev (cloudinstances2b-gw.openstack.codfw1dev.wikimediacloud.org).

Those are the routes we will be sending via BGP to the physical core router. Please note that Neutron BGP only supports route advertising, and won't ingest any route received from the BGP peer. The default transport route is defined in the neutron transport subnet object. Also, note that for establishing the BGP peering session Neutron will use the physical host network and wont use the software-defined transport network. We will be using a dedicated IP address in each cloudnet server, asigned in the br-external bridge interface for the transport network.

The Openstack Pike upstream docs about BGP contain the following important information:

a new agent running in cloudnet servers is required: neutron-bgp-dragent. The configuration file is bgp_dragent.ini. Each cloudnet server runs an agent.
The neutron.conf file should be updated to enable the BGP dynamic routing plugin.
we need to define a common address scope for the subnets we want to advertise using BGP. We are interested in advertising the internal flat range (172.16.x.x.) and the floating IP range (185.15.x.x).
given our subnets objects are already defined, and they don't belong to a subnet pool object, we would need to manually update the database to associate our current subnets to a subnet pool (and then to an address scope).
create the neutron bgp speaker object, and associate it with the external transport network.
create the neutron bgp peer object, and associate it with the bgp speaker created above.
manually schedule the bgp speaker agent to run in each neutron-bgp-dragent (for HA).
have the BGP session established and sync with the upstream BGP peer (the physical core router).

Drawbacks

Using BGP in our particular context and use case may have some drawbacks:

the main original goal was to get rid of static routes in the coure routers. The neutron BGP speaker only announce prefixes for the internal networks (VM network) and a prefix for each floating IP address. This does not include the route for the routing_source_ip address. Thus, we would need to keep this static route in the core router anyway, meaning the benefit of using BGP would be a bit limited. This would be invalidated if in a future we enable self-service tenant networks, in which case using BGP+static routes is probably the way to go.
the address scope mechanism has a really bad interaction with our NAT setup (see dedicated section below).

We could potentially eliminate these drawbacks if we introduce the proposal described in the Eliminate routing_source_ip address section (see above).

Debian packages

In our servers, we need to install the following debian packages and their dependencies (assuming openstack pike):

in cloudnet boxes: python3-os-ken (>= 0.4.1-2) neutron-bgp-dragent (>= 11.0.0-1)
in cloudcontrol boxes: neutron-dynamic-routing-common (>= 11.0.0-1) python-neutron-dynamic-routing (>= 11.0.0-1)

Configuration files

Openstack configuration file for cloudnet servers, file /etc/neutron/bgp_dragent.ini:

[DEFAULT]

[BGP]
# BGP speaker driver class to be instantiated. (string value)
# Beware at some point upstream switched to the OsKen driver. For Pike, it is the Ryu driver.
bgp_speaker_driver = neutron_dynamic_routing.services.bgp.agent.driver.ryu.driver.RyuBgpDriver

# 32-bit BGP identifier, typically an IPv4 address owned by the system running the BGP DrAgent. (string value)
# Use dedicated IP address (br-external.cloudnet2002-dev.wikimedia.org, etc)
# I don't expect any BGP network traffic to be reaching neutron in this address in any case
bgp_router_id = 208.80.153.188

NOTE: the router_id value is unique per server.

Openstack configuration file in cloudcontrol servers, file /etc/neutron/neutron.conf:

[DEFAULT]
[..]
service_plugins = router,neutron_dynamic_routing.services.bgp.bgp_plugin.BgpPlugin
[..]

codfw1dev

Specific configuration data for the codfw1dev deployment (thanks Arzhel for the hints).

Wikinetwork codfw side:

AS14907

cr1-codfw IP: 208.80.153.186

cr2-codfw IP: 208.80.153.187

WMCS codfw side:

AS64711

br-external.cloudnet2002-dev.wikimedia.org: 208.80.153.188

br-external.cloudnet2003-dev.wikimedia.org: 208.80.153.189

The transition from static to BGP is impact-less and can be done anytime (with a small notice to prepare the config). Steps are:

configure BGP and verify it's behaving as expected. At this point the static routes will still have the priority
Remove the static routes and VRRP/VIP config. Need BGP (and optionally BFD) ports open and listening to the routers IPs.

experimentation summary from 2020-03-18

After we conducted experiments, we discovered several limitations. Here is the summary of what happened:

Neutron BGP is outbound only, so we would still need to keep the VRRP VIP between cr1 and cr2 and a static route from cloud -> core
Neutron BGP doesn't allow to setup BGP from an interface managed by Neutron, in this case the cloudnet subinterface with a leg on the transport subnet
This means Arturo had to create manual sub interfaces, only for BGP
- Those interfaces are only for BGP, client traffic still goes through the Neutron managed VIP shared between the two cloudnet
- They can't have iptables, as Neutron manages iptables, but doesn't manage those interfaces

So, basically:

The good side: we can get rid of the static routes from cores -> cloudnet-VIP
It requires requires extra firewall policies on the core side (tech debt) to protect those non iptableables IPs/interfaces
It doesn't improve failover as the cloudnet and VRRP VIPs need to stay

The current Neutron BGP implementation might be just for hypervisors (cloudvirts) to peer with the top of rack switches, and not for cloudnet to peer with the outside internet.

Use VXLAN in the cloud network

Tracked in Phabricator
Task T248881

Neutron supports the VXLAN protocol as the underlying mechanism driver to implement the virtual network overlay. Upon research, the reference implementation uses openvswitch as the engine to process this type of network.

The goal of this proposal is to introduce a new networking model that allow us to overcome the limitations of the physical production network in the datacenter by allowing us to directly connect hypervisors racked in different rows (different l3 subnets).

Each hypervisor host interface acts as a VXLAN tunnel endpoint to which other hypervisors in any other different row can connect, thus creating a VXLAN mesh across the whole datacenter.

considerations

Some things to take into account.

Which subnet/vlan do we use to build the tunnels? we might want to build the vxlan tunnels in the cloud-hosts-xxx vlans (10.x.x.x prod addressing) so we can effectively connect to other tunnel endpoints in other subnets.
If the above is true, then we end up with a spare 10G NIC in each cloudvirt. This spare NIC can be either teamed/bonded/port-channeled for additional redundancy/capacity or leave disconnected.
Some people seem to recommend the underlay network where the overlay network is being build upon should use jumbo frames (higher MTU). This is something that may have implications in the shared prod switches.

intermediate router/firewall

This proposal is being developed in Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh_cloudgw

Another option that could help us in our use case and in the interaction between physical internal prod networks and CloudVPS internal networks is to have an intermediate router/firewall between the prod core routers and the neutron virtual router.

The setup involves a network node that does the IPv4 NAT plus packet policying (firewalling). This node could have a direct connection to a new network to host supporting services (such as NFS, Ceph, etc) so connecting to them doesn't involve interacting with the physical internal prod network. The IPv4 NAT in this node means we no longer need to do it in the Neutron virtual router (so we can get rid of our code customizations). If this network node is a Linux box, then we can do proper statefull firewalling in a much fine-grained fashion than we do now with the prod core router. This setup is also fully compatible with IPv6, given it simplifies a bit the Neutron setup. Also, it should be compatible with BGP.

This proposal complies with all the constraint detailed in this document, but has some drawbacks:

to eliminate SPOF, there should be at least 2 networks nodes providing redundancy and failovering capabilities.
2 more servers/router to develop and maintain.
involves buying new hardware, which involves budgeting, procurement, racking, etc.
will need to re-evaluate how these servers would be installed/monitored, etc. Are they still "prod" realm? If we relocate supporting services (such as NFS, Ceph, etc), what about those?

In summary, this proposal:

introduces a pair of network nodes (preferably Linux boxes) to work as redundant routers/firewall.
allows us to eliminate the Neutron code customizations for NAT. We can do IPv4 NAT in the new network nodes.
we can introduce a new subnet to host supporting services (such as NFS, Ceph, etc) to gain better l3 isolation between them and physical internal prod networks.
is compatible with any IPv6 improvements.
is compatible with any BGP improvements.

Other options

Other options that were considered (and discarded).

neutron address scopes

Tracked in Phabricator
Task T244851

Neutron has a mechanism called address scopes which at first sight seems like the right way to replace our dmz_cidr mechanism. With address scopes you can instruct Neutron to do (or do not) NAT between certain subnets.

Using this mechanism, we would need to create an address scope (let's call it no-nat-address-scope) and then associate the internal instance virtual network subnet with a subnet-pool for it. The database can be hacked to associate an existing subnet with a new subnet pool.

This option was evaluated but several blockers were found:

the scope mechanism works per neutron router interface and has nothing to do with addressing.
the external networks for which we want to exclude NAT are actually external to neutron in the sense that neutron is not aware of them. We would need Neutron to have a direct interface in the affected target subnets.

The dmz_cidr is not correctly implemented using address scopes. After configuring neutron this doesn't work as expected. This is not the functionality we are looking for.

neutron configuration:

root@cloudcontrol2001-dev:~# openstack address scope create --share --ip-version 4 no-nat
root@cloudcontrol2001-dev:~# openstack address scope list
+--------------------------------------+--------+------------+--------+---------+
| ID                                   | Name   | IP Version | Shared | Project |
+--------------------------------------+--------+------------+--------+---------+
| b8e9b95f-150e-4236-afba-8b1f3105e81c | no-nat |          4 | True   | admin   |
+--------------------------------------+--------+------------+--------+---------+
root@cloudcontrol2001-dev:~# openstack subnet pool create --address-scope no-nat --no-share --no-default --default-quota 0 --pool-prefix 10.0.0.0/8 --pool-prefix 208.80.152.0/22 external-subnet-pools
root@cloudcontrol2001-dev:~# openstack subnet pool create --address-scope no-nat --share --default --default-quota 0 --pool-prefix 172.16.128.0/24 cloud-instances2b-codfw
root@cloudcontrol2001-dev:~# openstack subnet pool list
+--------------------------------------+-------------------------+-----------------------------+
| ID                                   | Name                    | Prefixes                    |
+--------------------------------------+-------------------------+-----------------------------+
| 01476a5e-f23c-4bf3-9b16-4c2da858b59d | external-subnet-pools   | 10.0.0.0/8, 208.80.152.0/22 |
| d129650d-d4be-4fe1-b13e-6edb5565cb4a | cloud-instances2b-codfw | 172.16.128.0/24             |
+--------------------------------------+-------------------------+-----------------------------+
root@cloudcontrol2001-dev:~# openstack subnet pool show 01476a5e-f23c-4bf3-9b16-4c2da858b59d
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| address_scope_id  | b8e9b95f-150e-4236-afba-8b1f3105e81c | <---
| created_at        | 2020-02-11T13:45:20Z                 |
| default_prefixlen | 8                                    |
| default_quota     | 0                                    |
| description       | external networks with no NATs       |
| id                | 01476a5e-f23c-4bf3-9b16-4c2da858b59d |
| ip_version        | 4                                    |
| is_default        | False                                |
| max_prefixlen     | 32                                   |
| min_prefixlen     | 8                                    |
| name              | external-subnet-pools                |
| prefixes          | 10.0.0.0/8, 208.80.152.0/22          |
| project_id        | admin                                |
| revision_number   | 0                                    |
| shared            | False                                |
| tags              |                                      |
| updated_at        | 2020-02-11T13:45:20Z                 |
+-------------------+--------------------------------------+
root@cloudcontrol2001-dev:~# openstack subnet pool show d129650d-d4be-4fe1-b13e-6edb5565cb4a
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| address_scope_id  | b8e9b95f-150e-4236-afba-8b1f3105e81c | <---
| created_at        | 2020-02-11T16:59:02Z                 |
| default_prefixlen | 24                                   |
| default_quota     | None                                 |
| description       | main subnet pool                     |
| id                | d129650d-d4be-4fe1-b13e-6edb5565cb4a |
| ip_version        | 4                                    |
| is_default        | True                                 |
| max_prefixlen     | 32                                   |
| min_prefixlen     | 8                                    |
| name              | cloud-instances2b-codfw              |
| prefixes          | 172.16.128.0/24                      |
| project_id        | admin                                |
| revision_number   | 0                                    |
| shared            | True                                 |
| tags              |                                      |
| updated_at        | 2020-02-11T16:59:02Z                 |
+-------------------+--------------------------------------+
root@clouddb2001-dev:~# (mariadb-neutron DB) update subnets subnetpool_id='f53ad212-6809-492b-b63e-81f6739a56eb' where id='7adfcebe-b3d0-4315-92fe-e8365cc80668';
root@cloudcontrol2001-dev:~# openstack subnet list
+--------------------------------------+------------------------------------+--------------------------------------+-------------------+
| ID                                   | Name                               | Network                              | Subnet            |
+--------------------------------------+------------------------------------+--------------------------------------+-------------------+
| 31214392-9ca5-4256-bff5-1e19a35661de | cloud-instances-transport1-b-codfw | 57017d7c-3817-429a-8aa3-b028de82cdcc | 208.80.153.184/29 |
| 651250de-53ca-4487-97ce-e6f65dc4b8ec | HA subnet tenant admin             | d967e056-efc3-46f2-b75b-c906bb5322dc | 169.254.192.0/18  |
| 7adfcebe-b3d0-4315-92fe-e8365cc80668 | cloud-instances2-b-codfw           | 05a5494a-184f-4d5c-9e98-77ae61c56daa | 172.16.128.0/24   |
| b0a91a7b-2e0a-4e82-b0f0-7644f2cfa654 | cloud-codfw1dev-floating           | 57017d7c-3817-429a-8aa3-b028de82cdcc | 185.15.57.0/29    |
+--------------------------------------+------------------------------------+--------------------------------------+-------------------+
root@cloudcontrol2001-dev:~# openstack subnet show 7adfcebe-b3d0-4315-92fe-e8365cc80668
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| allocation_pools  | 172.16.128.10-172.16.128.250         |
| cidr              | 172.16.128.0/24                      |
| created_at        | 2018-03-16T21:41:08Z                 |
| description       |                                      |
| dns_nameservers   | 208.80.153.78                        |
| enable_dhcp       | True                                 |
| gateway_ip        | 172.16.128.1                         |
| host_routes       |                                      |
| id                | 7adfcebe-b3d0-4315-92fe-e8365cc80668 |
| ip_version        | 4                                    |
| ipv6_address_mode | None                                 |
| ipv6_ra_mode      | None                                 |
| name              | cloud-instances2-b-codfw             |
| network_id        | 05a5494a-184f-4d5c-9e98-77ae61c56daa |
| project_id        | admin                                |
| revision_number   | 1                                    |
| service_types     |                                      |
| subnetpool_id     | d129650d-d4be-4fe1-b13e-6edb5565cb4a | <---
| tags              |                                      |
| updated_at        | 2019-10-02T15:27:33Z                 |
+-------------------+--------------------------------------+

There shoud be no NAT between subnets in the same address scope, but ping test is wrong:

13:59:32.452843 IP cloudinstances2b-gw.openstack.codfw1dev.wikimediacloud.org > codfw1dev-recursor0.wikimedia.org: ICMP echo request, id 21081, seq 1, length 64
13:59:32.452883 IP codfw1dev-recursor0.wikimedia.org > cloudinstances2b-gw.openstack.codfw1dev.wikimediacloud.org: ICMP echo reply, id 21081, seq 1, length 64

ping test using the current dmz_cidr mechanism (the expected behaviour):

14:05:32.173816 IP 172.16.128.20 > codfw1dev-recursor0.wikimedia.org: ICMP echo request, id 21607, seq 6, length 64
14:05:32.173848 IP codfw1dev-recursor0.wikimedia.org > 172.16.128.20: ICMP echo reply, id 21607, seq 6, length 64

neutron address scopes revisited

The CloudVPS configuration today uses a single transport network, subnet and router interface as the default gateway. Because of this we cannot configure address scopes to selectively disable NAT on traffic to internal core services and RFC1918 address.

In order to take full advantage of address scopes, each subnet pool that belongs to an address scope must have a dedicated network and interface attached to the Neutron virtual router.

Using the router's routing table (cloudnet network namespace) traffic is routed to an attached interface
1. Traffic with no route uses the default gateway on the qg- interface
2. Traffic with a matching route maps to a qr- interface
Once an interface has been identified, the connection traffic flows through that interface into IPTABLES
1. Interfaces configured with a network and subnet pool associated with an address scope will be marked with IPTABLES
2. Disable NAT if both the source and destination interfaces have been marked by IPTABLES
3. Continue with the default policy enabling NAT for all other traffic

neutron direct connection to physical internal prod networks

Another option would be to give the neutron router a direct connection to the affected physical internal prod networks. This way, Neutron is fully aware of those networks (it has addressing in each VLAN/subnet) and since there is a direct route, no NAT needs to happen and VMs can connect directly preserving the source IP address.

The approach is also interesting because it could allow us to better leverage the Neutron address scope mechanism (see above). Given Neutron is fully aware (and connected) to the different prod networks, with different physical interfaces, the address scope implementation could theoretically work in this environment.

This option has been discarded because it violates the clear separation constraint (see above).

Wikimedia Cloud Services team/EnhancementProposals/Network refresh

Constraints

VM instance identification vs NAT

Datacenter row confinement

Clear separation and no special trust

Proposals

Eliminate routing_source_ip address

Benefits

Drawbacks

Required changes

Introduce IPv6 as a replacement of dmz_cidr

Openstack

Toolforge kubernetes

Timeline

BGP in the transport network

Drawbacks

Debian packages

Configuration files

codfw1dev

experimentation summary from 2020-03-18

Use VXLAN in the cloud network

considerations

intermediate router/firewall

Other options

neutron address scopes

neutron address scopes revisited

neutron direct connection to physical internal prod networks

See also