Portal:Cloud VPS/Admin/notes/NAT loophole
2020-10-05
WMCS meeting on 2020-10-05.
Summary
We tried to collect (off the top of our heads) a list of the current cloud --> prod connection use cases for which the current NAT loophole is required.
Basically, there are 3 types of use cases:
- data replication prod --> cloud
- rate limiting
- access control
NFS servers are generally the main concern; they involve all 3 types of use cases. Options range from a full redesign and rearchitecture to an incremental update that relocates them to a different network (or addressing space) and somewhat mitigates the concerns.
Some incremental changes we could do for NFS this fiscal year:
- separate the control plane (ssh) and data plane (NFS) interfaces and bridge the data plane directly into the CloudVPS network. This might be problematic because of the Neutron flat topology.
- relocate NFS servers "inside" the cloudgw umbrella.
- move NFS servers into cloud VMs using Cinder volumes. We don't use Cinder right now, and we're not sure we have enough Ceph capacity for this.
- create a dedicated cloud (non-prod) VLAN to host NFS. This might also work for other supporting services (replicas?)
- simply renumber the NFS servers' IP addresses into public IPv4 addresses.
Raw notes from etherpad
- Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh
- Portal:Cloud_VPS/Admin/Neutron#dmz_cidr
Cloud VPS guests reach "internal" hosts like the Wiki Replicas, NFS servers, and the production MediaWiki servers via private IP addresses.
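Conceptually, the loophole is a source-NAT bypass: VM traffic to most destinations is NATed behind a shared egress address, while traffic to a short list of production CIDRs keeps the VM's private source address, so the backend can do per-VM access control and rate limiting. A minimal sketch of that decision, assuming hypothetical names and purely illustrative CIDR/IP values (the real list is the Neutron dmz_cidr config linked above):

```python
import ipaddress

# Illustrative values only -- the real list lives in the Neutron config
# documented at Portal:Cloud_VPS/Admin/Neutron#dmz_cidr.
DMZ_CIDRS = [
    ipaddress.ip_network("10.0.0.0/8"),  # hypothetical prod range
]
ROUTING_SOURCE_IP = ipaddress.ip_address("185.15.56.1")  # hypothetical NAT address

def source_address(vm_ip: str, dst_ip: str) -> str:
    """Return the source address prod sees for a VM -> dst flow.

    Flows whose destination falls inside a dmz_cidr range skip source
    NAT (the "loophole"), so the backend sees the VM's private address.
    Everything else is NATed behind the shared egress address.
    """
    dst = ipaddress.ip_address(dst_ip)
    if any(dst in net for net in DMZ_CIDRS):
        return vm_ip                   # NAT exemption: private IP reaches prod
    return str(ROUTING_SOURCE_IP)      # normal egress: shared NAT address

# Example: a VM talking to a (hypothetical) prod host vs. the wider internet
print(source_address("172.16.0.42", "10.64.37.10"))     # -> 172.16.0.42
print(source_address("172.16.0.42", "208.80.154.224"))  # -> 185.15.56.1
```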
Problems:
- The WMCS cloud sits inside a datacenter design that is set up for web hosting (specifically Wikipedia), not necessarily for running a cloud.
- Data must be transferred from the production realm to the cloud realm (dumps, replicas, etc.)
Use Cases:
- Prevent rate-limiting for cloud clients (maybe not super important)
- Bryan thinks we can change this just by announcing the new IP range
- Toolforge NFS
- Dumps NFS access
- Maps/Scratch NFS
- Wiki Replica access
- Wiki Replicas contain non-public information at the table level, which is redacted at the view level
- CirrusSearch replicas
- LDAP directory
- cloudmetrics servers live on physical hardware (Prometheus)
- OpenStack API access for VMs
- (needs HTTPS before we can make these truly public)
Use case types:
- data replication prod --> cloud
- rate limiting
- access control
"Solutions" Ideas:
- Create dedicated cloud VLAN
- Move all NFS servers behind our internal cloudgw NAT router, preserving the NAT exception but having the hardware live in the cloud hardware realm.
- Move all NFS servers into cloud VMs with Cinder volumes
- On NFS servers (toolforge, maps, scratch) separate control (ssh) and data (NFS) plane interfaces.
- The control plane address remains in whatever network it is already in
- The data plane connection is bridged directly into the CloudVPS virtual network
- Rebuild the Wiki Replicas as a scrubbed representation that is SQL-queryable, possibly done in Hadoop as a joint effort with the Analytics team
- Move NFS servers into new VLANs (hanging from cloudsw?). Bridge into the private network? Escape the NAT?
- This only helps Toolforge
Things we can do this fiscal year?
- Rate limiting changes
- Ask for cloud VLANs
- Public IPs for NFS
If we are a production service, planning needs to consider us as supported, not as a problem or exception
Questions:
- Can production change the exception list?
- They are moving switch config to Netbox and automating all config. WMCS asks become snowflakes, which SRE is eliminating
- What if we used public-IP NFS servers behind a firewall? Is that progress? (see the sketch at the end of these notes)
- Do our existing switches have enough room to add more hosts?
- No, and they are also in different rows.
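On the "public-IP NFS servers behind a firewall" question above: the sketch below shows the kind of source-based allowlist such a firewall would have to encode. It is only a concept check, assuming hypothetical names (ALLOWED_CLIENT_RANGES, NFS_PORTS, allowed) and illustrative address ranges; it does not reflect any existing ruleset.

```python
import ipaddress

# Hypothetical cloud-side client ranges that should still reach NFS if
# the servers were renumbered to public IPv4 addresses.  Real values
# would come from the Neutron / cloudgw configuration.
ALLOWED_CLIENT_RANGES = [
    ipaddress.ip_network("185.15.56.0/25"),  # cloud egress / floating IPs (illustrative)
    ipaddress.ip_network("172.16.0.0/21"),   # VM private range (illustrative)
]
NFS_PORTS = (111, 2049)  # rpcbind + NFSv4

def allowed(client_ip: str, port: int) -> bool:
    """Firewall-style check: only cloud clients may reach the NFS ports."""
    ip = ipaddress.ip_address(client_ip)
    return port in NFS_PORTS and any(ip in net for net in ALLOWED_CLIENT_RANGES)

assert allowed("172.16.1.5", 2049)        # cloud VM: permitted
assert not allowed("198.51.100.7", 2049)  # random internet host: dropped
```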