Portal:Cloud VPS/Admin/notes/NAT loophole

2020-10-05

WMCS meeting on 2020-10-05.

summary

We tried to collect (off the top of our heads) a list of current use cases for cloud --> prod connections that explain why the current NAT loophole is required.

Basically, 3 types of use cases:

  • data replication prod --> cloud
  • rate limiting
  • access control

NFS servers are generally the main concern. They involve all 3 types of use cases. Options range from a full redesign and rearchitecture to an incremental update that relocates them to a different network (or addressing space) and mitigates the concerns a bit.

Some incremental changes we could do for NFS this fiscal year:

  • separate control plane (ssh) and data plane (NFS) interfaces and bridge the data plane directly into the CloudVPS network. This might be bad because of the Neutron flat topology.
  • relocate NFS servers "inside" the cloudgw umbrella.
  • move NFS servers into cloud VMs using cinder volumes (see the sketch after this list). We don't do this right now, and we are not sure we have enough Ceph capacity for it.
  • create a dedicated cloud (non-prod) VLAN to host NFS. Might also work for other supporting services (replicas?)
  • simply renumber the NFS servers' IP addresses to public IPv4 addresses.
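
As a rough idea of what the cinder-volume option would involve on the OpenStack side, here is a sketch using openstacksdk. The cloud name, server name, and volume size are made-up placeholders, not the real WMCS setup.

    import openstack

    # Placeholder cloud name from clouds.yaml; not the real WMCS config.
    conn = openstack.connect(cloud="cloudvps-example")

    # Create a Cinder volume to hold the shared data (size in GiB is illustrative).
    volume = conn.create_volume(size=500, name="tools-nfs-data-example")

    # Attach it to a hypothetical VM that would serve NFS from inside the cloud.
    server = conn.get_server("nfs-server-example")
    conn.attach_volume(server, volume)

    print("Attached", volume.id, "to", server.name)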

raw notes from etherpad

Cloud VPS guests reach "internal" hosts like the Wiki Replicas, NFS servers, and the production MediaWiki servers via private IP addresses.
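
For illustration, a minimal sketch of what that looks like from inside a VM: a plain TCP check against a service on a private address that is only reachable from the cloud realm because of the NAT loophole. The address and port below are placeholders, not real production values.

    import socket

    # Hypothetical private address of an NFS server in the production realm;
    # reachable from a Cloud VPS guest only because the NAT loophole exempts
    # this traffic from the normal edge path.
    NFS_SERVER = "10.64.37.18"  # placeholder, not a real host
    NFS_PORT = 2049             # standard NFS port

    def can_reach(host, port, timeout=3):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        print("NFS reachable:", can_reach(NFS_SERVER, NFS_PORT))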

Problems:

  • WMCS Cloud is inside a datacenter design that is set up for web hosting (specifically Wikipedia), not necessarily for running a cloud.
  • Data must be transferred from production to cloud realm (dumps, replicas, etc)

Use Cases:

  • Prevent rate-limiting for cloud clients (maybe not super important)
    • Bryan thinks we can change this just by announcing the new IP range
  • Toolforge NFS
  • Dumps NFS access
  • Maps/Scratch NFS
  • wiki replica access
    • wiki replicas contain non-public information at the table level, which is redacted at the view level (see the sketch after this list)
  • Cirrussearch replicas
  • LDAP directory
  • cloudmetrics servers live on physical hardware (Prometheus)
  • OpenStack API access for VMs
    • (needs https before we can make these truly public)
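
To make the wiki replica point concrete, the sketch below queries a replica through the usual redacted views. The hostname, credentials file, and database name follow Toolforge conventions but are placeholders here; the real service names may differ.

    import pymysql

    # Placeholder connection details modelled on Toolforge conventions.
    conn = pymysql.connect(
        host="enwiki.example.db.svc.wikimedia.cloud",  # placeholder hostname
        read_default_file="~/replica.my.cnf",          # Toolforge-style credentials
        database="enwiki_p",
        charset="utf8mb4",
    )

    # The underlying tables hold non-public data; clients only ever see the
    # redacted views, so suppressed material is already filtered out here.
    with conn.cursor() as cur:
        cur.execute("SELECT rev_id, rev_timestamp FROM revision LIMIT 5")
        for rev_id, rev_timestamp in cur.fetchall():
            print(rev_id, rev_timestamp)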

Use case types:

  • data replication prod --> cloud
  • rate limiting
  • access control

"Solutions" Ideas:

  • Create dedicated cloud VLAN
  • Move all NFS servers behind our internal cloudgw NAT router, preserving the NAT exception but having the hardware live in the cloud hardware realm.
  • Move all NFS servers into cloud VMs with cinder volumes
  • On NFS servers (toolforge, maps, scratch) separate control (ssh) and data (NFS) plane interfaces.
    • Control plane address remains in whatever network
    • Data plane connection is bridged directly inside CloudVPS virtual network
  • Rebuild Wikireplicas as a scrubbed representation that is SQL-queryable, possibly done in Hadoop as a joint effort with the Analytics team

Move NFS servers into new VLANs (hanging from cloudsw?). Bridge into private network? Escape the NAT?

  • This only helps Toolforge
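
A rough sketch of what a dedicated cloud VLAN data-plane interface could look like on an NFS host, using pyroute2 (needs root; every name, VLAN ID, and address here is invented for illustration and is not the actual cloudsw plan):

    from pyroute2 import IPRoute

    PARENT_IF = "eno1"                 # existing primary interface (ssh/control plane)
    VLAN_ID = 2107                     # hypothetical dedicated cloud VLAN
    VLAN_IF = f"{PARENT_IF}.{VLAN_ID}"
    DATA_PLANE_ADDR = "172.16.128.10"  # placeholder data-plane address

    ip = IPRoute()
    parent_idx = ip.link_lookup(ifname=PARENT_IF)[0]

    # Create an 802.1Q sub-interface for the data plane (NFS), leaving the
    # parent interface in place for control traffic.
    ip.link("add", ifname=VLAN_IF, kind="vlan", link=parent_idx, vlan_id=VLAN_ID)
    vlan_idx = ip.link_lookup(ifname=VLAN_IF)[0]
    ip.addr("add", index=vlan_idx, address=DATA_PLANE_ADDR, prefixlen=24)
    ip.link("set", index=vlan_idx, state="up")
    ip.close()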

Things we can do this fiscal year?

  • Rate limiting changes (see the sketch after this list)
  • Ask for cloud VLANs
  • Public IPs for NFS
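
On the rate-limiting item above, the idea is simply that whatever limiter sits in front of production either exempts or separately buckets the announced cloud egress range. A self-contained Python sketch of that logic, with a made-up range and limits:

    import ipaddress
    import time
    from collections import defaultdict

    # Placeholder egress range for Cloud VPS NAT; not the real production value.
    CLOUD_EGRESS = ipaddress.ip_network("198.51.100.0/24")

    MAX_REQUESTS = 100    # per window, for ordinary clients
    WINDOW_SECONDS = 60
    _seen = defaultdict(list)

    def allow_request(client_ip: str) -> bool:
        """Rate-limit ordinary clients; let the cloud egress range through."""
        if ipaddress.ip_address(client_ip) in CLOUD_EGRESS:
            return True  # cloud traffic is recognised by range, not limited here
        now = time.monotonic()
        recent = [t for t in _seen[client_ip] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _seen[client_ip] = recent
        return len(recent) <= MAX_REQUESTS

    print(allow_request("198.51.100.10"))  # cloud range: always allowed
    print(allow_request("203.0.113.7"))    # ordinary client: counted per window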

If we are a production service, planning needs to consider us as supported, not as a problem or exception

Questions:

  • Can production change the exception list?
    • They are moving switch config to Netbox and automating all config. WMCS asks become snowflakes, which SRE is trying to eliminate
  • What if we used public-IP NFS servers with a firewall? Is that progress? (see the sketch after this list)
  • Do our existing switches have enough capacity to add more hosts?
    • No, and they are also in different rows.
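
On the public-IP-plus-firewall question, the gist is that the NFS ports would be closed to everything except the cloud source ranges. A small Python sketch that renders such an allowlist as nftables rules; the ranges, ports, and chain name are placeholders, and the output is meant to be reviewed, not applied blindly:

    import ipaddress

    # Placeholder cloud source ranges; the real ones would come from Netbox/puppet.
    CLOUD_RANGES = ["198.51.100.0/24", "203.0.113.0/26"]
    NFS_PORTS = [111, 2049]  # rpcbind and NFS

    def nft_rules(chain="inet filter input"):
        """Render nftables rules allowing only the cloud ranges to reach NFS."""
        rules = []
        for cidr in CLOUD_RANGES:
            net = ipaddress.ip_network(cidr)  # validates the range
            for port in NFS_PORTS:
                rules.append(f"add rule {chain} ip saddr {net} tcp dport {port} accept")
        for port in NFS_PORTS:
            rules.append(f"add rule {chain} tcp dport {port} drop")
        return rules

    if __name__ == "__main__":
        print("\n".join(nft_rules()))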