Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-10-07-checkin

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh

Agenda:

  • Q2 Shared Planning

Action Items from last time:

  • Determine how to address the security concerns of both SRE and WMCS sooner than currently proposed. This includes closing the dmz_cidr loophole this fiscal year, per SRE's OKRs.

What services connect from prod to cloud?

Three use cases:

   * Rate Limiting
   * Data replication from prod to cloud
   * Access Control
   

Ideas:

   * Separate data and control planes for NFS servers -- no L3 connection for data between cloud and prod
   * Move L3 IP addresses for NFS servers inside of cloudgw umbrella (NAT exception remains, but won't cross to prod)
   * Migrate NFS data to the Ceph cluster and use Cinder volumes (there is not enough storage today for everything, especially dumps)
   * Create a specific cloud VLAN
   * Move all NFS servers to public IPs (all but one, plus one shadow, are already public)
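
The last idea implies checking which NFS server addresses are still private. A minimal sketch of such a check, using Python's standard ipaddress module. The hostnames and addresses are hypothetical placeholders, not the real inventory:

```python
import ipaddress

# Hypothetical inventory: hostnames and addresses are placeholders,
# not real NFS servers from this proposal.
NFS_SERVERS = {
    "nfs-a.example": "10.64.0.10",  # RFC 1918 private address
    "nfs-b.example": "8.8.8.8",     # stand-in for a publicly routable address
}


def private_servers(inventory):
    """Return the hostnames whose address is not globally routable."""
    return sorted(
        host
        for host, addr in inventory.items()
        if not ipaddress.ip_address(addr).is_global
    )


print(private_servers(NFS_SERVERS))
```

Any host this reports would still need to move to a public IP under that idea.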

Q2 Planning

   * Arzhel / SRE will have limited time in Q2 due to resourcing and holidays
   * Time for PoC?
     * Best-effort only
   * Shift gears and focus on the separation ideas above: less interconnected work that won't block on Arzhel

PoC

  • Close to finished from a switch configuration point of view. The only missing piece is seeing traffic flow.
    • Beyond that, WMCS won't require Arzhel

Feel free to use other SRE resources as needed

Questions:

  • Why the focus on NFS?
    • We don't get any benefit until the entire proposal is finished. There are no incremental benefits from closing off smaller services.
  • How to manage a machine that crosses the realms (prod, cloud)? Aren't there still escalation paths if they sit in an in-between space?
    • Yes, at some point we need puppetmasters and monitoring. Data replication is tricky. At some point that data has to traverse the boundary (prod->cloud) in order to provide data services. In particular, MySQL replication requires authentication (wikireplicas).
  • Does IPv6 mitigate concerns?
    • Right now cloud servers access private IPs. Even with IPv6, this access would continue.
  • Hybrid solutions perhaps less desirable?
    • Separating data and control would allow for boundary designs
  • Do one or more of these ideas help make progress on the SRE OKR?
    • Yes, separating out the giant ACL exception is the goal. WMCS should pursue ideas in this space and plan them for this year.