Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-10-07-checkin
Agenda:
- Q2 Shared Planning
Action Items from last time:
- Determine how to prioritize the security concerns of both SRE and WMCS sooner than currently proposed. This includes closing the dmz_cidr loophole this fiscal year, per SRE's OKRs.
What services connect from prod to cloud?
Three use cases:
* Rate Limiting
* Data replication from prod to cloud
* Access Control
Ideas:
* Separate data and control planes for NFS servers: no L3 connection for data between cloud and prod
* Move L3 IP addresses for NFS servers inside of the cloudgw umbrella (NAT exception remains, but won't cross to prod)
* Migrate NFS data to the Ceph cluster and use Cinder volumes (we lack the storage today for everything, especially dumps)
* Create a specific cloud VLAN
* Move all NFS servers to public IPs (all but 1 (+1 shadow) are already public)
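As a rough illustration of the last idea, an audit of which NFS servers still sit on private addresses could look like the sketch below. The host names and IP addresses are made up for the example and do not reflect the real inventory; only the Python standard library `ipaddress` module is used.

```python
import ipaddress


def private_hosts(servers):
    """Return the names of servers whose address is in private space."""
    return [
        name
        for name, addr in servers.items()
        if ipaddress.ip_address(addr).is_private
    ]


# Hypothetical inventory: names and addresses are illustrative only.
nfs_servers = {
    "nfs-a": "208.80.154.10",
    "nfs-b": "208.80.154.11",
    "nfs-primary": "10.64.0.10",
    "nfs-shadow": "10.64.0.11",
}

# Servers listed here would still need to move to public IPs.
print(private_hosts(nfs_servers))
```

With an inventory shaped like this, the primary and its shadow are the only entries reported, which matches the "all but 1 (+1 shadow)" situation described above.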
Q2 Planning
* Arzhel / SRE will have limited time in Q2 due to resourcing and holiday
* Time for PoC?
** Best-effort only
* Shift gears and focus on the separation ideas above; less interconnected work that won't block on Arzhel
PoC
- Close to finished from a switch config point of view. The only missing piece is seeing traffic flow
- beyond that WMCS won't require Arzhel
Feel free to use other SRE Resources as needed
Questions:
- why the focus on NFS?
- We don't get any benefit until the entire proposal is finished. Closing smaller services brings no incremental benefit
- How to manage a machine that crosses the realms (prod, cloud)? Aren't there still escalation paths if it sits in an in-between space?
- Yes, at some point we need puppetmasters and monitoring. Data replication is tricky: at some point that data has to traverse the boundary (prod->cloud) in order to provide data services. In particular, MySQL replication requires authentication (wikireplicas).
- Does IPv6 mitigate concerns?
- Right now cloud servers access private IPs. Even with IPv6, this access would continue
- Hybrid solutions perhaps less desirable?
- Separating data and control would allow for boundary designs
- Do one or more of these ideas help make progress on the SRE OKR?
- Yes, separating out the giant ACL exception is the goal. WMCS should pursue ideas in this space and plan them for this year.