2020-11-25 WMCS network checkin

Agenda:

status updates from arturo
questions, feedback
next, TODO, etc
Q3 OKR planning
- Faidon's strawdog proposal:

   "Reduce the number of ACL exceptions from the cloud tenant network to production (cloud-in4) by {N terms/N%/etc.}"

       aligned to TI-HC-FLD

   a) Complete T264993 (Audit cloud-in4 ACL)
   b) Complete & merge r641977 ("cloud: dmz_cidr: detail the list of private production addresses")
   c) Complete & merge r643269 ("Allow specific flows from 172.16/12 to prod"); carry that to dmz_cidr
   d) Reduce the list by a meaningful percentage/amount. Potentially in scope:
       - https://phabricator.wikimedia.org/T209011 (NAT wiki traffic)
       - https://phabricator.wikimedia.org/T207533 (Move labs-recursors in WMCS)
       - https://phabricator.wikimedia.org/T207543 (Move labmon (Graphite, StatsD) into a Cloud VPS)
       - https://phabricator.wikimedia.org/T207536 (parent task for support services)
       - https://phabricator.wikimedia.org/T216422 (Virtualize NFS servers used exclusively by Cloud VPS tenants)
       - others not previously documented but discovered during (a)/(b)/(c)
       
   (a), (b) and (c) can happen in the remainder of Q2, paving the road for (d) in Q3

FYI, slightly related: Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621

status updates from arturo

requested a server for a second cloudgw device in codfw: https://phabricator.wikimedia.org/T268016
arturo's plan: once this new server arrives and all the testing and validation is finished, we move forward with eqiad and with a cloudsw device in codfw.
refreshed NFS ideas page: https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/notes/NAT_loophole/NFS
bootstrapped a practical guide for prod<->cloud networking:
- https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Production_Cloud_bridging
- it was hinted in a meeting with analytics that this guidelines page should be interesting for other teams as well as for ourselves.
- the source 'policy' for the guidelines is this document: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy
we can generalize the NFS architecture problem into a general one: how do we 'bridge' prod/cloud when a VM's private address needs to contact a prod service endpoint?
- this might be the case for both NFS and Cinder/Ceph, or others in the future.
- Arturo proposes to discuss this today
misc: 2 patches under review for clarity in network policies:
- 641977: cloud: dmz_cidr: detail the list of private production addresses | https://gerrit.wikimedia.org/r/c/operations/puppet/+/641977
- 643269: Allow specific flows from 172.16/12 to prod | https://gerrit.wikimedia.org/r/c/operations/homer/public/+/643269
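
For readers unfamiliar with what "allow specific flows" means in practice, here is a minimal sketch of the idea in Python. It is not the actual filter from r643269: the allowlist entries use documentation-only addresses, and the names ALLOWED_FLOWS and flow_allowed are hypothetical. Only the 172.16.0.0/12 source range comes from the patch title.

```python
import ipaddress

# Source range named in the title of r643269.
CLOUD_TENANT_NET = ipaddress.ip_network("172.16.0.0/12")

# Hypothetical allowlist of (destination network, protocol, port).
# These entries are made up for illustration (documentation ranges),
# they are NOT the real production exceptions.
ALLOWED_FLOWS = [
    (ipaddress.ip_network("192.0.2.0/24"), "tcp", 443),
    (ipaddress.ip_network("198.51.100.53/32"), "udp", 53),
]


def flow_allowed(src, dst, proto, port):
    """Return True only if a flow from the tenant range matches an explicit exception."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if src_ip not in CLOUD_TENANT_NET:
        # This sketch only models flows that originate in the tenant range.
        return False
    return any(
        dst_ip in net and proto == allowed_proto and port == allowed_port
        for net, allowed_proto, allowed_port in ALLOWED_FLOWS
    )


print(flow_allowed("172.16.1.5", "192.0.2.10", "tcp", 443))
print(flow_allowed("172.16.1.5", "192.0.2.10", "tcp", 22))
```

In this model, each entry in the allowlist is one exception, so shortening the list is what the proposed Q3 objective would measure.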

notes

Arzhel thinks NFS document should be reduced in number of options. Arturo agrees.
Faidon: thanks for the audits of the cloud-in filter etc. More reviews to come.
Faidon: (a), (b) and (c) can happen in the remainder of Q2
NAT Wiki traffic -- is this more bad news for the community? This will restrict how they query the replicas, and could introduce limits on API calls, etc., that folks are using to mitigate the wiki replicas changes.
- Birgitt: Ideally we don't overload community; needs a balance to encourage buy-in
- Faidon: Intention isn't to rate limit, don't need to focus on it first if there's community impact concerns
Arzhel: IPv6 could solve some, some intelligent ordering would help
- IPv6 would be last; it requires large design changes on the Kubernetes end, and gridengine CANNOT do it
Nicholas: Another potential Q3 goal is to look at Network Security Audit
- Faidon: Network/Infra Security is in SRE. May have someone to help.
Nicholas: Q2 OKR concerns?
- SRE KRs complete.
Arturo: If we can't provide a service natively within the cloud, how should we bridge them? (premade IP from VM reaching an IP from outside)
- Brooke: Can this be done without exposing the private IP?
- Brooke: For example, how to set up an OLAP view
Bridging; it's more than just network, see https://upload.wikimedia.org/wikipedia/labs/thumb/9/9c/NFS.png/1920px-NFS.png
Arturo: Can this interaction be generalized in some way?
Faidon: Mental model; think about it as similar to external provider wanting access to internal resources
Faidon: There should be clear lines of separation. Think about it as if there were no private backhaul to the internal network.
Arturo: Openstack has idea of provider services; provider hardware is co-located next to the cloud.
Brooke: Services that don't live in VMs, so not in a segregated space. Nothing we are doing seems like a multi-tenant network. Good to think as if a VM cannot bridge, but bridges do exist. So we must think about it.
Brooke: In process of redesigning wiki replicas, so some of these questions are relevant today. We can't pretend it's completely external as it's not.
Faidon: Unclear if this is a special case at the moment. Provider network doesn't have to be expanded to everything in the network.
Arturo: Can we have cloud-dedicated VLANs on production hardware? Accessible by VMs?
Faidon: Yes, possible.
VLAN would be to host services that can't be hosted anywhere else. Inside Cloud first preference.
Faidon: We should be looking to reduce the number of crosses; the number of places things can cross
This tradeoff already exists as data is crossing.
Loki example; no ssh access from VMs, but needs to access them. Can't virtualize for reasons*
Faidon: Bare metal for users. Openstack Ironic. One tenant managed by cloud services team. This could be a solution. Nothing about loki requires it to be in production.
Arturo: How do we have something physical without being in production?
Brooke: Maybe Ironic is an option?
Faidon: Goal is to avoid bridging. Any option is open. More services shouldn't mean more exposure.
Faidon: Last time this happened. The cloud infra project existed. Make one tenant in cloud services to run all the ancillary services for the rest of the cloud.
Brooke: Cloudinfra has worked, but has limits. "Bridging the realms". 1) Networking side, needs to be clear and understood. 2) Data flows for data services. This one is harder.
Faidon: Data services could even be thought of separately from provider network.
Brooke: Production data somehow has to get to cloud guests.
Arturo: Do we fork production services?
Brooke: This would solve network bridging concerns, then it would only be about data flows
Faidon: Wouldn't object. But that just moves the boundary. They don't disappear, they just move.
Brooke: But then a cloud guest wouldn't be an attack vector for production anymore. Only for cloud.
Faidon: We're all in this together though. So don't want cloud to be owned either.

actions

Plan shared Q3 objective to reduce ACL exceptions