Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-11-25-checkin

From Wikitech

2020-11-25 WMCS network checkin

Agenda:

  • status updates from arturo
  • questions, feedback
  • next, TODO, etc
  • Q3 OKR planning
    • Faidon's strawdog proposal:
   "Reduce the number of ACL exceptions from the cloud tenant network to production (cloud-in4) by {N terms/N%/etc.}"
       aligned to TI-HC-FLD
   a) Complete T264993 (Audit cloud-in4 ACL)
   b) Complete & merge r641977 ("cloud: dmz_cidr: detail the list of private production addresses")
   c) Complete & merge r643269 ("Allow specific flows from 172.16/12 to prod"); carry that to dmz_cidr
   d) Reduce the list by a meaningful percentage/amount. Potentially in scope:
       - https://phabricator.wikimedia.org/T209011 (NAT wiki traffic)
       - https://phabricator.wikimedia.org/T207533 (Move labs-recursors in WMCS)
       - https://phabricator.wikimedia.org/T207543 (Move labmon (Graphite, StatsD) into a Cloud VPS)
       - https://phabricator.wikimedia.org/T207536 (parent task for support services)
       - https://phabricator.wikimedia.org/T216422 (Virtualize NFS servers used exclusively by Cloud VPS tenants)
       - others not previously documented but discovered during (a)/(b)/(c)
       
   (a), (b) and (c) can happen in the remainder of Q2, paving the road for (d) in Q3
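
One way the "{N terms/N%}" target in the proposal could be measured is sketched below. This is a minimal illustration only: the term names are hypothetical placeholders, not entries from the real cloud-in4 filter, and `reduction_percent` is a helper invented here. It assumes each ACL exception counts as one named term, and compares the set before and after the cleanup work in (d).

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage by which the exception count dropped from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Hypothetical term names; NOT taken from the real cloud-in4 filter.
terms_before = {"example-dns", "example-nfs", "example-graphite", "example-wiki-api"}
terms_after = {"example-dns", "example-nfs", "example-wiki-api"}

# Terms that disappeared between the two snapshots.
removed = sorted(terms_before - terms_after)
pct = reduction_percent(len(terms_before), len(terms_after))

print("removed terms:", removed)
print("reduction: %.1f%%" % pct)
```

Counting named terms rather than individual addresses keeps the metric stable if (b) and (c) expand a single term into a more detailed address list.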
   

status updates from arturo

notes

  • Arzhel thinks NFS document should be reduced in number of options. Arturo agrees.
  • Faidon: Thanks for the audits of the cloud-in filter etc. More reviews to come.
  • Faidon: (a), (b) and (c) can happen in the remainder of Q2
  • NAT Wiki traffic -- is this more bad news for the community? This will restrict how they query the replicas, and could introduce limits on API calls, etc., that folks are using to mitigate the wiki replicas changes.
    • Birgitt: Ideally we don't overload community; needs a balance to encourage buy-in
    • Faidon: Intention isn't to rate limit, don't need to focus on it first if there's community impact concerns
  • Arzhel: IPv6 could solve some, some intelligent ordering would help
    • IPv6 would be last; IPv6 requires large design changes on the Kubernetes end, and Grid Engine CANNOT do it
  • Nicholas: Another potential Q3 goal is to look at Network Security Audit
    • Faidon: Network/Infra Security is in SRE. May have someone to help.
  • Nicholas: Q2 OKR concerns?
    • SRE KRs complete.
  • Arturo: If we can't provide a service natively within the cloud, how should we bridge them? (premade IP from VM reaching an IP from outside)
    • Brooke: Can this be done without exposing the private IP?
    • Brooke: For example, how to set up an OLAP view
  • Bridging; it's more than just network, see https://upload.wikimedia.org/wikipedia/labs/thumb/9/9c/NFS.png/1920px-NFS.png
  • Arturo: Can this interaction be generalized in some way?
  • Faidon: Mental model; think about it as similar to external provider wanting access to internal resources
  • Faidon: There should be clear lines of separation. Think about it as if there were no private backhaul to the internal network.
  • Arturo: OpenStack has the idea of provider services; provider hardware is co-located next to the cloud.
  • Brooke: Services that don't live in VMs, so not in a segregated space. Nothing we are doing seems like a multi-tenant network. Good to think like a VM cannot bridge, but they do exist. So we must think about it.
  • Brooke: In process of redesigning wiki replicas, so some of these questions are relevant today. We can't pretend it's completely external as it's not.
  • Faidon: Unclear if this is a special case at the moment. Provider network doesn't have to be expanded to everything in the network.
  • Arturo: Can we have cloud-dedicated VLANs on production hardware? Accessible by VMs?
  • Faidon: Yes, possible.
  • VLAN would be to host services that can't be hosted anywhere else. Inside Cloud first preference.
  • Faidon: We should be looking to reduce the number of crosses; the number of places things can cross
  • This tradeoff already exists as data is crossing.
  • Loki example; no ssh access from VMs, but needs to access them. Can't virtualize for reasons*
  • Faidon: Bare metal for users. OpenStack Ironic. One tenant managed by cloud services team. This could be a solution. Nothing about Loki requires it to be in production.
  • Arturo: How do we have something physical without being in production?
  • Brooke: Maybe Ironic is an option?
  • Faidon: Goal is to avoid bridging. Any option is open. More services shouldn't mean more exposure.
  • Faidon: Last time this happened, the cloud infra project existed. Make one tenant in cloud services to run all the ancillary services for the rest of the cloud.
  • Brooke: Cloudinfra has worked, but has limits. "Bridging the realms". 1) Networking side, needs to be clear and understood. 2) Data flows for data services. This one is harder.
  • Faidon: Data services could even be thought of separately from provider network.
  • Brooke: Production data somehow has to get to cloud guests.
  • Arturo: Do we fork production services?
  • Brooke: This would solve network bridging concerns, then it would only be about data flows
  • Faidon: Wouldn't object. But that just moves the boundary. They don't disappear, they just move.
  • Brooke: But then a cloud guest wouldn't be an attack vector for production anymore. Only for cloud.
  • Faidon: We're all in this together though. So don't want cloud to be owned either.

actions

  • Plan shared Q3 objective to reduce ACL exceptions