Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-09-30-checkin

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh

Agenda:

  • PoC Updates
  • Proposal Updates/Feedback

Dallas

Feedback:

  • Document is much clearer now :-)
  • Clarify the firewalling aspect. Having a perimeter firewall is a new thing for Cloud VPS
  • How could we prioritize security concerns of both SRE and WMCS? Can we close the dmz_cidr loophole this year?

Questions:

  • Can the existing technical debt be quantified?
    • Code in Neutron limits WMCS and prevents upgrades. We maintain our own patches. Applying a patch isn't a ton of work, but we have to verify that the patch still works, that nothing is broken, etc.
    • Has it caused outages? Routing has broken and services have broken in the past. Typically we find issues before releasing. Requires hand validation. ~1-2 weeks?
    • inability to have network isolation between tenants is a bigger longterm issue than the dmz patches.
    • current network topology is flat, upstream uses tenant networking
    • overall, maintenance, improvements, and security concerns with existing patches
  • NAT exceptions, dmz_cidr, we want to remove these right?
    • Yes, we want to deprecate overall. And the existing exceptions we want to get out of neutron
  • Option 3 doesn't move floating IPs; it moves the NAT for the pool of tenants between the world and prod?
    • Yes. Also firewalling. The only firewall today is the core router, so want to move firewall as well.
  • What kind of firewalling? What loads?
    • Core router policy firewall. Allow contacting supporting services (nfs, wikireplicas). Managing policy has proven difficult. Policies that exist in core routers protect prod from tenants. Stateless firewall today.
    • If firewall is protecting production, it can't move outside of production right?
    • WMCS can't block unwanted services from crossing
    • VM specific policy could be relocated
    • Until dmz_cidr is gone, core routers must retain firewalling policies (or at an extension of the prod network)
  • What's preventing us today from limiting network under dmz_cidr? Limiting subnets, limiting network, etc
    • We tried to do this specifically in the past, but it has implications for wiki communities. Rate limits. Unclear how this would impact communities. For NFS, we have to know which server
    • time and expertise?
    • Last time we tried, we had to roll back. Agreed it was a hard task.
      • But is a dependency a requirement? Could we work things in a different order?
      • Yes, it's possible. Doing this work, it unblocks other things. Freeing the NAT clarifies how to remove it.
  • Is it possible to delay reducing technical debt within WMCS without putting things at risk?
  • What timing concerns does SRE have with the current proposal?
    • Want to secure infra as much as possible, and do so under Frontline Defenses this year
      • Big network security issues from cloud into production for this reason
    • Work has been delayed for this larger effort. But the loophole needs to be closed this year
  • This NAT has been a longstanding issue, and needs to be resolved
  • Can we move forward on Arzhel's goals?
    • Yes. Using BGP, stop exposing core router. In
  • Can we move forward on option 3?
    • Still TBD
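
Illustration of the NAT exception discussed above. This is a minimal sketch, assuming dmz_cidr acts as a list of source/destination network pairs for which source NAT is skipped, so the destination sees the VM's own address. All addresses, the `DMZ_CIDR` list, `NAT_ADDRESS`, and the `egress_source` helper are hypothetical and not taken from the actual deployment or from Neutron's code.

```python
import ipaddress

# Hypothetical NAT exceptions as (source network, destination network) pairs.
# Assumption: traffic matching a pair leaves without source NAT.
DMZ_CIDR = [
    ("172.16.0.0/21", "10.0.0.0/8"),
]

# Hypothetical public address used for all other egress traffic.
NAT_ADDRESS = "198.51.100.1"


def egress_source(src: str, dst: str) -> str:
    """Return the source address the destination would see for this flow."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for src_net, dst_net in DMZ_CIDR:
        in_src = src_ip in ipaddress.ip_network(src_net)
        in_dst = dst_ip in ipaddress.ip_network(dst_net)
        if in_src and in_dst:
            # NAT exception: the VM address is exposed to the destination.
            return src
    # Default path: the flow is source-NATed to the shared address.
    return NAT_ADDRESS


if __name__ == "__main__":
    print(egress_source("172.16.1.5", "10.2.3.4"))
    print(egress_source("172.16.1.5", "203.0.113.7"))
```

Under this assumption, every entry in the exception list is a path on which tenant addresses reach the destination un-NATed, which is why the notes tie the core router firewall policies to the lifetime of dmz_cidr.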