Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-09-16-checkin

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh

Agenda:

  • Overview of design
  • Experiment status / results
  • Q and A

High Level Goals:

  • Ensure SRE / WMCS can address longstanding networking concerns
  • Have a shared design document to guide us through the rest of the fiscal year

High Level View

https://upload.wikimedia.org/wikipedia/labs/thumb/e/ea/Cloudgw_new_device.png/1000px-Cloudgw_new_device.png

Please take notes :-)

Experiments: A PoC is under way in Dallas on the cloud side. The network switch hardware isn't available there, so steps are being outlined for how the cloudsw hardware experiments will be done in eqiad.

Do we agree on high level goals?

  • Faidon:
    • Yes and no :-)
    • Appreciate the doc, the meeting today, the efforts happening
    • Challenges on the "why" - it seems to blend 2-3 projects, and it is not clear where the why for each of those starts (raised that before in convos with some of you)
    • Experiments happen in parallel, in different DCs, not clear yet how they benefit each other

Nicholas:

  • Yes, the proposal is meant to blend needs from both teams
  • Would you agree unified doc/vision is beneficial?

Faidon: Fully agree on unified vision.

  • Difficult to separate it from the "why" though

Nicholas:

  • The doc should reflect which path we took and why
  • Maybe we can separate out the "why"s better

Faidon:

  • Not clear what the added benefit is of having both cloudsw and cloudgw
    • Describe that better?

Faidon:

  • Long-term ask for L3 separation between realms
  • Decision making is better when using a framework; what decisions can we make here? What options do we have? What else was considered and not proposed?

Arturo:

  • This is only focused on edge network, not an architecture or reworking all supporting services
  • 2 pieces
    • What is cloud vps? Why use it over AWS?
    • Relationship with physical DC layout isn't conducive to a public cloud. While important, this is out of scope for this work :-)

Brooke:

  • Some aspects are still unclear at this point as we are unsure of the breadth of changes required
  • First part is figuring out neutron, then could add timelines

Nicholas:

  • This project isn't intended to address longer-term view of responsibilities over shared components. It should seek to remain neutral

Action Items:

  • Separate out concerns with clear pros/cons
  • Separate out future vs now
  • Add a timeline, why we are doing X and when we are doing it
  • Capture thoughts better
  • We are prevented from doing X, so we are doing Y
  • List standing issues (link tickets?), describe when and how it will be addressed if possible
  • Establish meeting cadence for feedback and review


Questions to discuss:

  • Why 2 L3 devices?
    • Mostly just shifting workload that exists today; the device is "hidden" today.
    • Team responsible is more comfortable with gateway device
    • Reduces risk on core routers; allows for experimentation, with clear separation of risk from production
  • What are the realm boundaries of responsibility between devices? (i.e., shared services like NFS aren't changing.) How does this grant more autonomy?
    • Not today, but today's model isn't sustainable. In the future, this autonomy will be sought.
  • Does this proposal address L3 separation between cloud and production?
    • Yes, this work is intended to solve this; the stages of the document outline the steps
  • Does the proposal make hard commitments?
    • No, but need to test. Major implications for how things interact.