Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2021-02-03-checkin

2021-02-03 WMCS network checkin


  • wiki replicas


cloudVPS NAT

  • CloudVPS NAT wiki changes: several moving parts
  • faidon: how can we help?
  • arturo: we need some help on the communications side, but Joaquin doesn't have time this quarter
  • faidon: try talking to each team's manager for coordination
  • nicholas: timeline needs to be extended
  • faidon: yes, ACK complexity
  • arzhel: what about introducing a window: perform the change for 1h, see what happens, and collect intel for a later final "date"?
  • faidon: ideally we don't need a green light from 5 teams; that sounds like too much. Faidon can handle part of the internal comms within the SRE sub teams
  • faidon: what about not dropping every exception at the same time, but progressively?
  • bstorm: bot accounts store IP addresses, how do we handle that?
  • arturo: we could drop requests per DC
  • faidon: All traffic should be running through eqiad
  • brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
  • faidon: let's try to break down the problem into smaller pieces
  • brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
  • nicholas: there are risks and concerns surrounding this whole project, perhaps we can introduce a task in the form of a blocker
  • How to do NAT pooling?
  • faidon: can we patch neutron?
  • arturo: we are moving away from patching
  • arzhel: ipv6 would help here
  • faidon: want to avoid tying this work to ipv6
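
A minimal sketch of the NAT pooling idea raised above, assuming a pool of 8 egress addresses and a simple hash of the instance address. The meeting did not settle on any mechanism and this is not how Neutron behaves; the pool addresses (taken from a documentation range) and the function name pick_nat_address are hypothetical.

```python
import hashlib
import ipaddress

# Hypothetical pool of 8 egress addresses (RFC 5737 documentation range).
NAT_POOL = [ipaddress.ip_address("198.51.100.%d" % i) for i in range(1, 9)]


def pick_nat_address(instance_ip, pool=NAT_POOL):
    """Map an internal instance address to one pool address.

    The mapping is deterministic: the same instance always egresses
    from the same address, so per-IP state kept by clients stays stable.
    """
    packed = ipaddress.ip_address(instance_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


if __name__ == "__main__":
    # Example instance addresses are assumptions, not real assignments.
    for vm in ("172.16.0.10", "172.16.0.11", "172.16.1.200"):
        print(vm, "->", pick_nat_address(vm))
```

Keeping the mapping deterministic means a given instance keeps the same source address, while the total traffic is spread over the pool rather than coming from a single IP.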

wiki replicas

  • brandon: Are we trying to get rid of cloud VLANs, or?
  • bstorm: The labs VLAN is meant to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
  • brandon: What other services will be LVS? Are there more VLANs coming?
  • arturo: Understand LVS to be part of solution for handling "public" traffic.
  • faidon: Why do wiki replicas today need to be in?
  • bstorm: no technical reason. Legacy, presumption?
  • faidon: access by anything besides NAT'd network?
  • bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
  • brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs, or do things need to stay in labs VLANs?
  • bstorm: should be possible to change. Account creation is done inside the production realm. No LVS required.
  • bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
  • arturo: Expectation is that wiki replicas are an exception, and future services will do something else
  • faidon: Should plan for LVS future. Understand migration and timelines
  • nicholas: once the old cluster is gone, what's blocking?
  • faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
  • faidon: if being used by cloud private IPs, don't renumber. Remove the use case, and then renumber to solve
  • arturo: very small machines, easily fixed
  • nicholas: perhaps by the end of the FY we can get rid of the old cluster
  • faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it