Talk:Wikimedia network guidelines

Latest comment: 1 year ago by Alexandros Kosiaris in topic Servers MUST use a single uplink physical network connection.

Cross DC traffic flows SHOULD be capped at 5Gbps

This might be a bit difficult to enforce; it depends a lot on software capability and the service in question. If such a service shows up, how do we plan to treat it? On a case-by-case basis? Upgrades? Revisiting the architecture/implementation? Alexandros Kosiaris (talk) 13:03, 2 September 2022 (UTC)

True, that's why it's a "should" and not a "must" :) It doesn't apply to most SREs, but it's useful to start the conversation with the few that might be more network heavy.
We've been fortunate so far that all network-heavy systems have some direct or indirect way (e.g. the number of parallel jobs) of controlling their bandwidth usage.
It is indeed case by case (we're already discussing this with some teams). We also get feedback during capacity planning to scale the network accordingly (with headroom for unexpected projects, up to a point). Ayounsi (talk) 18:01, 2 September 2022 (UTC)
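As a back-of-the-envelope illustration of the "number of parallel jobs" knob mentioned above, here is a hypothetical helper (the function name, per-job figures, and headroom fraction are assumptions for illustration, not anything deployed) that sizes a job pool against the 5 Gbps SHOULD cap:

```python
def max_parallel_jobs(cap_gbps: float, per_job_mbps: float, headroom: float = 0.8) -> int:
    """How many parallel jobs fit under a cross-DC cap, leaving headroom.

    cap_gbps:     the guideline cap (5 Gbps cross-DC per the guideline)
    per_job_mbps: measured average throughput of one job (an assumption)
    headroom:     fraction of the cap we allow ourselves to consume
    """
    usable_mbps = cap_gbps * 1000 * headroom
    return int(usable_mbps // per_job_mbps)

# Illustrative: jobs averaging 200 Mbps against the 5 Gbps cap at 80% headroom
print(max_parallel_jobs(5, 200))  # 4000 Mbps usable // 200 Mbps per job = 20
```

In practice the per-job figure would come from measurement, and the headroom fraction is exactly the kind of number the case-by-case conversation would settle.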

Cluster traffic exchanges within a DC SHOULD NOT exceed 30Gbps

Should we add monitoring for this? It might be very difficult to gauge the adoption rate and generated traffic patterns of a new cluster 2 years down the road. Alexandros Kosiaris (talk) 13:04, 2 September 2022 (UTC)

It's a good idea to monitor clusters' traffic (e.g. all servers' egress) and have projections. We do monitor infrastructure link usage.
2 years down the road is enough time for us to adapt/scale to organic growth. That guideline gives us something to keep an eye on, and a trigger to start a conversation if the trend gets close to it. As we refresh the network hardware, that number will increase. Ayounsi (talk) 18:09, 2 September 2022 (UTC)
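A minimal sketch of the kind of projection discussed above: a least-squares linear trend over weekly peak egress samples, estimating how long until a cluster approaches the 30 Gbps guideline. This is an illustrative helper under assumed inputs, not existing tooling:

```python
def weeks_until_threshold(samples_gbps: list[float], threshold_gbps: float = 30.0):
    """Fit a least-squares line over weekly peak egress samples and return
    the number of weeks from the last sample until the trend crosses the
    threshold; 0.0 if already at/above it, None if flat or decreasing."""
    n = len(samples_gbps)
    xs = range(n)
    sum_x, sum_y = sum(xs), sum(samples_gbps)
    sum_xy = sum(x * y for x, y in zip(xs, samples_gbps))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:          # fewer than 2 samples: no trend
        return None
    slope = (n * sum_xy - sum_x * sum_y) / denom
    if slope <= 0:          # flat or shrinking traffic: nothing to project
        return None
    intercept = (sum_y - slope * sum_x) / n
    fitted_now = intercept + slope * (n - 1)
    if fitted_now >= threshold_gbps:
        return 0.0
    return (threshold_gbps - fitted_now) / slope

# Illustrative: egress growing 2 Gbps/week from 10 Gbps
print(weeks_until_threshold([10, 12, 14, 16]))  # (30 - 16) / 2 = 7.0 weeks
```

Real traffic growth is rarely this linear, but even a crude projection like this is enough to flag a cluster for the conversation the guideline asks for.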

Should this be merely called "Network guidelines"?

All the other networking pages don't include "Wikimedia" in the title. This is possibly too nit-picky, in which case feel free to ignore. :) --BCornwall (talk) 12:58, 6 September 2022 (UTC)

Servers MUST use a single uplink physical network connection

I would like to ask that we reconsider "Not done in production for multiple reasons" as a blanket statement during the next major switch upgrade. I'll give some redundancy-focused counter-points to the rationale outlined in the article:

  • Switch cost (this would double the switch budget) - The person-hours spent preparing for and responding to link-related events may well justify an increase to the switch budget. That's not to say our switches are unstable, more that staff toil is expensive.
  • Low frequency of failure - I'd argue that link/cable/NIC failure frequency is in the same ballpark as disks and power; it does happen.
  • Higher setup complexity (more difficult troubleshooting, special cases to adapt in the server life cycle, more cabling) - It wouldn't be unmanageable: the host kernel and the switch can both report on the bond status (depending on the mode), and we could deploy monitoring/alerting that surfaces the majority of failure modes.
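The bond-status reporting mentioned in the last bullet can be sketched against the text the Linux bonding driver exposes at /proc/net/bonding/<iface>. The parser below follows that driver's output format; the sample text, interface names, and driver version are illustrative, and this is a sketch rather than deployed monitoring:

```python
def bond_link_status(bonding_proc_text: str) -> dict[str, str]:
    """Extract per-slave MII status from /proc/net/bonding/<iface> text.

    The bonding driver prints a bond-wide 'MII Status:' line first, then one
    per 'Slave Interface:' section; we keep only the per-slave ones."""
    status, current = {}, None
    for line in bonding_proc_text.splitlines():
        line = line.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            status[current] = line.split(":", 1)[1].strip()
    return status

# Illustrative sample of the driver's output (names/version made up)
SAMPLE = """\
Ethernet Channel Bonding Driver: v5.15.0
Bonding Mode: fault-tolerance (active-backup)
MII Status: up

Slave Interface: eno1
MII Status: up
Speed: 10000 Mbps

Slave Interface: eno2
MII Status: down
Speed: Unknown
"""

degraded = sorted(i for i, s in bond_link_status(SAMPLE).items() if s != "up")
print(degraded)  # a degraded-but-up bond: one slave link is down
```

An alerting check would read the real /proc file per bond and page (or just ticket) on any non-empty `degraded` list, which covers the "bond degraded but service still up" failure mode that makes troubleshooting harder.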

Perhaps it could become an option for hosts that meet certain criteria, as opposed to a blanket yes or no (to help mitigate the budget impact). Alert, Gerrit, Phabricator and Ganeti come to mind as services that can fall hard if networking is lost. Herron (talk) 17:04, 28 March 2023 (UTC)

I think the main argument against it is that a server failure is as likely as a switch failure, so we usually deploy services redundantly across multiple servers. Where that is not possible, it may be an option to do multi-chassis link aggregation, but that brings more cost and significant complexity on the network side. So it should be a very last resort, if there is truly no way to make the service resilient across multiple servers. Cathal Mooney (talk) 19:04, 28 March 2023 (UTC)
Since Ganeti was mentioned, let me just point out that the Ganeti clusters are structured to fit the current availability zone scheme. They only host VMs that are relevant to the rack row (and thus switch) they are in and VM placement is strictly in the same row (and relevant IP subnets). They are also structured to allow for quick failover of a Ganeti node dying and/or a switch member dying, mitigating effects such as link/cable/nic/disk/power/motherboard failures.
Services that are served from VMs on Ganeti nodes are still encouraged to follow the same principles as any other service and have instances across multiple availability zones (i.e. rack rows) to provide HA for that failure mode (switch upgrades, switch failures), or accept the inevitable downtime when it comes. That keeps things consistent with services not residing on Ganeti VMs and makes it easier to reason about the infrastructure. It also avoids adding complexity to the Ganeti infrastructure.
On a more generic note, embracing risk is what the SLO concept is about. Services setting targets, and figuring out what kind of investment needs to be made to meet them, is the logic we want to go with here. The lack of an SLO more or less means best-effort support, and that means accepting downtime and falling down hard in some cases. As far as I know, neither Ganeti, nor Gerrit, nor Phabricator has any kind of SLO (availability or otherwise) that would allow us to trigger conversations about how to achieve such targets. Alert is arguably the only special case here, but if reliability of the service can be achieved at the server level, it would provide better results than just adding NICs, as it would also protect from disk/power/motherboard/etc. failures. As such, it's probably a wiser move to invest in that direction. Alexandros Kosiaris (talk) 09:47, 29 March 2023 (UTC)
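To make the SLO framing above concrete, here is a hypothetical helper (the name and the 30-day window are assumptions for illustration) translating an availability target into a downtime budget, the kind of number that anchors the investment conversation:

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    availability_pct: the target, e.g. 99.9 for "three nines"
    window_days:      the SLO window (30 days assumed here)"""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

# Illustrative: a 99.9% monthly target leaves roughly 43 minutes of downtime,
# which is what a team would weigh against the cost of extra NICs or replicas.
print(round(downtime_budget_minutes(99.9), 1))
```

A service whose budget comfortably absorbs a switch upgrade has no case for dual uplinks; one whose budget doesn't is exactly the "certain criteria" host the comment above proposes.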