Wikimedia network guidelines

A collection of guidelines to follow for efficient use of our network, along with their exceptions.

Please reach out to SRE/Infrastructure Foundations if you need help designing a service or if something doesn't fit those guidelines.

Except for the special cases listed below, servers MUST use a single physical network uplink.

When access to more than one logical network (VLAN) is needed, those networks MUST be trunked, that is, all carried VLANs must be tagged when ingressing/egressing the physical network interface.
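
As a quick illustration, here is a Python sketch that lists a host's 802.1Q-tagged VLAN sub-interfaces and warns if they are spread over more than one physical uplink. It assumes a Linux host with the iproute2 `ip` tool; interface names and output details vary between versions, so treat it as a sketch rather than a tool.

  import re
  import subprocess

  def tagged_vlans() -> dict:
      """Return {vlan_interface: parent_uplink} for 802.1Q sub-interfaces."""
      out = subprocess.run(
          ["ip", "-d", "-o", "link", "show"],
          capture_output=True, text=True, check=True,
      ).stdout
      vlans = {}
      for line in out.splitlines():
          # Tagged sub-interfaces appear as "eno1.1017@eno1" together with a
          # "vlan protocol 802.1Q" detail on the same oneline (-o) output.
          m = re.search(r"^\d+:\s+(\S+?)@(\S+?):.*vlan protocol 802\.1Q", line)
          if m:
              vlans[m.group(1)] = m.group(2)
      return vlans

  if __name__ == "__main__":
      vlans = tagged_vlans()
      print(f"Tagged VLAN interfaces: {vlans}")
      uplinks = set(vlans.values())
      if len(uplinks) > 1:
          print(f"WARNING: VLANs spread over several physical uplinks: {uplinks}")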

Redundancy

When configured in an active/passive mode, if any element of the path (switch, switch interface, cable, or server interface) fails, the alternate link takes over.

Not done in production for multiple reasons:

  • Switch cost (this would double switch budget)
  • Low frequency of failure
    • It's more likely that a server fails or needs to be restarted than that a cable or interface fails, e.g. even primary DBs use a single uplink
  • Higher setup complexity (more difficult troubleshooting, special cases to adapt in server life cycle, more cabling)

One exception is the Fundraising Infrastructure, which is both critical (negating the low-frequency-of-failure argument) and has a tiny footprint of 2 racks (mitigating the cost argument), allowing it to have 2 Top of the Rack (ToR) switches. The complexity downside applies here too (see task T268802).
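
For the rare bonded setups like these, a quick way to see which link is currently active is to read the bonding driver's procfs status. A minimal Python sketch, assuming a Linux active-backup bond with the default name bond0 (both assumptions; adjust to the actual host configuration):

  from pathlib import Path

  def bond_summary(bond: str = "bond0") -> dict:
      """Parse /proc/net/bonding/<bond> into a flat key/value summary.

      Later per-slave sections overwrite duplicated keys (e.g. "MII Status"),
      which is fine for a quick glance at the overall state.
      """
      info = {}
      for line in Path(f"/proc/net/bonding/{bond}").read_text().splitlines():
          if ":" in line:
              key, _, value = line.partition(":")
              info[key.strip()] = value.strip()
      return info

  if __name__ == "__main__":
      summary = bond_summary()
      print("Mode:       ", summary.get("Bonding Mode"))
      print("Active link:", summary.get("Currently Active Slave"))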

Capacity

When configured in an active/active mode, 2x10G links provide more bandwidth than a single 10G link.

Not done in production for multiple reasons:

  • Switch cost (this would double switch budget)
  • No current need: services rarely require more than 10G per server, though this could change in the future.
    • New Top of the Rack (ToR) switches support up to 25G uplinks (with 40/50/100G for exceptional cases)
  • SPOF risk. If a service can scale horizontally (that is, by adding nodes), it is definitely better than scaling vertically (increasing node size).
  • Potential for usage of more than 50% of total capacity: e.g. if a server pushes more than 10G through 2 NICs and one link fails, the surviving link gets saturated (see the sketch after this list).
  • Backbone capacity: while we're upgrading backbone links and gear, significantly increasing capacity for a few servers could cause congestion in legacy parts of the infrastructure
  • Higher setup complexity (more difficult troubleshooting, special cases to adapt in server life cycle, more cabling)
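
To make the >50% capacity point concrete, here is a tiny back-of-the-envelope Python sketch (illustrative numbers only): any steady usage above the capacity of a single link cannot survive the loss of one of the two links.

  LINK_GBPS = 10   # capacity of one uplink
  LINKS = 2        # active/active pair

  for usage_gbps in (8, 12, 18):
      # After one link fails, only (LINKS - 1) links remain.
      fits = usage_gbps <= LINK_GBPS * (LINKS - 1)
      verdict = "fits on the surviving link" if fits else "saturates the surviving link"
      print(f"{usage_gbps}G steady usage on {LINKS}x{LINK_GBPS}G: {verdict}")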

Required L2 adjacency to physically distant networks

LVS - This use case will go away with the future L4LB project.

Failure domains

Services MUST tolerate the failure of an entire failure domain

This is achieved by spreading servers across multiple failure domains (in other words, by not putting all our eggs/servers in the same basket/failure domain).

In networking (and in our network) multiple elements are considered failure domains:

Virtual Chassis

Our legacy network design includes the usage of the Juniper Virtual Chassis (VC) technology. While this eases management, there is an increased risk of all members of a VC failing together, through a bug, misconfiguration or maintenance. Failure of a single VC member is possible and has happened, but conceptually speaking, it is better to treat the entire VC as a failure domain when designing services.

You can find the list of active VCs on Netbox. For production, they're the A/B/C/D rows in eqiad/codfw as well as the esams/ulsfo/eqsin switches.

The new network design doesn't have this constraint; for example, racks F1 and F2 are distinct failure domains.

L2 domains

There is little control that can be applied to traffic transiting an L2 domain: any misbehaving device will impact all the other servers, a cabling issue can flood the network, and efficient scaling is not possible.

For production, the legacy network design conveniently fits the VC model (L2 domains are stretched across the VC, e.g. private1-a-eqiad).

The new network design restricts the L2 domains to each of the Top of the Rack (ToR) switches.

For example, a service with 3 servers MUST NOT have those servers in eqiad racks A1 A2 A3, but rather in A1 B1 C1, or E1 E2 E3.
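
A placement sanity check along these lines could look like the Python sketch below. The rack-to-failure-domain mapping is a simplified assumption (legacy rows A-D count as one domain per row/VC, new-design racks as one domain per ToR); the authoritative data lives in Netbox.

  def failure_domain(rack: str) -> str:
      """Map an eqiad-style rack name ('A3', 'E2', ...) to a failure domain."""
      row = rack[0].upper()
      # Legacy rows A-D: the whole row / Virtual Chassis is the failure domain.
      # New-design rows: each rack (ToR switch) is its own failure domain.
      return f"row-{row}" if row in "ABCD" else f"rack-{rack.upper()}"

  def well_spread(racks: list[str]) -> bool:
      """True if no two servers share a failure domain."""
      domains = [failure_domain(r) for r in racks]
      return len(set(domains)) == len(domains)

  print(well_spread(["A1", "A2", "A3"]))  # False: all in the row A domain
  print(well_spread(["A1", "B1", "C1"]))  # True: three distinct rows
  print(well_spread(["E1", "E2", "E3"]))  # True: one domain per new-design rack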

WMCS

In the cloud realm, the cloud-instances VLAN is stretched across multiple switches.

Ganeti clusters

Ganeti clusters follow the L2 domains; each cluster is thus a matching failure domain.

Core DCs

At a higher level, eqiad and codfw are disaster recovery pairs. Critical services should be able to perform their duty from the secondary site if the primary one becomes unreachable. See Switch Datacenter.

Public IPs

Except for special cases, servers MUST use private IPs

Services requiring public Internet connectivity can be deployed in several ways. The most straightforward, deploying hosts to a public VLAN with a public IP assigned directly to their primary interface, is discouraged for several reasons:

  • There is no load-balancing / redundancy built in when reaching the IP
  • They are directly exposed to the Internet, and thus have fewer safeguards if a misconfiguration or bug is introduced to their firewall rules or host services
  • IPv4 space is scarce, and pre-allocating large public subnets to VLANs is difficult to do without causing unnecessary IP waste.

Where services need to be made available to the Internet, they should ideally sit behind a load-balancer, or expose the service IP with another technique (BGP etc.). Where hosts need outbound web access, they should use our HTTP proxies where possible.
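
For outbound access through the proxies, a minimal Python sketch using the `requests` library is shown below. The proxy URL is a placeholder; check the HTTP proxy documentation for the actual host and port to use at each site.

  import requests

  # Placeholder proxy endpoint -- not the real production value.
  PROXY = "http://webproxy.example.wmnet:8080"

  response = requests.get(
      "https://example.org/api/data",      # hypothetical external endpoint
      proxies={"http": PROXY, "https": PROXY},
      timeout=10,
  )
  print(response.status_code)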

Public VLANs should be used only if there is no other option (for example if a service cannot sit behind a load-balancer, or needs external access that cannot be achieved any other way). The diagram below can help figure out if we're indeed in such a special case.

Additionally we should strive to migrate services away from public VLANs if the requirement or dependency is not valid anymore, or can be satisfied in a different way.

Flow chart to help a user pick an appropriate IP type for production.

IPv6

Except for special cases, servers MUST be dual-stacked (have both an IPv4 and an IPv6 address on their primary interface)

This aligns with the longer-term goal of deprecating IPv4 and eventually only having one protocol to configure.
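
A quick way to confirm a hostname is dual-stacked is to check that it resolves to both address families, for example with the Python sketch below (the hostname is a placeholder):

  import socket

  HOST = "host.example.wmnet"  # placeholder hostname

  families = {info[0] for info in socket.getaddrinfo(HOST, None)}
  print("Has IPv4 (A):   ", socket.AF_INET in families)
  print("Has IPv6 (AAAA):", socket.AF_INET6 in families)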

Congestion

Cross DC traffic flows SHOULD be capped at 5Gbps

Cluster traffic exchanges within a DC SHOULD NOT exceed 30Gbps

The network is a shared resource; while we're working on increasing backbone capacity (hardware/links) and safeguards (QoS), we all need to be careful about large data transfers.

If planning a service that is expected to consume a lot of bandwidth, please discuss it with Netops to ensure optimal placement and configuration of the network. It is extremely important that we don't introduce new services which may end up negatively impacting the overall network or existing applications.
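
When sizing such transfers, a small back-of-the-envelope helper like the Python sketch below (illustrative only) shows how long a bulk cross-DC copy takes when throttled to stay within the 5 Gbps guideline:

  def transfer_hours(data_tb: float, rate_gbps: float) -> float:
      """Time to move data_tb terabytes at rate_gbps gigabits per second."""
      bits = data_tb * 8e12            # 1 TB = 8 * 10^12 bits (decimal units)
      seconds = bits / (rate_gbps * 1e9)
      return seconds / 3600

  # e.g. a 50 TB dataset copied at the 5 Gbps cross-DC cap:
  print(f"{transfer_hours(50, 5):.1f} hours")  # ~22.2 hours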

See also

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

About multiple server uplinks: https://blog.ipspace.net/2023/05/failure-detection-server-dual-homing.html