SRE/business case/Network - 4th transit for drmrs

From Wikitech

1. Executive Summary

The mission of Wikimedia’s POPs is to reduce latency for users across the globe, as lower latency correlates with better user experience and, in extreme cases, determines whether users can load Wikipedia at all. In other words, users rely on us to keep latency low both in normal times and in exceptional times (outages, maintenance, etc.).

In normal times, we’re all set: drmrs has 3 transits and 2 peering ports, which is slightly above the standard of our other POPs and gives us a theoretical capacity of 50 Gbps.

However, about 50% of the user traffic toward Wikimedia (~16 Gbps, see below) goes through esams, which is more than several of our other sites combined. To handle this load, esams relies on 6 transit links (more than needed, but all are donations) and 1 peering presence (at one of the largest exchange points in the world), totalling 90 Gbps of theoretical capacity.

2. Business Problem

We performed a large scale test last quarter (see drmrs: initial geodns configuration) which revealed a limitation in terms of transit capacity for drmrs. We determined that traffic shifting is needed to prevent one of the links from saturating; a saturated link would force us to redirect users from the EMEA region to eqiad when esams is offline. Such a redirect of EMEA users across the Atlantic would result in a terrible user experience, especially for users with low quality internet access.

3. Problem Analysis

Our monitoring has a 5-minute granularity, so a traffic spike lasting less than 5 minutes is averaged out over that period and the graphs do not show the real link usage. We also need to keep in mind that down on the cable, a link is at any given instant either idle (0%) or transmitting at line rate (100%); a utilization figure is only an average over time. In other words, users will see problems well before a link shows 100% usage in our monitoring. This is why we have paging alerts configured for when a link reaches 80% usage. To add to it, esams is also the POP most targeted by DDoS attacks, against which one of the best defenses is extra capacity headroom.
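To illustrate how averaging hides short bursts, here is a minimal sketch. The link size, burst length and background load are hypothetical values chosen for the example; only the 5-minute window and the 80% paging threshold come from the text above.

```python
def average_utilization(samples_gbps, capacity_gbps):
    """Mean utilization (0..1) over a window of per-second throughput samples."""
    return sum(samples_gbps) / (len(samples_gbps) * capacity_gbps)


CAPACITY_GBPS = 10.0     # hypothetical link size
WINDOW_SECONDS = 300     # 5-minute monitoring granularity
BURST_SECONDS = 60       # hypothetical burst at line rate
BASELINE_GBPS = 4.0      # hypothetical background load
PAGING_THRESHOLD = 0.80  # paging alert level

# One sample per second: a 60 s burst at line rate, then the baseline load.
samples = ([CAPACITY_GBPS] * BURST_SECONDS
           + [BASELINE_GBPS] * (WINDOW_SECONDS - BURST_SECONDS))

reported = average_utilization(samples, CAPACITY_GBPS)
peak = max(samples) / CAPACITY_GBPS

print(f"reported 5-min average: {reported:.0%}, real peak: {peak:.0%}")
print("would page:", reported >= PAGING_THRESHOLD)
```

With these assumed numbers the link is saturated for a full minute, yet the 5-minute data point stays well under the paging threshold.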

The 2nd key point to keep in mind when doing capacity planning is that traffic will never use all links equally. Multiple factors out of our control come into play, the largest one being how well a given transit is connected to (users through) ISPs. Furthermore, those factors change constantly as agreements between providers change and hardware comes (new links) and goes (hardware failures). We do have some control to steer traffic toward one link or another, but it comes with 3 main downsides: (1) the forced path will be sub-optimal in terms of latency; (2) the forced path configured for a specific event might not be relevant the next time a similar event happens; (3) the forced path requires human configuration and quickly becomes toil (a cat-and-mouse game).

The last key point is that the more transit or peering providers we’re connected to, the more redundancy we get (to a lesser extent for peering, as peers have less reach: they can’t reach the whole Internet). Even without a perfect traffic distribution, the impact of 1 provider out of 4 misbehaving will theoretically be smaller than that of 1 out of 3. In a similar vein, even without influencing traffic, more providers mean more possibilities of shorter paths between us and the users (thus better latency).
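The "1 out of 4 versus 1 out of 3" argument can be sketched with simple arithmetic. This assumes a perfectly even traffic split, which, as noted above, never happens in practice; the total load is a hypothetical figure.

```python
def failure_impact(total_gbps, providers):
    """Return (traffic displaced, load per surviving link) when one of
    `providers` equally loaded links fails. Assumes an even split."""
    displaced = total_gbps / providers
    per_survivor = total_gbps / (providers - 1)
    return displaced, per_survivor


TOTAL_GBPS = 24.0  # hypothetical combined load, for illustration only

for n in (3, 4):
    displaced, per_survivor = failure_impact(TOTAL_GBPS, n)
    print(f"{n} providers: {displaced:.1f} Gbps displaced, "
          f"{per_survivor:.1f} Gbps on each surviving link")
```

Under these assumptions, a failure displaces a third of the traffic with 3 providers but only a quarter with 4, and each surviving link ends up carrying less.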

One of drmrs’ goals is to be able to handle all of the esams + drmrs load. No users from the EMEA region should be redirected to eqiad when esams is offline, whatever the reason and even for long periods of time. Redirecting EMEA users across the Atlantic would result in a terrible user experience, especially for users with low quality internet access. We performed a large scale test last quarter (see drmrs: initial geodns configuration) which revealed a limitation in terms of transit capacity for drmrs. Traffic shifting was needed to prevent one of the links from saturating, and even then we could not afford to lose one of our 3 transit links, which puts us in a precarious situation. As seen previously, traffic shifting is not a sustainable solution.

4. Current Technology/Solution

5. Available Options

5.1 Option 1 - elevate IXPs usage

5.1.1 Description

Increase our IXP usage: peering is usually our most stable link but also the least used one, as its reach is smaller. In concrete terms, this means sending peering request emails to other peers present at the same exchange and then, after a positive reply from their side, manually configuring the BGP sessions. In the current state of our infrastructure this is a lot of manual toil, and the benefits are limited by the number and types of peers (we especially seek ISPs). Until we have proper peering automation, the benefit-to-cost (engineering time) ratio is not favourable.

5.2 Option 2 - add a 4th transit provider

5.2.1 Description

Thanks to DCops, adding new transit providers is a streamlined process, which reduces the time and engineering time required for a rollout and helps keep the implementation cost low. As for the monthly recurring cost, Europe is one of the cheapest regions in terms of bandwidth. Going with this option will ensure that drmrs can handle all of EMEA’s load, reduce alert fatigue (less risk of a link saturating), eliminate toil (no traffic engineering), and improve user experience by increasing redundancy and lowering latency.

Other than the estimated cost of 12,000 USD for the first year (with a decreasing cost each following year), this option has no downsides. It requires the involvement of the DCops team.

The standard process will apply: provider selection will be done during Q1, for a Q2 implementation. This process could be delayed by a quarter if a legal review is needed, which usually only applies to new providers. The required characteristics of the new link are: 10 Gbps fiber, 1 Gbps burstable commit, 12-month term, provider present in the same facility. Possible candidates are TI, GTT, NTT, Telefonica and Lumen, but this list is not exhaustive.

Once implemented, we will perform a depool of esams to verify that no link saturates, followed by a preventive deactivation of the busiest link to make sure no saturation happens in that case either. If the test is not successful, several options are possible, from traffic engineering to changing transit providers (older or newer ones).
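The verification described above can be approximated on paper before the real test. The sketch below removes the busiest link, spreads its traffic over the remaining links in proportion to their current load, and lists the links that would exceed the 80% paging threshold. The proportional redistribution, the link names and the per-link loads are all assumptions made for illustration; real traffic redistributes according to routing decisions outside our control.

```python
PAGING_THRESHOLD = 0.80  # paging alert level from the problem analysis


def drop_busiest_link(loads_gbps):
    """Remove the busiest link and spread its traffic over the remaining
    links in proportion to their current load (a simplifying assumption).
    Expects at least one remaining link carrying traffic."""
    busiest = max(loads_gbps, key=loads_gbps.get)
    displaced = loads_gbps[busiest]
    remaining = {n: l for n, l in loads_gbps.items() if n != busiest}
    total_remaining = sum(remaining.values())
    return {n: l + displaced * l / total_remaining
            for n, l in remaining.items()}


def links_over_threshold(loads_gbps, capacity_gbps,
                         threshold=PAGING_THRESHOLD):
    """Names of links whose utilization exceeds the threshold."""
    return sorted(n for n, l in loads_gbps.items()
                  if l / capacity_gbps > threshold)


CAPACITY_GBPS = 10.0  # link size from the required characteristics

# Hypothetical per-link loads (Gbps) carrying the same 15 Gbps in total.
three_links = {"transit-a": 6.0, "transit-b": 5.0, "transit-c": 4.0}
four_links = {"transit-a": 6.0, "transit-b": 4.0,
              "transit-c": 3.0, "transit-d": 2.0}

for label, loads in (("3 transits", three_links), ("4 transits", four_links)):
    after = drop_busiest_link(loads)
    hot = links_over_threshold(after, CAPACITY_GBPS)
    print(f"{label}: links over threshold after failure: {hot}")
```

With these example loads, losing the busiest of 3 transits pushes another link past the paging threshold, while the same loss with 4 transits does not.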

5.2.3 Costs and Funding Plan

5.2.4 Risks

5.2.5 Issues

6. Recommended Option

7. Implementation Approach

7.1 Dependencies

7.2 Rough estimates