Quality of Service (Network)

Network QoS

Quality of Service is a term in networking which refers to a number of techniques that can be used to profile and, in some circumstances, prioritize particular traffic flows over others.

As communications systems moved to packet switching, and ultimately to all-IP networks carrying multiple types of traffic that at one stage would each have had their own dedicated physical infrastructure (i.e. voice, video, storage, control), QoS has become an important consideration for network operators.

QoS functions in network devices define how packets should be queued in internal buffer structures, and scheduled for transmission on the wire. It is important to note that for the most part QoS rules do nothing. They only get used when a device or link is under strain, and lacks sufficient bandwidth for all the data that wants to use it. A good way to think about QoS configuration is telling routers "what traffic to drop" when it *has* to drop something. Viewed in this way it's clear that the better solution is to build and operate networks that don't drop any packets. The simpler (and usually cheaper) option to deal with congestion is to provision more bandwidth.

Configuring QoS still makes sense, however, to deal with exceptional circumstances that might arise from time to time. That could be due to irregular traffic flows (sudden changes in application behavior, fault scenarios that reduce capacity, or potentially even denial-of-service attacks). In such scenarios where packet loss cannot be avoided, it helps if the network can make intelligent decisions about what traffic is least important.

QoS in Wikimedia

Traditionally, WMF network devices have had no specific QoS configuration applied. All traffic is considered "best effort", and during congestion all traffic flows are equally liable to suffer drops. By and large this has worked well (TCP and other higher-layer protocols help to balance flows). One element that has helped is the relatively low speed that servers are connected at (typically 1G), which acts as a natural limit on how much traffic a single (greedy/malfunctioning) server can send. As we are now connecting more servers at 10G and even 25G, there is increased potential for a handful of servers to generate traffic flows that swamp the core network.

As discussed, the best way to accommodate such flows is to make sure we have sufficient bandwidth throughout the network. But there is something of a chicken-and-egg element here. It doesn't make sense to deploy a lot of additional bandwidth just in case applications come along that require it. Likewise it represents a risk to deploy a lot of high-bandwidth servers, with the potential to generate a lot of traffic, knowing that there are bottlenecks and high contention in certain areas of the network.

To address this gap SRE is rolling out QoS configuration to our network devices. The goal is to allow us to connect servers at higher speeds, supporting continued growth and consolidation of compute and storage, while remaining confident that mission-critical services won't be starved of bandwidth.

QoS Classes

The first requirement when implementing a QoS framework is deciding how many traffic classes should be created. Obviously the more classes one has, the more finely-grained the policies can be, but this comes at the cost of complexity. Netops are of the opinion that a relatively simple approach, with just a few classes, is the best option.

As such the following classes will be defined on network devices:

Class name            DSCP Marking   DSCP Decimal   Scheduling BW%   Description
Management & Control  CS6            48             5%               Network control traffic (i.e. routing protocols), and critical management services.
High                  AF21           18             35%              High priority traffic
Normal                DE             0              50%              Default priority - same as existing single class
Low                   AF41           34             10%              Low priority "scavenger" class
  • Code-point AF11 is reserved for possible future use, if a 'higher than high' traffic priority is ever deemed necessary.
  • The codepoint names above are from the Diffserv standard. DE is also commonly referred to as CS0, BE (Best Effort) and DF (DeFault). This table is a good reference to the various possible markings.
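The decimal values in the table follow directly from the Diffserv code-point arithmetic (CSx = 8x, AFxy = 8x + 2y). A quick illustrative sketch in Python cross-checking them:

```python
# Diffserv code-point arithmetic (RFC 2474 / RFC 2597):
# class selectors are CSx = 8*x, assured forwarding is AFxy = 8*x + 2*y.
def cs(x: int) -> int:
    return 8 * x

def af(x: int, y: int) -> int:
    return 8 * x + 2 * y

# Cross-check the decimal column of the class table above.
classes = {
    "Management & Control": cs(6),     # CS6    -> 48
    "High":                 af(2, 1),  # AF21   -> 18
    "Normal":               0,         # DE/CS0 -> 0
    "Low":                  af(4, 1),  # AF41   -> 34
}

for name, dscp in classes.items():
    # The DSCP occupies the upper six bits of the IP TOS / Traffic Class byte.
    print(f"{name}: DSCP {dscp}, TOS byte 0x{dscp << 2:02x}")
```

For example CS6 yields a TOS byte of 0xc0, which is the value typically seen in packet captures of routing-protocol traffic.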

Scheduling Bandwidth

The "scheduling bandwidth" represents the minimum percentage of available link bandwidth that will be dedicated to that class when a link is under saturation. In our setup the "high priority" queue will get 35% of available bandwidth in such a scenario, despite the fact that only a small minority of all application flows will be mapped to it. The majority of our traffic will remain classified as "normal", and contend for the 50% of bandwidth available to it. Finally the "low" priority class gets the remaining 10%, to keep some data flowing within it, but it will suffer most due to the congestion.

All classes will be served by a weighted round-robin scheduler based on their defined scheduling bandwidth. No "expedited" (priority/strict) class is defined, meaning no queue is configured to be served immediately whenever packets arrive on it. Such priority queuing is commonly used for real-time voice and video applications, where the absolute lowest latency and jitter are required. While we may have high-priority flows, they are data flows rather than real-time communications, so standard, non-expedited queuing is preferable.

It is also worth noting that the percentages simply reflect the scheduling priority. When links are not saturated, any class, including 'low', can use 100% of the available bandwidth.
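To make the percentages concrete, here is a small sketch (assuming a hypothetical 10 Gbit/s link) of the minimum bandwidth each class is guaranteed under saturation:

```python
# Minimum guaranteed bandwidth per class on a saturated link, from the
# scheduling percentages above. The 10 Gbit/s link speed is an illustrative
# assumption, not a statement about any particular WMF link.
LINK_MBPS = 10_000

SCHED_PCT = {"mgmt": 5, "high": 35, "normal": 50, "low": 10}
assert sum(SCHED_PCT.values()) == 100  # the four classes divide the full link

guaranteed = {cls: LINK_MBPS * pct // 100 for cls, pct in SCHED_PCT.items()}
print(guaranteed)  # {'mgmt': 500, 'high': 3500, 'normal': 5000, 'low': 1000}
```

Remember these are floors, not ceilings: on an uncongested link any single class, including 'low', can burst to the full line rate.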

DSCP Marking

Trusted vs Untrusted Interfaces

Any QoS design is a network-wide undertaking. A key concept involved is the idea of "trusted" and "untrusted" interfaces.

The basic idea is that where traffic arrives from an external source you can't "trust" the TOS/DSCP marking in the IP header. On these interfaces you need to:

  • Map traffic to forwarding classes based on some criteria other than the DSCP bits in the header.
  • Rewrite the DSCP bits in the header to the values used elsewhere on the network to represent that traffic class.

In the Wikimedia setup, external internet-facing interfaces are clearly "untrusted" by that definition. Server-facing interfaces on our switches are, on the other hand, considered "trusted", as we control and set DSCP bits on egress from our servers using netfilter. Extending the concept slightly, we don't "trust" any DSCP bits that third-party software might set "out of the box" on our servers. Unless we explicitly set the DSCP in a given packet to one of our defined codepoints, we therefore mark packets leaving our servers as DE, mapping them into the 'normal priority' forwarding class on the network.

DSCP Marking

The plan in Wikimedia is to set DSCP values on end servers, leveraging our existing iptables (Ferm) / nftables configuration frameworks. The network devices will be configured to trust the incoming values set on servers, and queue packets accordingly. Puppet will be used to drive the end-host configuration for packet marking.

Various schemes have been proposed for the use of the TOS/DSCP fields over the years, but ultimately there is no universal standard, and these markings are generally ignored or rewritten across the internet. This means they only have significance internally within an organization, and merely serve to identify traffic classes based on local policy. As such any set of markings works as well as any other, as long as all devices are configured consistently. Adherence to any particular scheme is not required. For the most part the code points defined in RFC2597 are used here, but the categories they represent are those we define internally, and don't necessarily correspond to how they are defined in the RFC or by any specific network vendor.

Traffic Classes

As shown in the table above, four classes of traffic are defined, which are detailed further below.

Management & Control

This class is used for management and control plane traffic. It is vital, in the presence of congestion, that such traffic is prioritized to ensure that devices remain reachable via SSH, monitoring continues to work, and router to router control plane (i.e. OSPF, BGP etc.) traffic is served. This ensures that engineers and other systems can continue to work in such a situation, allowing the root cause to be identified and addressed.

High Priority

This class will be used for high-priority application flows as required. It has less scheduling bandwidth than the 'normal' class, but much fewer traffic flows are expected to be mapped into it, giving them a relatively higher weighting. Exactly what traffic should be mapped into it needs to be carefully considered, and discussed with the SRE teams responsible for the relevant applications. Typically only low-throughput, sensitive traffic flows should be mapped to this class. High-throughput bulk data transfers should not be mapped to this class.

While it might look attractive for any given flow to be declared 'high priority', it is easy to negate the usefulness of the category if too many things are mapped to it (i.e. if everything is important, nothing is).

Normal Priority

This is the standard class into which all normal application flows are mapped. It can be thought of as the equivalent of our existing, single traffic class across the network. With the possible exception of some management/control traffic, the base server configuration will map all traffic into this class.

Low Priority

This is a "scavenger" class that can be used for flows with below-normal priority. Similar to the 'high priority' class, we need to carefully consider what should be mapped into it. Unlike the 'high priority' class, there is no real danger (on the network side) of marking too much traffic as low priority, so we can be a little more trusting of packets marked like this.

Teams are as unlikely to declare their applications low priority as they are likely to declare them high priority, but the class serves an important function. A good example of what to place in it is bulk data transfers, such as backups or storage replication traffic. Such traffic will often use "as much bandwidth as it can get". We can map it into the low priority class to prevent it from starving other, more urgent flows of bandwidth. But we can confidently connect hosts running such services on high-speed links; they will benefit from being able to use all the spare bandwidth the network can provide.

This class may also be useful for lowering the priority of traffic serving certain requests, if we can mark those appropriately on the hosts. This may be useful to de-pref scrapers without blocking them entirely or setting any particular rate-limit.

Puppet

To enable us to place traffic into QoS classes, the existing puppet resources firewall::service and firewall::client have been extended to allow a QoS classification to be added. In both cases a new optional parameter, 'qos', is available. This can be set to low/normal/high as needed to mark matching packets with the correct DSCP values, so the network will place them in the right forwarding-class.

For example if we wished to place redis replication traffic in the low priority / scavenger class we could add the qos parameter to the existing firewall::service definition:

   firewall::service { 'redis_replication':
       proto  => 'tcp',
       port   => $redis_port,
       srange => $redis_replicas,
       qos    => 'low',
   }

We need to be mindful, however, that the service definition only applies to the machine running the service. Thus the above addition would only map the reply traffic that server sends from $redis_port to clients into the low priority class. For correct operation we should ensure both sides of a conversation map to the same class, so we would also need a firewall::client resource added on hosts which make outbound connections to $redis_port. For instance:

   firewall::client { 'redis_replication':
       proto       => 'tcp',
       port        => $redis_port,
       drange      => $redis_replicas,
       qos         => 'low',
   }

It should be noted that many firewall::service definitions already exist in puppet. These are needed as our default firewall policy for packets in the INPUT chain is DROP. So to allow traffic in for any service (redis, ssh, http etc) we need to explicitly permit it in a firewall::service definition. Because our default policy for the OUTPUT chain is ACCEPT, the same is not typically true for clients making requests.

This means that while it should be very easy to add the 'qos' parameter to existing firewall::service definitions, we also need to add completely new firewall::client resource definitions in any roles that need them, to ensure we mark traffic on both sides of the connection.

More Complex Configurations

In some cases the above simple model won't be sufficient to classify a particular type of traffic, for instance if we need to match on more criteria than just the UDP/TCP ports of a service. In these cases we can add a generic `nftables::file` resource with the specific nftables rules, for example:

   nftables::file { 'scp_sftp_low':
       ensure  => present,
       chain   => 'postrouting',
       content => file('profile/firewall/scp_sftp_low.nft'),
   }

We would then define rules as needed in file `modules/profile/files/firewall/scp_sftp_low.nft`:

   # Managed by puppet
   # Match ssh packets marked with TOS 0x02 by openssh (from scp/sftp subsystem) and remark as af41 (low priority)
   tcp sport 22 ip dscp 0x02 ip dscp set af41 return
   tcp sport 22 ip6 dscp 0x02 ip6 dscp set af41 return
   tcp dport 22 ip dscp 0x02 ip dscp set af41 return
   tcp dport 22 ip6 dscp 0x02 ip6 dscp set af41 return


If a given system is not yet migrated to nftables we can add similar rules using the `ferm::rule` resource:

   ferm::rule { 'dscp-icmp-mon':
       table => 'mangle',
       chain => 'POSTROUTING',
       rule  => 'proto tcp sport ssh mod dscp dscp 0x02 DSCP set-dscp-class AF41; RETURN;'
   }

Again netops are available to work with teams on creating the most appropriate rules for a given service.

Guidelines

In general the 'high' priority class should be used for low-bandwidth, latency sensitive, important traffic. So for instance keepalive traffic, or traffic that is essential to monitoring the status of a cluster of nodes.

High bandwidth, bulk traffic, such as file transfers, backups, bulk data sync etc. is not suitable for being marked as 'high priority'. This traffic is indeed important, but it is too voluminous to place into the high priority queue should we run into congestion. It needs to be remembered that the goal is never to hit congestion, and thus for these rules not to matter outside exceptional circumstances. In those circumstances we want to "keep the lights on", prioritize our own control traffic and a small amount of high priority application flows, but inevitably things break when the network is congested and dropping packets (with or without QoS).

Juniper Class-of-Service

QoS is implemented using the "class of service" configuration in JunOS (a slightly older industry term for the same thing). If and when other vendors are introduced, a mechanism to integrate them will need to be found. The relatively simple design should help extend it to other vendors if they also support QoS functionality.

The Juniper CoS framework, like most implementations, can be complex, involving multiple related configuration elements. Within that context we have, however, done as much as possible to keep the configuration simple and understandable.

The overall Juniper framework is outlined below.

Queues and Forwarding-Classes

At the base level the hardware places packets in memory (buffer) structures as they arrive into a system. These structures are dynamically partitioned into multiple numbered queues, and packets are picked from these queues to be transmitted, or dropped, based on a scheduling configuration.

Juniper introduces an abstraction called "Forwarding Classes", and gives us control over what packets get placed into what forwarding class through the use of classifiers. In turn we define which forwarding classes map to which actual system queues. In theory this allows multiple forwarding classes to map to a single queue, however in our configuration we maintain a 1:1 mapping of forwarding class to queue. In other words one queue is used for all the traffic from a given forwarding class, and no queue has more than one forwarding class using it.

We define four forwarding classes in the Juniper configuration and map them to queues as follows:

Forwarding Class to Queue Mapping

Forwarding-Class   Queue (MX Routers)   Queue (QFX/EX Switches)
LOW_PRIO           0                    0
NORMAL_PRIO        1                    3
HIGH_PRIO          2                    4
MGMT_CONTROL       3                    7

The queue numbers used for the classes differ per platform, to keep the default system queue numbers in place. If user-defined forwarding classes are mapped to non-default queue numbers, the device creates those queues in addition to the defaults, which remain. By using the default numbering we keep the total number of queues at 4, which ensures maximum compatibility across platforms.
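The table can also be expressed as plain data, which makes the 1:1 forwarding-class-to-queue property easy to verify; a small illustrative Python sketch:

```python
# Per-platform forwarding-class -> queue assignments, from the table above.
QUEUE_MAP = {
    "mx":     {"LOW_PRIO": 0, "NORMAL_PRIO": 1, "HIGH_PRIO": 2, "MGMT_CONTROL": 3},
    "qfx_ex": {"LOW_PRIO": 0, "NORMAL_PRIO": 3, "HIGH_PRIO": 4, "MGMT_CONTROL": 7},
}

for platform, mapping in QUEUE_MAP.items():
    # 1:1 mapping: no queue carries more than one forwarding class,
    # and each platform uses exactly four queues in total.
    assert len(set(mapping.values())) == len(mapping) == 4, platform
```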

This also matches the Juniper default setup, in which two forwarding-classes, and two queues, are used.


Classifiers

Packets arriving into a device need to be mapped to a given forwarding class for transmission. This function is known as classification. In general JunOS provides three ways to do this:

  • Default Classifier - A default classifier maps all incoming traffic on a given interface to a single forwarding class.
  • DSCP Classifier - A DSCP classifier can be used to map packets to forwarding classes based on the value of the DSCP bits in each packet on arrival. This is used in the Wikimedia design on our "trusted" interfaces, where we know the DSCP markings have already been set by our policy.
  • Firewall Filter - For maximum flexibility, the firewall filter / ACL functions on a device can be used to map traffic to forwarding classes. This can be used to map packets based on the typical criteria (src/dst networks, port numbers, protocol) in any ACL. JunOS also allows for the re-writing of DSCP bits in a firewall action. This is used in the Wikimedia design on our external, internet-facing interfaces.
DSCP Classifiers

DSCP classifiers can be configured for either IPv4 or IPv6. Rather confusingly, on some platforms both can be defined and applied to an interface, whereas on others only one or the other type can be applied to an interface. Where it is not possible to add both an IPv4 and IPv6 classifier to an interface, an IPv4 classifier should be used, and the rules it defines will be applied to both address families.

In either case we classify traffic into the same forwarding class based on the same DSCP bits, regardless of address family. We nonetheless define a classifier of both kinds on all devices, as in some cases we need to configure both:

classifiers {
    dscp V4-CLASSIFIER {
        forwarding-class HIGH_PRIO {
            loss-priority low code-points 010010;
        }
        forwarding-class LOW_PRIO {
            loss-priority high code-points 100010;
        }
        forwarding-class MGMT_CONTROL {
            loss-priority low code-points 110000;
        }
        forwarding-class NORMAL_PRIO {
            loss-priority high code-points 000000;
        }
    }
    dscp-ipv6 V6-CLASSIFIER {
        forwarding-class HIGH_PRIO {
            loss-priority low code-points 010010;
        }
        forwarding-class LOW_PRIO {
            loss-priority high code-points 100010;
        }
        forwarding-class MGMT_CONTROL {
            loss-priority low code-points 110000;
        }
        forwarding-class NORMAL_PRIO {
            loss-priority high code-points 000000;
        }
    }
}
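The six-bit binary code-points in the classifier should line up with the decimal DSCP values from the class table; a quick illustrative check in Python:

```python
# The Junos classifiers above express DSCP values as six-bit binary strings.
codepoints = {
    "HIGH_PRIO":    "010010",  # AF21
    "LOW_PRIO":     "100010",  # AF41
    "MGMT_CONTROL": "110000",  # CS6
    "NORMAL_PRIO":  "000000",  # DE
}

# Convert each binary string to its decimal DSCP value.
decimals = {fc: int(bits, 2) for fc, bits in codepoints.items()}
print(decimals)  # {'HIGH_PRIO': 18, 'LOW_PRIO': 34, 'MGMT_CONTROL': 48, 'NORMAL_PRIO': 0}
```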

Interface Classifier Configuration

As discussed, we need to apply one or both of these classifiers to interfaces in slightly different ways, depending on the type of interface and platform. All are applied in the 'class-of-service interfaces' configuration context. The table below lists the way they need to be defined.

Platform                            Interface Type                                       Classifier Config                Classifier Location
QFX5100 and EX4300 Series Switches  L2 port facing server                                IPv4 DSCP classifier only        Unit 0 of the L2 port
QFX5100 and EX4300 Series Switches  Routed port or routed sub-interface parent           IPv4 DSCP classifier only        The physical interface itself, rather than any particular unit
QFX5120 Series Switches             L2 port facing server                                IPv4 and IPv6 DSCP classifiers   Unit 0 of the L2 port
QFX5120 Series Switches             Routed port or routed sub-interface parent           IPv4 DSCP classifier only        The physical interface itself, rather than any particular unit
MX Series Routers                   Routed L3 interface, no vlan encap / sub-interfaces  IPv4 and IPv6 DSCP classifiers   Unit 0 of the routed port
MX Series Routers                   Routed sub-interface                                 IPv4 and IPv6 DSCP classifiers   Every unit under the parent interface

On all platforms, no classifier is applied to the individual member interfaces of a LAG / ae bundle. The same rules as above apply to ae interfaces, and are applied to the ae interface itself, which affects all member ports.

Schedulers

A scheduler is a configuration element that defines scheduling parameters to be applied to certain traffic on egress from the system. In our design we configure a single scheduler for each of our 4 defined forwarding-classes, using a simple 1:1 mapping. Each scheduler gets a "transmit-rate" configured, as well as a buffer-size (both defined as percentages). These schedulers ultimately implement the weighting described above under Scheduling Bandwidth.

The default schedulers configured on all our Juniper devices are shown below; 4 are configured, matching the 4 forwarding classes we use:

    schedulers {
        SCHED_HIGH {
            transmit-rate percent 35;
            buffer-size percent 35;
        }
        SCHED_LOW {
            transmit-rate percent 10;
            buffer-size percent 10;
        }
        SCHED_MGMT {
            transmit-rate percent 5;
            buffer-size percent 5;
        }
        SCHED_NORMAL {
            transmit-rate percent 50;
            buffer-size percent 50;
        }
    }

It should be noted that these represent the minimum allocation for bandwidth and buffer space each forwarding class gets allocated under congestion. Any class can utilize up to 100% of the available bandwidth or buffer if there are no packets from other classes that also need it.

Scheduler Maps

Each defined scheduler is an independent element which does nothing on its own. Scheduler-maps are used to bring everything together. A scheduler map references one or more forwarding-classes, and associates each with a previously defined scheduler. Scheduler maps are then associated with interfaces under the 'class-of-service' config to define the egress QoS behaviour for a particular port.

In the Wikimedia design only one scheduler map is defined, which is applied to all interfaces. Scheduler maps only relate to outbound traffic, so they get applied equally to trusted and untrusted interfaces. The default scheduler-map in the config is as follows:

scheduler-maps {
    WMF-MAP {
        forwarding-class HIGH_PRIO scheduler SCHED_HIGH;
        forwarding-class LOW_PRIO scheduler SCHED_LOW;
        forwarding-class MGMT_CONTROL scheduler SCHED_MGMT;
        forwarding-class NORMAL_PRIO scheduler SCHED_NORMAL;
    }
}


Outbound Shapers

As seen in the previous section our schedulers use percentages to define minimum bandwidth for forwarding classes. On a given interface JunOS will calculate those rates as a percentage of its line rate (i.e. 1/10/25/40/100G etc).

For the most part that is desirable, however occasionally we utilize sub-rated Ethernet services as transport links between our sites. In these cases, where the actual bandwidth available to us over a link is lower than the line rate, we need to configure an outbound shaper to control the maximum rate. This is configured for an interface using the 'shaping-rate' command under the interface in the 'class-of-service' config.
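When a shaping-rate is applied, the scheduler percentages are, broadly speaking, calculated against the shaped rate rather than the physical line rate (the exact behaviour is platform-dependent, so check vendor documentation). An illustrative sketch using the 3,920,000k shaping rate from the sub-rated transport example in the next section:

```python
# Per-class minimum rates when scheduler percentages are taken from a
# shaping rate rather than line rate. Figures are illustrative only, using
# the 3,920,000 kbit/s shaping rate from the sub-rated transport example.
SHAPING_KBPS = 3_920_000
SCHED_PCT = {"SCHED_MGMT": 5, "SCHED_HIGH": 35, "SCHED_NORMAL": 50, "SCHED_LOW": 10}

minimums_kbps = {s: SHAPING_KBPS * pct // 100 for s, pct in SCHED_PCT.items()}
print(minimums_kbps["SCHED_HIGH"])  # 1372000
```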


Interface Config

Bringing it all together below you can see the class-of-service configuration for various types of interface.

Routed L3 port on an MX router connecting a sub-rated transport circuit:

class-of-service {
    interfaces {
        xe-0/1/0 {
            scheduler-map WMF-MAP;
            shaping-rate 3920000k;
            unit 0 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
        }
    }
}

Routed ae port on an MX router facing L2 switch stack:

class-of-service {
    interfaces {
        ae1 {
            scheduler-map WMF-MAP;
            unit 401 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
            unit 510 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
            unit 520 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
            unit 530 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
        }
    }
}

L2 port connecting to a server on a QFX5100 or EX switch:

class-of-service {
    interfaces {
        ge-1/0/0 {
            scheduler-map WMF-MAP;
            unit 0 {
                classifiers {
                    dscp V4-CLASSIFIER;
                }
            }
        }
    }
}

L2 port connecting to a server on a QFX5120 switch:

class-of-service {
    interfaces {
        ge-0/0/40 {
            scheduler-map WMF-MAP;
            unit 0 {
                classifiers {
                    dscp V4-CLASSIFIER;
                    dscp-ipv6 V6-CLASSIFIER;
                }
            }
        }
    }
}

There are other variants as described in the table previously but the above should give a taste of the configuration.