Kubernetes/Resource requests and limits

When you define a Pod, you should specify how much of each resource each container needs. The resource requests are used by the kube-scheduler to decide which node to place the Pod on, and the kubelet on that node reserves the requested resources specifically for that container to use. When limits are specified for a container, the kubelet enforces them so that the container cannot use more of a resource than its limit allows.

If the node where a Pod is running has enough of a resource available, it’s possible (and allowed) for a container to use more resources than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
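
As a concrete illustration (a minimal sketch; the container name, image and values are placeholders, not recommendations), this is the shape of the resources stanza of a container in a Pod spec, expressed here as a Python dict:

```python
# Minimal sketch of a container's `resources` stanza (placeholder values).
pod_spec_fragment = {
    "containers": [
        {
            "name": "example-app",          # hypothetical container name
            "image": "example/app:latest",  # hypothetical image
            "resources": {
                # Used by kube-scheduler for placement, reserved by the kubelet.
                "requests": {"cpu": "100m", "memory": "256Mi"},
                # Enforced by the kubelet via cgroups; usage cannot exceed these.
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        }
    ]
}
```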

How Memory requests and limits are applied

Memory Requests and Limits are relatively straightforward: the former indicates the minimum amount of memory guaranteed to a container, and the latter the maximum. With only Requests set, a container is free to grow past that minimum up to whatever the underlying worker node allows. With Limits set, the cgroup used by the container enforces a hard ceiling that triggers an OOM event if crossed (the kernel will likely kill one or more of the container's processes).
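
To make the enforcement concrete, here is a small sketch (assuming cgroup v2 and binary suffixes only; the helper name is ours) of how a memory limit such as 512Mi translates into the byte value written to the container cgroup's memory.max (memory.limit_in_bytes on cgroup v1):

```python
# Rough sketch: convert a Kubernetes memory quantity to the byte value that
# ends up in the container cgroup's memory.max (cgroup v2).
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def memory_limit_to_bytes(quantity: str) -> int:
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes

# Crossing this value triggers an OOM event in the cgroup; the kernel will
# likely kill one of the container's processes.
print(memory_limit_to_bytes("512Mi"))  # 536870912
```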

How CPU requests and limits are applied

Let's start by saying that any value you may set for Requests or Limits is an approximation of what it means to get CPU cores assigned to a specific process. Both Requests and Limits use the same unit of measure, the millicore, which represents a fraction of a core (one core being 1000m or 1). For example, 100m means 1/10th of a core (100/1000), and 2000m represents two full cores (2000/1000). Kubernetes and the Container Runtime try to map this intuitive and easy-to-understand idea onto the tools that the Linux CFS scheduler offers, ending up with an approximation of what the user requested, not a 1:1 correspondence. This means that the user needs to pay attention when setting these values, since they may not represent what a process in a cgroup will actually get.
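
A tiny worked example of the unit (nothing Kubernetes-specific, just the arithmetic described above):

```python
# 1000m == 1 core; millicores are simply thousandths of a core.
def millicores_to_cores(millicores: int) -> float:
    return millicores / 1000

print(millicores_to_cores(100))   # 0.1 -> one tenth of a core
print(millicores_to_cores(2000))  # 2.0 -> two full cores
```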

CPU requests are usually (depending on the Container Runtime) implemented as weights; in Docker, for example, this is done using the --cpu-shares flag. It is important to note that CPU shares are expressed as a proportion of the CPU cycles available on a machine, and that they only come into play under CPU contention (e.g. “all” containers consuming lots of CPU at the same time). When some containers are idle, the running containers are free to use the leftover CPU time/shares (see Docker Docs | CPU share constraint for more details).
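
As a sketch (the exact rounding is a runtime implementation detail), the request-to-shares mapping described above looks roughly like this:

```python
# Approximate mapping from a CPU request to CFS shares (1 core == 1024 shares),
# i.e. the kind of value passed to e.g. `docker run --cpu-shares=...`.
def millicores_to_shares(millicores: int) -> int:
    return millicores * 1024 // 1000

print(millicores_to_shares(1000))  # 1024
print(millicores_to_shares(250))   # 256
```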

Kubernetes by default maps one core to 1024 shares, and sets the maximum capacity of a worker node to the number of available cores multiplied by 1024 (unless instructed/configured differently). For example, let's say we have three containers with 1024, 2048 and 4096 shares respectively, all of them constantly demanding CPU time, and a grand total of 4 cores available on the worker node. If you sum up all the shares and divide by 1024 you get 7, but the system doesn't offer that many cores. So CFS schedules processes on the available 4 cores proportionally to their shares, which may differ from what the user intuitively expected. The main limitation of this CFS feature is that it offers no control over the maximum amount of resources assigned to a container, and this is why CPU Limits were created. Please note that if Limits are not set, the CPU request is only used during scheduling (to select which node can host a Pod) and as a relative weight under contention; it does not cap CPU usage.
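
The example above, worked out in a few lines (numbers taken directly from the text):

```python
# Three always-busy containers on a 4-core node: under contention CFS divides
# the cores proportionally to the shares, not 1024-shares-per-core.
shares = {"a": 1024, "b": 2048, "c": 4096}
node_cores = 4
total_shares = sum(shares.values())  # 7168, i.e. "7 cores" worth of shares

for name, s in shares.items():
    print(name, round(node_cores * s / total_shares, 2))
# a 0.57, b 1.14, c 2.29 -- the 1024-share container gets ~0.57 cores, not 1
```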

CPU limits are enforced using cgroups CFS bandwidth control, which works by defining the maximum amount of CPU time that a container can get within a given period (usually 100ms, split into 5ms slices). Quota slices are assigned to per-CPU run queues (one for each available core) and unused quota is returned to the global pool (minus 1ms of “slack time”). In this context, a millicore represents the share of those 100ms the container is allowed to use: 100m represents 10ms, 2000m represents 200ms (100ms for each core/run queue). This means that a task that usually takes 40ms of CPU time to complete will take 310ms of wall-clock time when the process is limited to 100m (i.e. 10ms per period):

  • It runs for 10ms (first 100ms period, two 5ms slices).
  • It gets throttled for 90ms.
  • It runs for 10ms (second 100ms period, two 5ms slices).
  • It gets throttled for 90ms.
  • It runs for 10ms (third 100ms period, two 5ms slices).
  • It gets throttled for 90ms.
  • It runs for 10ms (fourth 100ms period, two 5ms slices).

The above assumes that only one process/thread runs; with multiple threads the picture can get worse. If the task in this example runs in 4 threads, only two of them will be able to run per period (the 10ms quota is split into two 5ms slices), effectively throttling two of the threads for each complete period. This means that setting limits may not be straightforward when multiple threads are running and requesting CPU at the same time (most Golang processes nowadays show this behavior, for example). As also suggested further down, the key is to know how many threads a certain container will run, whether they will all need the CPU at the same time, and whether they can be capped at a maximum number.
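
A simplified model of the behaviour described above (single always-busy thread, no other load on the node; the helper names are ours):

```python
# CFS bandwidth control, heavily simplified: a limit of N millicores buys
# N/1000 * period of CPU time per period; once it is spent, the task waits.
def limit_to_quota_ms(millicores: int, period_ms: int = 100) -> float:
    return millicores / 1000 * period_ms  # 100m -> 10ms, 2000m -> 200ms

def wall_clock_ms(work_ms: int, quota_ms: float, period_ms: int = 100) -> float:
    # Wall-clock time for a single busy thread to get `work_ms` of CPU time.
    full_periods, remainder = divmod(work_ms, quota_ms)
    if remainder == 0:
        full_periods, remainder = full_periods - 1, quota_ms
    return full_periods * period_ms + remainder

# The 40ms task from the example, limited to 100m (10ms per 100ms period):
print(wall_clock_ms(40, limit_to_quota_ms(100)))  # 310.0
# With 4 busy threads the same quota is spent 4x faster, so throttling grows.
```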

In the worst case you will observe throttling in the container's metrics (see this dashboard for an example), namely the amount of time that the container's threads spent waiting to run on a CPU. A small amount (a few ms) may already be bad for latency-sensitive HTTP servers, while a larger amount (tens of ms) may be tolerable for Kubernetes controllers; it all depends on the requirements of the running container.

Setting requests

Requests should be set according to the expected requirements of the container under usual load: “what you generally need”, so to say. Benchmark your application, for example as described in User:Alexandros Kosiaris/Benchmarking kubernetes apps, to get an understanding of what to request. Keep in mind that your container can potentially consume more than the requested resources, but is only guaranteed what it requests.

Please review the resource requests you have configured after the service has gone live and adjust them if they do not align with the actual resource requirements. It might also be worth updating your benchmark setup to better reflect reality.

Setting limits

We (SRE) generally require containers to set limits. For memory this is straightforward to reason about: set your memory limit to what you consider “out of line” for your service. Your container will potentially get OOM-killed (and restarted) when it crosses that threshold.

For CPU the situation is more complex due to how the limits are applied. You will potentially also need to configure your application properly (thread pools, for example) to avoid being throttled. Generally speaking, limits give you predictable performance (e.g. “a container with these limits can sustain a workload of X”), allowing you to scale by increasing the number of replicas. We found that for latency-sensitive applications, as well as for (non-CPU-bound) workloads with high concurrency, setting limits does not work well and leads to lots of throttling. This is mainly due to the effect described above, which makes multi-threaded applications burn through their quota very quickly.
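
To illustrate the “burning through quota” effect with a rough sketch (assuming all threads are CPU-bound for the whole period and there are enough free cores to run them in parallel):

```python
# With N busy threads sharing one quota, the quota is exhausted after
# quota_ms / N milliseconds of wall-clock time in each period.
def throttled_ms_per_period(busy_threads: int, quota_ms: float, period_ms: float = 100) -> float:
    running_ms = min(period_ms, quota_ms / busy_threads)
    return period_ms - running_ms

# 1000m limit -> 100ms quota: one busy thread is never throttled,
# but 8 busy threads spend 12.5ms running and 87.5ms throttled per period.
print(throttled_ms_per_period(1, 100))  # 0.0
print(throttled_ms_per_period(8, 100))  # 87.5
```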

Best practice

MediaWiki php-fpm

At some point we witnessed high tail latency for mediawiki-on-kubernetes deployments (mainly from mw-api) that turned out to be the consequence of the php-fpm containers being heavily throttled. To overcome that, we removed the CPU limits from those containers and relied on the fact that php-fpm runs a fixed number of worker processes, so it cannot consume more CPU than it has workers (i.e. we effectively CPU-limit the container through the php-fpm configuration).

For MediaWiki, this is particularly acceptable, as we consider it our main application and expect it to utilize available headroom resources effectively.
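
A back-of-the-envelope sketch of that implicit cap (the worker count and node size below are made-up numbers, not our production configuration):

```python
# With a fixed php-fpm worker pool and no cgroup CPU limit, CPU usage is
# bounded by the number of workers (each worker is a single process).
php_fpm_workers = 8    # hypothetical pm.max_children value
node_cores = 48

effective_cpu_ceiling = min(php_fpm_workers, node_cores)
print(effective_cpu_ceiling)  # at most ~8 cores, even without a CPU limit
```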

envoy

By default, Envoy creates a number of worker threads equal to the number of CPUs available on the worker node. As described above, this is problematic in combination with limits: on nodes with many CPUs, together with the relatively low CPU limits we usually give Envoy (as it does not actually use much CPU), it can very quickly lead to massive throttling.

Istio solves this problem by setting the Envoy concurrency according to the CPU limit of the container (max(ceil(<cpu-limit-in-whole-cpus>), 2)). For MediaWiki deployments we currently remove the limits and set the concurrency to 12 (see T344814: mw-on-k8s tls-proxy container CPU throttling at low average load), but ultimately a generic solution would be desirable.
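
A sketch of that heuristic (the function name is ours):

```python
import math

# Istio's approach as described above: derive Envoy's worker thread count
# (--concurrency) from the container's CPU limit, with a floor of 2.
def envoy_concurrency(cpu_limit_cores: float) -> int:
    return max(math.ceil(cpu_limit_cores), 2)

print(envoy_concurrency(0.5))  # 2
print(envoy_concurrency(1.5))  # 2
print(envoy_concurrency(4.0))  # 4
```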

NodeJS services based on WMF's service-runner should be run with num_workers: 0.

service-runner uses NodeJS Clustering to farm out work to multiple processes. When used as part of WMF's service-template-node, the worker processes handle incoming HTTP requests.

Use of service-runner and NodeJS clustering was more important when we ran services on bare metal: it was how work was parallelized. Now that we run services in k8s, it is better to use Deployment replicas to parallelize work. Setting num_workers: 0 causes service-runner to handle its 'master' and 'worker' load in a single process. This setting proved especially beneficial for services that use node-rdkafka and set UV_THREADPOOL_SIZE to increase Kafka client throughput. The more threads a process has, the more likely it is that CFS allocates CPU time in a less-than-completely-fair way, which in turn is more likely to lead to CPU throttling.

A possible alternative to avoid throttling caused by too many threads would have been to set a large UV_THREADPOOL_SIZE on the worker processes only. This would have resulted in approximately the same total number of threads, even with num_workers set to more than 0.

Knative Serving

The ML team has struggled a lot with several Go-based Kubernetes controller containers showing constant throttling (in the tens of ms). One of them, Knative Serving's webhook, showed an average of tens of ms of constant CPU throttling, and we tried to understand why. The container runs a Golang process with a lot of threads, 60 (gathered via `ps -eLf | wc etc..` on an ml-serve worker node), and its CPU limit was 1000m. As an approximation/abstraction (even if it is more complicated than this), let's calculate how many slices we have to work with: a 1000m limit corresponds to 100ms of quota per 100ms period, and 100ms / 5ms = 20 slices. So the webhook's threads have “only” a maximum of 20 slices to run in each period, and we can definitely see that if more than 20 of the 60 threads are runnable at the same time, the scheduler will not be able to run all of them. Moreover, the more threads there are, the fewer slices are available per thread. A 5ms slice may be enough for one thread but not for another, and any remaining CPU time needed will inevitably show up as throttling (a worked version of this arithmetic is sketched after the list below). In this case, there are multiple roads:

  • Remove Limits: this is a K8s controller and part of the control plane, so we don't want throttling and we don't add a ceiling to its CPU usage. Effective but risky, since we may start relying on spare capacity on the node that may or may not be there at any given time.
  • Increase Limits: raise them until there is a good compromise between the millicores assigned and the CPU throttling observed. The compromise can be reviewed and adjusted over time, and the admin/user can periodically assess how it impacts other key metrics.
  • Tune GOMAXPROCS according to the Limits (see also https://github.com/uber-go/automaxprocs/tree/master for alternatives).
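
A worked version of the slice arithmetic from the webhook example above (an approximation, as already noted):

```python
# 1000m limit -> 100ms of quota per 100ms period, handed out in 5ms slices.
period_ms, slice_ms = 100, 5
cpu_limit_millicores = 1000
runnable_threads = 60

quota_ms = cpu_limit_millicores / 1000 * period_ms  # 100.0
slices_per_period = int(quota_ms / slice_ms)        # 20

# At most ~20 threads can get a slice in a given period; with 60 runnable
# threads the rest have to wait, which shows up as throttling.
print(slices_per_period, "slices for", runnable_threads, "runnable threads")
```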

There is no perfect solution, but this use case shows that striving for zero throttling is sometimes neither needed nor desirable.

External resources