Kubernetes/Capacity Planning of a Service

From Wikitech

Introduction

Wikipedia defines capacity planning as "the process of determining the production capacity needed by an organization to meet changing demands for its products". For information technology specifically, it adds that "capacity planning involves estimating the storage, computer hardware, software and connection infrastructure resources required over some future period of time". This is the broad context we will be working with.

Background

WikiKube, as a Kubernetes cluster in the WMF, has a clearly stated goal of serving production MediaWiki and related microservices traffic, with a very clear exclusion for stateful applications/data stores. This is because the services in WikiKube are expected to, at least partially, conform to the 12 factor app paradigm, a set of guidelines (named factors) that allow, among other benefits, the services to scale without significant changes to tooling, architecture or practices. In practice, this happens by altering the number of instances of the service to match the current needs, without needing much more than a simple change to define that number. Stateful applications (of which data stores are a subset) require more tooling, architectural changes or practices to achieve this, as coordination between the instances via some mechanism (internal or external) is required.

If you want to see how Capacity Planning happens for a data store, the Cassandra/CapacityPlanning page might be useful.

Thus, we will not be covering capacity planning best practices for stateful services (again, of which data stores are a subset) here. We will be discussing applications that implement important business logic, with each instance operating independently of the others.

High level overview

In the next sections we will be going through a methodology that allows us to:

1. Understand the capacity limits of a single instance of the application

2. Properly size that single instance for Kubernetes, in order to empower the scheduler to achieve the best possible outcomes for instance placement

3. Craft an estimation of the amount of instances that will be required

4. Codify it in our deployment-charts repo

We will be focusing on HTTP-based services only. The general ideas can be applied to non-HTTP-based applications too, but the tooling and examples are going to be focusing on HTTP.

Methodology

Discover the capacity limits of a single instance

Let us first focus on figuring out the capacity of a single instance of the application. For this to happen we need to:

  • Know what dependencies exist, i.e. which services the application is going to be calling out to
  • Utilize a process to benchmark the app in order to figure out:
    • the maximum number of requests per second before that single instance ceases to be able to respond successfully and starts failing
    • the number of requests at which the response times of the application climb to unacceptable levels. This number might match the one above, or it might be sufficiently different. If it does match, it simplifies things a bit, but that is about it
  • Record the CPU and Memory utilization of the application when reaching that maximum
The astute reader may have figured out that the intent of the above process is to saturate the application. It might be helpful to also become familiar with "The Four Golden Signals" approach; see https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals
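To make the saturation idea concrete, here is a small sketch of how one might analyze the ramp-up data once collected. The `max_sustainable_rps` helper and the sample numbers are hypothetical, purely for illustration:

```python
# Hypothetical helper: given ramp-up samples of (users, rps, p95_ms),
# find the highest throughput that stayed within a latency budget.
def max_sustainable_rps(samples, p95_budget_ms):
    """Return the highest observed RPS whose p95 latency is within budget."""
    within_budget = [rps for _, rps, p95 in samples if p95 <= p95_budget_ms]
    return max(within_budget) if within_budget else 0.0

# Made-up ramp-up data for illustration only:
samples = [
    (10, 20.0, 150),    # light load, fast responses
    (40, 60.0, 400),    # still healthy
    (80, 80.0, 950),    # around the knee
    (120, 85.0, 2400),  # past saturation: latency explodes
]
print(max_sustainable_rps(samples, p95_budget_ms=1000))  # -> 80.0
```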

Dependencies

Dependencies of the application under test need to either be mocked out or, if that is impossible, the testing strategy needs to choose the dependency call with the lowest response latency. Note that health checks are a prime example of extremely fast endpoints, but using them requires altering the application.

The intent here is to isolate the application as much as possible from dependencies so that performance bottlenecks of the dependencies do not show up as deficiencies of the application.

To provide a somewhat contrived example, if we have an application that parses user input for doing mathematical operations (addition, subtraction, multiplication, division, integration, derivation, etc) and the various operations themselves are being done by a dedicated endpoint in a dependent service, we either:

  • mock out, in the application, the call to the dependent service, e.g. always returning success
  • choose the fastest possible endpoint, i.e. addition/subtraction, as they are faster than any of the other ones.
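As a sketch of the first option, a stand-in for the dependent service can be as simple as an HTTP stub that always answers success instantly. Everything here (class name, port handling) is a hypothetical illustration, not an existing WMF tool:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysSuccessStub(BaseHTTPRequestHandler):
    """Stand-in for a dependent service: always answers 200 OK, instantly."""

    def do_POST(self):
        # Drain the request body so the client is not left blocked, then
        # unconditionally report success without doing any real work.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = json.dumps({"result": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging to keep benchmark output clean.
        pass

def run_stub(port=0):
    """Start the stub in a background thread; returns the server object.

    port=0 picks a free ephemeral port; the real port is in server_address.
    """
    server = HTTPServer(("127.0.0.1", port), AlwaysSuccessStub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

During the benchmark, the application under test would be configured to call this address instead of the real backend, so dependency latency drops out of the measurement.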

Benchmarking

What we are going to be doing, is benchmarking/stress testing the application, in a gradual way. The tools we are going to be using here are:

  • minikube: This allows you to run a local Kubernetes cluster. If you already have one, this step can be skipped. Note that the production WMF clusters are not set up for performing the rest of this process
  • helm: You are probably already deploying apps in wikikube using this and/or Helmfile
  • kubectl: Will be used to grab logs and metrics from the app.
  • locust: This assumes you have access to python and pip.
In the absence of python and pip, it is possible to use other tools like siege, vegeta or even Apache Bench (ab). It is best to avoid ab, as it is old software that has its HTTP version hardcoded to HTTP/1.0, which many modern servers do not support without special configuration. siege should work, but note that it lacks visualizations and has a learning curve.

Setup

Fire up minikube (if you already have a Kubernetes environment handy, e.g. via Docker Desktop, skip this step. Note, however, that you will have to figure out how to get stats out of pods)

$ minikube start

Depending on your system (Windows, MacOS, Linux, docker/podman availability), it might require spinning up a VM or just installing everything in docker/podman.

It is beyond the scope of this document to provide exhaustive documentation for setting up minikube across all platforms and combinations of supported drivers.

Install metrics-server in order to have pod metrics

$ minikube addons enable metrics-server

Wait 30 seconds or so for the addon to be downloaded and started.

Install the application you want to benchmark and get a tunnel set up

$ helm repo add wmf-stable https://helm-charts.wikimedia.org/stable/
$ helm install <name> wmf-stable/<chart-name>
$ kubectl get svc
# Note the name of your service
$ minikube service <name>

This will probably open a browser to your app. If it does not, at least note the URL; you will need it later.

In a dedicated terminal start up our simple CPU monitor

$ while true; do kubectl top pod --containers <pod> | ts '[%Y-%m-%d %H:%M:%S]'; sleep 2; done

Output will look like this:

[2025-09-04 18:47:31] POD        NAME        CPU(cores)  MEMORY(bytes)
[2025-09-04 18:47:31] <pod-name> <container> 2m          121Mi

This is a poor person's metrics gathering. Using tools like GNU watch, tee, etc. could make this better, but the TL;DR is to just run `kubectl top` every 2 seconds or so in the terminal. Feel free to improve on this!
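One way to improve on the loop above, as a sketch: feed its output through a small parser that turns each timestamped sample into structured numbers ready for plotting or CSV export. The function below is hypothetical and assumes the `kubectl top pod --containers` column layout shown earlier (POD, NAME, CPU in millicores, memory in Mi):

```python
import re

# Matches a timestamped data line of the `kubectl top` loop, e.g.
# "[2025-09-04 18:47:31] <pod> <container> 994m 354Mi".
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d: -]+)\]\s+(?P<pod>\S+)\s+(?P<container>\S+)\s+"
    r"(?P<cpu>\d+)m\s+(?P<mem>\d+)Mi\s*$"
)

def parse_top_line(line):
    """Parse one loop output line.

    Returns (timestamp, pod, container, cpu_millicores, memory_mib),
    or None for header lines and anything else that does not match.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    return (m.group("ts"), m.group("pod"), m.group("container"),
            int(m.group("cpu")), int(m.group("mem")))

print(parse_top_line(
    "[2025-09-04 18:37:36] mathoid-mathoid-5fdb7bb68f-bwmkn mathoid-mathoid 994m 354Mi"
))
```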

Locustfile

Writing the proper locustfile for your app is the most interesting part in all of this. It is also highly dependent on how your app works and what endpoints it exposes.

Make sure to read https://docs.locust.io/en/stable/writing-a-locustfile.html, but for this doc we will pretend we are benchmarking Mathoid (because it is a very simple app). Our locustfile.py (it needs to be named like that) looks like this:

locustfile.py
from locust import HttpUser, task

class mathoid(HttpUser):

    @task
    def mathoid(self):
        data = {"type": "Tex", "q": "E=mc^{2}", "nospeech": "true"}
        self.client.post("/complete", data)

This is more or less as simple as it gets. It posts Albert Einstein's well-known formula to an HTTP endpoint named "/complete". For this example, we do not care about the result. For mathoid specifically, the formula itself does play a role in the amount of resources consumed, so finding one that is somewhat representative of the expected input makes sense. But this is just an example, so we move on.
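Since the formula influences resource consumption, one refinement is to vary the payload across a small pool of representative inputs rather than hammering a single one. A hypothetical sketch of just the payload-picking part (the formulas here are illustrative, not a curated production sample):

```python
import random

# Illustrative TeX formulas of varying complexity; a real benchmark would
# sample these from actual production input.
FORMULAS = [
    "E=mc^{2}",
    "\\frac{d}{dx}\\left(x^{n}\\right)=nx^{n-1}",
    "\\int_{0}^{\\infty} e^{-x^{2}} dx = \\frac{\\sqrt{\\pi}}{2}",
]

def random_payload():
    """Build a Mathoid-style POST body with a randomly chosen formula."""
    return {"type": "Tex", "q": random.choice(FORMULAS), "nospeech": "true"}

# Inside the locust task, one would then do:
#     self.client.post("/complete", random_payload())
```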

Run locust

$ locust

[2025-09-04 18:21:11,082] himbar/INFO/locust.main: Starting Locust 2.39.1

[2025-09-04 18:21:11,083] himbar/INFO/locust.main: Starting web interface at http://0.0.0.0:8089, press enter to open your default browser.

Go to the web interface.

Paste the URL that you noted earlier into the "Host" field. Enter a number of "users" (really, concurrent requests) and a ramp up value that makes sense. Below there is an example

Click start. You should now see numbers increasing. An example is pasted below

Eventually, the app will collapse under the weight of all these concurrent users. Somewhere around there you can stop the benchmark. Below there is a screenshot of that happening.

What can we understand from the charts above?

  1. The app was able to serve up to ~80 rps at maximum. Note that this maximum was reached WELL before the app started emitting HTTP errors
  2. The maximum number of concurrent requests served was ~115. The discrepancy between the ~80 number and the 115 is due to the event buffering that nodejs (or any event-based framework) provides. Once that buffer was exhausted, things broke down
  3. Latency increased linearly with the number of users. The maximum RPS coincides with ~1000ms of latency at p95 and ~780ms at p50. Once latency goes above those numbers, bad things start to happen.
  4. Once the app started failing, mean latency dropped. That is because it is very cheap to respond with error codes. That is also why the total number of requests skyrocketed 10x, almost all of them errors, once the app collapsed.
  5. The timestamp of when we hit our max is known to us by just hovering with the mouse. We can note it down.

Concretely we now know:

  • That a single instance can sustain up to 80 rps. We have some leeway after that before things break down, but it is prudent to have enough instances to avoid going over that number per instance in production.
  • If p95 goes above ~1000ms, then we should prepare for trouble. We probably have some time to add capacity before things break down when running in production. So, high latency is an early sign of failure. It is prudent to set our SLO to this number.

Now let us find out the CPU and memory usage right around that point. Moving to the terminal running `kubectl top` in a loop, match the timestamp to the one where we hit our max. We probably have some line like:

[2025-09-04 18:37:36] mathoid-mathoid-5fdb7bb68f-bwmkn mathoid-mathoid 994m 354Mi

And we now know our first set of numbers for resources. Round up for safety to 1000m and 400Mi and we are good to go. This is our maximum consumption and that is what we can base our containers' limits on.

The above is pretty strict. The numbers are low enough that I would probably add a leeway of 25% to them, so CPU: 1250m and RAM: 500Mi, to avoid the limits killing the app before it actually reaches saturation. However, if we were talking about multi-gigabyte and multi-CPU applications, I would be more conservative, in order to conserve resources.
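The rounding and leeway arithmetic can be sketched as a small helper. The step size and the 25% leeway are just the choices made in this example, not hard rules:

```python
import math

def round_up(value, step):
    """Round value up to the next multiple of step."""
    return math.ceil(value / step) * step

def with_headroom(value, step, leeway=0.25):
    """Round up to a step boundary, then add a safety margin on top."""
    return round_up(round_up(value, step) * (1 + leeway), step)

# Observed at saturation: 994m CPU and 354Mi memory.
print(with_headroom(994, step=50))  # CPU limit in millicores -> 1250
print(with_headroom(354, step=50))  # memory limit in Mi -> 500
```

This reproduces the 1250m / 500Mi figures above: 994m rounds up to 1000m, plus 25% gives 1250m; 354Mi rounds up to 400Mi, plus 25% gives 500Mi.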

We need a second set of numbers, those that apply during "normal operations". This is slightly more difficult to find out. For this one we want to find the resource consumption during what will probably be the "average" case. This can be difficult to determine and we might want to rely on historical data instead (if we have it). For this example, intuitively, we probably want something around 50% of the max RPS. This gives us enough headroom to absorb spikes of up to 100% of our average load. Given we start failing at 80, the average is 40. Looking at the logs of kubectl top, I got a line like this:

[2025-09-04 18:35:16] mathoid-mathoid-5fdb7bb68f-bwmkn mathoid-mathoid 762m 280Mi

Those numbers might feel surprising. Why are they not 50% of the max resource consumption? If only things were so simple. The app has probably loaded into memory most of what it needs by that point; most of the resource consumption has already happened and is thus amortized across all requests. Also, metrics-server is not updating every 2 seconds. For now this will do, but for more complex applications it might make sense to use a slower ramp-up (fewer new users per second) to allow more fine-grained data collection.

Anyway, those numbers, rounded up, give us 800m and 300Mi. Those are our requests.

Record the CPU and Memory utilization

Now codify the two sets of numbers in the Helm chart:

resources:
  limits:
    cpu: 1000m
    memory: 400Mi
  requests:
    cpu: 800m
    memory: 300Mi

Craft an estimation of the amount of instances that will be required

The above might have felt scientific. The next part is way less so, because it has to do with the future, and the future is largely unknown. The point here is to make a first order estimation (a somewhat educated guess) of the expected number of requests per second that will reach the service. Take into account:

  • The age of the service
    • Is it new? Is it old?
    • Do we have historical data about usage?
  • The (probable) popularity of the service
    • Is it part of the wishlist?
    • Is this empowering editors? readers? applications?
  • Possible spikes
    • If a very popular enwiki template is edited, will this service end up receiving a ton of requests?
  • Needed availability
    • Consult with an SRE about this question. They know the DCs and availability zones better than anyone else. And they can help answer questions like the ones below.
    • If a hardware node that hosts an instance of this service fails, there can be up to 5 minutes before the pod is restarted somewhere else. Is this acceptable?
    • If a service is truly critical and needs very high availability, adding more instances does not guarantee availability. Rather, extra configuration might be required to make sure instances are spread well in the DC across availability zones. This adds more complexity and makes deployments more brittle. So, resist the urge to over provision to ensure availability and resist the urge to require more availability than necessary for your service.

With this you probably have a number. Divide this number by the RPS you calculated for the average scenario and round up. Let us say, for simplicity's sake, that we know we receive ~180 rps on average for this service. Dividing this by 40 (our average RPS per instance from above) gives us 4.5. Round up to 5. We now confidently know that we can serve 200 rps and can go up to 400 rps before we start seeing delays and errors.
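The division above is simple enough to spell out, using the numbers from the running example:

```python
import math

def replicas_needed(expected_rps, avg_rps_per_instance):
    """Instances needed so that average per-instance load stays at target."""
    return math.ceil(expected_rps / avg_rps_per_instance)

expected_rps = 180          # estimated average traffic for the service
avg_rps_per_instance = 40   # 50% of the 80 rps saturation point
print(replicas_needed(expected_rps, avg_rps_per_instance))  # -> 5
```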

Codify it in our deployment-charts repo

Let us codify this in helmfile.d/services/<service>

replicas: 5

And that is it. Capacity planning is done.

What about when our estimate ends up being wrong?

The chances that our first order estimate remains sufficient for a prolonged amount of time are not very high. The world is in constant flux, and we have certainly not accounted for something we were not even aware of (nor could we have been).

That being said, we have already done the difficult part: learning the capabilities of a single instance of our application. Barring major regressions, outages and rewrites, we do not need to re-run this process.

So, we monitor our latency and error SLOs and can add or remove capacity based on current events by just changing that replicas stanza in helmfile.d. Increasing or decreasing based on what we observe and need is one edit and one deployment away.

Should I be running this on my laptop/Cloud VM or production?

You might be inclined to say "I need to run this in production! That would be the best place, right?". However, that is not true. No problems will plague your service if you run this on your laptop or a Cloud VM, for the following reasons:

  • Memory patterns and usage are not going to differ regardless of where the benchmark is done
  • CPU usage might differ, but it is improbable that a VM or a laptop, with a consumer CPU or a CPU shared with other workloads, will have more CPU capacity than a server with a server-class CPU. Chances are the application will saturate sooner than it would on a server CPU. However, a CPU is a CPU as far as scheduling and kernel limiting go. So, what will probably happen is that a single instance in production will be able to serve more requests before becoming saturated, which means a more stable service for you.

So, do NOT try to account for server CPUs or the production environment. Rather, perform this exercise in a way that optimizes for the consumption of your team's precious resources.

What about Rosetta/QEMU/KVM emulation? I have an arm64 laptop, I heard it is really bad for running amd64 containers.

There is some truth to that statement. However, the workloads that suffer most are the ones that do a lot of input/output. If your application is doing a lot of IOPS, look into why, and consider what you can do to limit the reads and writes from disk. If it starts to feel like a data store, then it is out of scope for this guide anyway.

Otherwise, the CPU emulation that Rosetta 2 achieves has a pretty low overhead, which means that the answer to the previous question applies here too.

Conclusion

Above, we have tried to describe a generalized process for saturating a single instance of an application, thus leading us to understand how to size it. We then utilize that "quantum" of the app to decide how many instances to run, taking into account current (if any) and expected traffic levels, including possibly spiky behavior, availability requirements and how they relate to WMF infrastructure. We have not covered data store capacity planning, as it is a sufficiently different subject that requires a dedicated page.