Wikimedia Cloud Services team/EnhancementProposals/Kubernetes Capacity Planning

Several measures can be considered when planning capacity for a Kubernetes cluster. In our current use case, the primary driver of cluster capacity consumption is the number of webservices deployed at any given time. As use cases change and develop, this design should be reviewed to identify where changes are needed.

Nodes

The most visible consumable unit in the cluster is worker nodes. Worker nodes in our current cluster design are m1.large OpenStack instances running Debian Buster. Each node contributes 4 vCPUs, 8 GB of RAM, and 80 GB of disk (all thin provisioned on the hypervisors).

Webservices

As currently designed, a webservice is a single-instance web application with no redundancy. It runs under a Deployment/ReplicaSet so that it is restarted automatically when it fails. Until the toollabs-webservice package is updated to use the apps/v1 version of Deployments, that is the only value obtained from the Deployment object, because automatic garbage collection and cascading deletes are features of the standard objects, not the beta extensions. As far as the actual impact on the Kubernetes cluster is concerned, webservices are therefore measurable as somewhat heavily provisioned pods.
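
As a rough illustration of what that provisioning looks like in API terms, the sketch below builds a single-replica Deployment with the official Kubernetes Python client. The tool name, labels, and image are placeholders, and the request/limit values simply mirror the defaults discussed in the next section; the real objects are generated by the toollabs-webservice package and may differ.

 from kubernetes import client

 # Requests/limits mirroring the defaults discussed below (assumption: the
 # actual toollabs-webservice defaults may vary per tool type).
 resources = client.V1ResourceRequirements(
     requests={"cpu": "250m", "memory": "250Mi"},
     limits={"cpu": "500m", "memory": "512Mi"},
 )

 labels = {"name": "example-tool"}  # placeholder tool name

 # A single-replica Deployment: the ReplicaSet it manages recreates the pod
 # when it fails, which is the only behaviour relied on at present.
 deployment = client.V1Deployment(
     metadata=client.V1ObjectMeta(name="example-tool", labels=labels),
     spec=client.V1DeploymentSpec(
         replicas=1,
         selector=client.V1LabelSelector(match_labels=labels),
         template=client.V1PodTemplateSpec(
             metadata=client.V1ObjectMeta(labels=labels),
             spec=client.V1PodSpec(
                 containers=[
                     client.V1Container(
                         name="webservice",
                         image="example.org/example-webservice:latest",  # placeholder image
                         resources=resources,
                     )
                 ]
             ),
         ),
     ),
 )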

The current cluster has about 709 webservices running, plus 28 pods running other configurations (some of them rather heavy-use, such as celery workers).

If each pod requests 1/4 CPU, then anything over 16 pods per node is oversubscribed, even if no pod "bursts" beyond its initial request (which is acceptable, within reason). If every one of them also requested 1 GB of RAM up front, we would be back at something like the old cluster's mind-blowing levels of oversubscription (300-600%) on an already potentially oversubscribed resource (OpenStack VMs). We are starting webservice pods at 1/4 CPU and only requesting a small amount of RAM (250 MB, though surely most Java pods, for example, will need more).

Since many tools are quiet things, and oversubscription makes the flowers grow, we may want to target one worker node per 25 small webservices (roughly 50% CPU oversubscription at the defaults, with about 6 GB of RAM requested; the burst limit defaults to 0.5 CPU with 512 MB of RAM, so RAM oversubscription measured against limits will report at roughly 50%). However, we should expect many services to require and consume more than the minimum and more than the defaults. If we assume 1% of services consume up to the maximum self-service limits (a full CPU and 4 GB of RAM) and, arbitrarily, that the rest just use defaults, we can use an initial plan of 11 webservices per node for 7 nodes while maintaining 50% oversubscription on CPU and 10% oversubscription of RAM. We should expect plenty of services in prefetch-hungry languages to need more RAM than CPU, so RAM oversubscription should end up much higher but still reasonable. This is not meant to express exactly where we are so much as to make arbitrary decisions that model both a reasonable estimate and some idea of how growth might look and need to be responded to.
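
The oversubscription figures above are straightforward arithmetic against a single m1.large worker; a small sketch of that model, using the per-pod numbers assumed in this section:

 # Back-of-the-envelope oversubscription for one m1.large worker (4 vCPU, 8 GB RAM).
 NODE_CPU = 4.0
 NODE_RAM_GB = 8.0

 def oversubscription(pods, cpu_per_pod, ram_gb_per_pod):
     """Return (cpu%, ram%) over/under node capacity; negative means headroom."""
     cpu = (pods * cpu_per_pod / NODE_CPU - 1) * 100
     ram = (pods * ram_gb_per_pod / NODE_RAM_GB - 1) * 100
     return cpu, ram

 # 25 small webservices at the default request (0.25 CPU, 0.25 GB): about 56%
 # CPU oversubscription (the ~50% target above) and 6.25 GB of RAM requested,
 # still under the node's capacity.
 print(oversubscription(25, 0.25, 0.25))

 # The same pods measured against the default burst limits (0.5 CPU, 0.5 GB):
 # RAM comes out at roughly 56% oversubscription.
 print(oversubscription(25, 0.5, 0.5))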

At 11 services per node for 7 nodes, and 25 small services per node for the rest (ignoring non-webservice pods, since they are in the minority for now), we need 25 + 7 = 32 nodes at bare minimum. 40 worker nodes therefore provides 20% reserve capacity. Since that assumes the vast majority use defaults, we should probably fudge upward and start at 45.
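
The node-count arithmetic behind those figures, under the same assumptions (709 webservices, the heavier ~1% concentrated on 7 nodes, the rest at defaults), sketches out as:

 import math

 TOTAL_WEBSERVICES = 709

 # 7 nodes are planned at 11 services each, absorbing the ~1% of services
 # assumed to run at the full self-service limits (1 CPU, 4 GB RAM).
 heavy_nodes = 7
 on_heavy_nodes = heavy_nodes * 11                              # 77 services

 # The remainder are assumed to fit at 25 small services per node.
 small_nodes = round((TOTAL_WEBSERVICES - on_heavy_nodes) / 25)  # ~25 nodes

 bare_minimum = small_nodes + heavy_nodes                       # 32 nodes
 with_reserve = math.ceil(bare_minimum / 0.8)                   # 40 nodes = 20% reserve
 print(bare_minimum, with_reserve)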

From there we can continue to measure capacity and metrics, keeping 50% CPU oversubscription and 20% reserve capacity as general goals.
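
Measuring that on the running cluster means comparing what pods have requested against what nodes can allocate. A minimal read-only sketch with the official Kubernetes Python client (CPU only, and assuming the usual kubeconfig credentials are already in place):

 from kubernetes import client, config

 config.load_kube_config()  # use load_incluster_config() when run inside the cluster
 v1 = client.CoreV1Api()

 def cpu(quantity):
     """Convert a Kubernetes CPU quantity such as '250m' or '2' to cores."""
     return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

 # Total CPU the schedulable nodes can allocate.
 allocatable = sum(cpu(node.status.allocatable["cpu"]) for node in v1.list_node().items)

 # Total CPU requested by all containers in all pods.
 requested = 0.0
 for pod in v1.list_pod_for_all_namespaces().items:
     for container in pod.spec.containers:
         requests = (container.resources.requests or {}) if container.resources else {}
         requested += cpu(requests.get("cpu", "0"))

 print("CPU oversubscription: %.0f%%" % ((requested / allocatable - 1) * 100))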

Ingress

TODO

DNS

TODO