Portal:Cloud VPS/Admin/Virt capacity
Assessing current usage
The best place to see which cloudvirts are currently pooled or depooled is the scheduler pool list in hiera. Ideally, the comments next to that definition will explain the state of each cloudvirt.
How much is enough?
We have adequate current capacity if:
- All cloudvirts are averaging less than 50% CPU usage with little to no CPU blockage during usage spikes
- Even the busiest cloudvirts still have enough free RAM that we're never in danger of running out of memory (OOM). In theory the nova scheduler ensures this, but it's worth keeping an eye on.
- No cloudvirt has a drive more than 70% full. Because the scheduler permits 1.5x disk allocation, going over 70% means that in an extreme case (where all COW VMs use all of their allocated disk space at once) we could hit 100% usage.
We have adequate future capacity if:
- We can experience 30% usage growth and have all of the above still be true. As we begin to eat into that 30% reserve it's time to start thinking about ordering new hardware.
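The current and future capacity criteria above can be sketched as a small check. This is only an illustration, not an existing tool: the `HostUsage` type, the function names, and the sample figures are hypothetical. The CPU and disk limits are the 50% and 70% figures discussed above, while the RAM ceiling is an assumed placeholder, since no number is given for it. Growth is modelled crudely by scaling every usage figure by the same factor.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    name: str
    cpu: float   # average CPU utilisation, as a fraction (0.0 to 1.0)
    ram: float   # fraction of RAM in use
    disk: float  # fraction of the fullest drive in use


# CPU and disk limits follow the figures discussed in the list above.
CPU_MAX = 0.50
DISK_MAX = 0.70
# Assumption: the text gives no RAM threshold, so this is a placeholder.
RAM_MAX = 0.90


def over_limit(host, growth=0.0):
    """Return the resources that exceed their limit once usage grows by `growth`."""
    scale = 1.0 + growth
    checks = {
        "cpu": (host.cpu, CPU_MAX),
        "ram": (host.ram, RAM_MAX),
        "disk": (host.disk, DISK_MAX),
    }
    return [name for name, (used, limit) in checks.items() if used * scale > limit]


def fleet_status(hosts, growth=0.30):
    """Check every host against the limits now and after the assumed growth."""
    now = {h.name: over_limit(h) for h in hosts}
    later = {h.name: over_limit(h, growth) for h in hosts}
    return {
        "current_ok": not any(now.values()),
        "future_ok": not any(later.values()),
        "at_risk_after_growth": {n: r for n, r in later.items() if r},
    }


# Made-up sample figures, for illustration only.
hosts = [
    HostUsage("host-a", cpu=0.30, ram=0.50, disk=0.40),
    HostUsage("host-b", cpu=0.45, ram=0.60, disk=0.60),
]
status = fleet_status(hosts)
```

With these sample figures both hosts pass today, but `host-b` would exceed the CPU and disk limits after 30% growth, which is the point at which ordering new hardware should be on the table.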
Idle capacity, aka Lifeboats
We try to maintain one 'lifeboat' for every 10 cloudvirts. For example, as of March 2019 we have 28 active cloudvirts, so we should have 3 cloudvirts up and running but hosting no actual instances (or, more precisely, hosting only one canary VM per host). The lifeboats are there in case VMs need to be evacuated from another cloudvirt because of a disk failure, impending DC work, or any other issue that will take it offline.
This is complicated by the fact that our cloudvirts are all different sizes, so some lifeboats (e.g. cloudvirt1008) might not be big enough to rescue all the VMs from a larger cloudvirt (e.g. cloudvirt1025). Best to avoid using only the smallest/slowest cloudvirts as lifeboats; ideally at least one of the newest or biggest cloudvirts is empty and ready at all times.
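The lifeboat rule and the size caveat can be sketched as follows. Rounding up is an assumption: the text only gives the 28 to 3 example, which rounding up reproduces. The `Cloudvirt` type, the function names, and all sizes below are hypothetical and do not describe real hosts.

```python
import math
from dataclasses import dataclass


@dataclass
class Cloudvirt:
    name: str
    ram_gb: int
    disk_gb: int


def lifeboats_needed(active_cloudvirts, per_lifeboat=10):
    """One lifeboat per `per_lifeboat` active cloudvirts, rounded up (assumed)."""
    return math.ceil(active_cloudvirts / per_lifeboat)


def can_rescue(lifeboat, used_ram_gb, used_disk_gb):
    """True if the lifeboat has room for everything running on a failing host."""
    return lifeboat.ram_gb >= used_ram_gb and lifeboat.disk_gb >= used_disk_gb


# Made-up sizes, for illustration only.
small = Cloudvirt("small-lifeboat", ram_gb=256, disk_gb=2000)
large = Cloudvirt("large-lifeboat", ram_gb=512, disk_gb=6000)

count = lifeboats_needed(28)                                      # 3
small_fits = can_rescue(small, used_ram_gb=400, used_disk_gb=3500)  # False
large_fits = can_rescue(large, used_ram_gb=400, used_disk_gb=3500)  # True
```

A check like this makes the point in the paragraph above concrete: the number of lifeboats alone is not enough, and at least one of them has to be large enough for the biggest host that might need evacuating.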
What happens if a cloudvirt is overloaded?
There are different failure cases depending on which resource a cloudvirt runs out of. These are the problems we are hoping to avoid by maintaining adequate capacity:
RAM: Instances will spontaneously shut down as libvirt and/or the OOM-killer struggle to free up memory for the system.
CPU: Instances will run more slowly or present as obviously CPU-starved.
Disk: Instances will fail to write to their own file systems. Various other strange things can happen, including potential file corruption.