Wikimedia Cloud Services team/EnhancementProposals/2023 Openstack deployment workflow

This page contains information about the 2023 WMCS team project to evaluate how we deploy openstack for Cloud VPS.

Context

The context which applies to this project.

Why doing this research

There were a number of indicators suggesting that doing this research would be beneficial to us:

Fixing these problems (and others, see additional context below) is not trivial, and redeploying our cloud from scratch may in fact be easier, especially if using automated, repeatable methods supported by upstream and by other cloud operators in the openstack community and industry.

Current model (legacy)

When this project started, the workflow/deployment paradigm was as follows:

  • bare metal hardware provisioning via WMF SRE workflows
  • puppet roles/profiles
    • with no automated instrumentation of the openstack DB
  • server layout (cloudcontrol, cloudvirt, cloudnet, cloudrabbit, etc)
  • osbpo deb packages, see http://osbpo.debian.net/deb-status/
  • edge network model (cloudsw <-> cloudgw)
  • ceph storage cluster on baremetal, managed by puppet as a separate entity
  • a number of local customizations and hacks, see Portal:Cloud_VPS/Admin/Local_customization_and_hacks.

Pain points

The current (legacy) model has a few pain points that we would like to improve:

  • no automation whatsoever for deploying or upgrading
  • no automation for maintaining DB schemas in sync with upstream
  • difficult upgrades, requiring manual updates to our puppet roles/profiles
  • difficult introduction of new openstack components, which requires writing a lot of puppet code
  • apparent (or perceived) low community adoption of osbpo deb packages exposes us to support failure scenarios
  • all local customizations and hacks marked as 'technical debt'

Things to keep

The current (legacy) model also has some aspects worth keeping, and we would like to continue integrating some of them, namely:

  • bare metal hardware provisioning via puppet and other WMF SRE workflows.
  • current server layout (cloudcontrol, cloudvirt, cloudnet, cloudrabbit, etc)
  • edge network model (cloudsw <-> cloudgw)
  • ceph storage cluster on baremetal, managed by puppet as a separate entity
  • some of our local customizations and hacks, the few things that make our openstack deployment Cloud VPS
  • all local customizations and hacks marked as 'worth keeping'

Open questions

There are a number of open questions that should be considered as part of the context for this project.

  • Upgrades impact on users: in a potential new model, what downtime would an upgrade from openstack version N to N+1 introduce for end users.
  • Migration impact on users: if adopting a new model, how the migration from the old model to the new one would impact users.
  • Full automation: if adopting a new model with more focus on automation, we would like to get all of its benefits. We would likely want the option to redeploy the whole cloud at any time, and in particular during upgrades.
  • Timeline: if adopting a new model, an estimate of how long it would take the team to complete the migration.
  • Sustainability: if adopting a new model, we would like to make sure that it is sustainable with the resources the WMCS team has (hardware, datacenters, human resources, etc.)

Final proposal

After the evaluation, the top three options were ranked as follows:

  1. openstack-helm: with a baremetal kubernetes undercloud.
  2. kolla-ansible: the simplest and most convenient way to deploy to baremetal (apparently)
  3. openstack-ansible: similar to the above, but apparently a bit more cumbersome to operate (or at least, to experiment with)


Therefore, the recommendation is to try adopting openstack-helm.

Final proposal: a new model

This is what our new model could look like if we adopt openstack-helm.

In general, the new model integrates well with the context described in the section above, including the ceph storage farm, the hardware provisioning lifecycle, etc.

Some details worth noting follow.

New model: puppet layout

Since most of the logic to set up openstack itself would now be under openstack-helm control, puppet's mission would be reduced to ensuring kubernetes itself can run on the cluster.

The puppet code for that already exists and is in use by Toolforge (kubeadm).
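
For illustration, a minimal sketch of what the kubeadm-based bootstrap boils down to, assuming an illustrative control-plane endpoint (the actual steps are driven by the existing puppet code):

  # On the first control-plane node (values are illustrative only)
  sudo kubeadm init \
      --control-plane-endpoint "k8s.openstack.example.wmcloud.org:6443" \
      --upload-certs

  # On every additional node (cloudcontrol, cloudvirt, cloudnet, ...), join with the
  # token and CA hash printed by the init step above
  sudo kubeadm join "k8s.openstack.example.wmcloud.org:6443" \
      --token <token> --discovery-token-ca-cert-hash sha256:<hash>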

New model: hardware

The hardware layout footprint would be nearly identical to the current (legacy) model.

We should keep having special meaningful names for each kind of server (cloudcontrol, cloudvirt, cloudnet, etc). This can be easily integrated into Kubernetes using node labels and similar mechanisms, as sketched below.
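
As a sketch, node labels could encode the server kind and the charts could then be pinned to the matching nodes via node selectors (the label key below is hypothetical):

  # Label nodes according to their hardware role (hypothetical label key)
  kubectl label node cloudvirt1001 wmcs.wikimedia.org/role=cloudvirt
  kubectl label node cloudcontrol1001 wmcs.wikimedia.org/role=cloudcontrol
  kubectl label node cloudnet1001 wmcs.wikimedia.org/role=cloudnet

  # Charts (e.g. nova-compute) would then set a matching nodeSelector in their values
  # so pods only land on the intended kind of server.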

The hardware specification for the hosts (disks, memory, network interface cards, racking, etc) would remain the same, as we have today.

The main point to consider here is that we would need a container registry for our own use, perhaps not deployed within Cloud VPS (to avoid chicken-and-egg problems).

New model: edge networking

No changes. We would retain the current CR <-> cloudsw <-> cloudgw edge topology.

From cloudgw inwards, we would need to decide what to do with the neutron SDN, whether to implement tenant networks, and how that would affect Cloud VPS as a service. This particular decision, however, is independent of how we deploy openstack.

New model: testing & upgrade procedures

With openstack-helm we should reconsider our testing & upgrade procedures.

In particular, we should be able to introduce an additional testing tier, or to effectively separate development/staging setups.

Something like:

  • local development: something similar to Toolforge lima-kilo.
  • cloud development: a cloud-within-the-cloud development environment.
  • staging: we keep the codfw physical deployment. Having a physical mirror setup is still very valuable, as it allows us to test integration with physical elements: the datacenter environment, the network, etc.

New model: additional considerations about kubernetes

Adopting a kubernetes base layer / undercloud means that we will need to answer a bunch of questions:

  • We need a dedicated container registry for openstack-helm (similar to the kolla-ansible case). It may or may not live inside the k8s deployment itself.
  • Do we fork openstack-helm code into our own repo for local modifications from day 0?
  • Do we want to integrate some gitops workflow from day 0?
  • Do we want to keep the kubernetes deployment as stateless as possible? That would mean hosting the more stateful bits of openstack outside, basically the openstack DB.
  • Do we want to deploy etcd inside k8s or outside?
  • Do we create a local kubernetes development environment (similar to Toolforge lima-kilo) from day 0?

This is just a preview of the considerations, questions and tradeoffs we will face when dealing with the kubernetes undercloud.
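
As a rough sketch of how the registry question surfaces in practice, deploying a chart with openstack-helm could look like this (chart names follow upstream openstack-helm; the registry URL and override file are hypothetical):

  # Add the upstream openstack-helm chart repository (or a local mirror of it)
  helm repo add openstack-helm https://tarballs.opendev.org/openstack/openstack-helm/
  helm repo update

  # Deploy keystone, with a values override pointing image references at our own registry
  helm install keystone openstack-helm/keystone \
      --namespace openstack --create-namespace \
      --values local-overrides.yaml   # e.g. images.tags.* -> registry.example.wmcloud.org/...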

Final proposal: migration strategy

If we decide to move forward with the new model, then the next question is how we migrate to it. Here are some ideas on how to accomplish that, defined as stages.

In general, the idea is to have a parallel deployment in each datacenter and relocate users/projects/workloads from one to the other.

This means that during the migration there would be a total of four deployments.

  • eqiad1 (customer-facing, legacy model)
  • eqiad2 (customer-facing, new model)
  • codfw1dev (testing/staging, legacy model)
  • codfw2dev (testing/staging, new model)

migration stage 1: plan & procure hardware

We would need to decide on the minimal hardware required to support the next stages.

migration stage 2: bootstrap new model

Create a baremetal kubernetes cluster and bootstrap the new model using openstack-helm. First we do this in codfw and later in eqiad.

We end up with a new set of openstack API endpoints.
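
Once the new deployment is bootstrapped, the new API endpoints could be sanity-checked with something like the following (the codfw2dev cloud name here is just a hypothetical clouds.yaml entry for the new deployment):

  # List the service catalog and hypervisors of the freshly bootstrapped deployment
  openstack --os-cloud codfw2dev endpoint list
  openstack --os-cloud codfw2dev compute service list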

migration stage 3: testing

We start testing things in the new model and the integration with the legacy one.

It is highly likely that keystone & horizon would be shared between each pair of deployments. Same for networking integration and things like bastions and DNS domains.

So, things to test and figure out are, at least:

  • keystone
  • horizon
  • network connectivity between each pair of deployments
  • bastion access between each pair of deployments
  • DNS domain names
  • data integration, for things like ceph, cinder volumes, NFS servers and databases
  • whether we can live-migrate VMs between deployments

migration stage 4: migration

We start the actual relocation of workloads from the legacy model to the new one.

This could take several forms, depending on what proves most efficient in the testing done in earlier stages:

  • organic migration: stop scheduling new workloads on the legacy deployment and only allow new workloads on the new deployment (see the sketch after this list)
  • manual migration: coordinate with the community the relocation of VMs to the new deployment
  • automatic migration: via some automation that we would need to invent
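
For the organic option, stopping new workloads from landing on the legacy deployment could be a matter of disabling the nova-compute service on the legacy hypervisors (a minimal sketch; the cloud name and hostname are illustrative):

  # Disable scheduling of new VMs on a legacy hypervisor; running VMs are unaffected
  openstack --os-cloud eqiad1 compute service set \
      --disable --disable-reason "eqiad1 -> eqiad2 migration" \
      cloudvirt1001 nova-compute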

migration stage 5: cleanup

Once workload migration is done, the cleanup stage will include several important changes, such as:

  • relocation / redeploy of any openstack component we couldn't migrate to the new model earlier (for example, keystone or horizon if shared)
  • hardware recycling
  • puppet cleanup

Final proposal: timeline

This is an estimate of the timeline involved in adopting a new model based on openstack-helm.

  • stage 1: 6 months --- hardware procurement and setup can take a long time, as it depends on many external factors.
  • stage 2: 3 months --- if some work is done in parallel with the previous stage, this can be somewhat quick.
  • stage 3: 3 months --- if testing is done somewhat in parallel with previous stages, this can be somewhat quick.
  • stage 4: 12 months --- there are likely no shortcuts here. Depending on the type of migration, some end-user actions may or may not be required, and we need to give them lead time.
  • stage 5: 3 months --- can potentially be done somewhat in parallel with previous stage.
  • total: 27 months

Other options considered

List of options that were considered.

kolla-ansible

See: kolla-ansible https://docs.openstack.org/kolla-ansible/latest/

This option is widely used in the community. It is based on kolla containers (a thin abstraction over docker containers).

kolla-ansible evaluation

Some notes about the kolla-ansible evaluation.

kolla-ansible evaluation: inside Cloud VPS

The first evaluation, in a single-node setup inside Cloud VPS, was a success and can be replicated as follows:

  • create a VM with the puppet role::wmcs::openstack::kolla_ansible_evaluation
  • then run wmcs-kolla-ansible-evaluation.sh
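
For reference, the convenience script roughly wraps the upstream single-node flow, along these lines (the inventory path is illustrative; the authoritative steps are in the script itself):

  # Roughly the upstream all-in-one flow (see the kolla-ansible quickstart docs)
  kolla-genpwd                                     # generate /etc/kolla/passwords.yml
  kolla-ansible -i ./all-in-one bootstrap-servers  # prepare the host (docker, etc.)
  kolla-ansible -i ./all-in-one prechecks          # sanity checks
  kolla-ansible -i ./all-in-one deploy             # deploy the openstack containers
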
kolla-ansible evaluation: code patching

The kolla ecosystem has good documentation on how to work with local patches. See https://docs.openstack.org/kolla/latest/admin/image-building.html

Since we're talking about container images, this approach requires a local image registry somewhere in our infrastructure.
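
A sketch of how locally patched images could be built and pushed to such a registry (the registry URL is hypothetical; see the kolla image-building documentation linked above for the real options):

  # Build Debian-based kolla images with local patches and push them to our registry,
  # so that kolla-ansible pulls them from there instead of from upstream.
  kolla-build --base debian \
      --registry registry.example.wmcloud.org \
      --push \
      keystone nova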

openstack-ansible

See: openstack-ansible https://opendev.org/openstack/openstack-ansible

It can deploy code in two modes:

  • lxc: inside containers (default)
  • metal: directly on the hosts, without containers

openstack-ansible evaluation

Some notes about openstack-ansible evaluation on Cloud VPS.

The evaluation was conducted both in a raw virtual machine on a laptop and in a Wikimedia Cloud VPS virtual machine. In both cases, the setup was more cumbersome to operate compared to kolla-ansible.

openstack-ansible evaluation: inside Cloud VPS

For evaluation purposes the upstream recommendation is to use the All-in-one (aio) setup, which includes dedicated documentation, playbooks, etc.
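
Roughly, the upstream aio bootstrap looks like this (per the upstream quickstart at the time of the evaluation; exact script and scenario names may differ between releases):

  # Clone openstack-ansible and bootstrap an all-in-one (aio) environment
  git clone https://opendev.org/openstack/openstack-ansible /opt/openstack-ansible
  cd /opt/openstack-ansible
  scripts/bootstrap-ansible.sh
  export SCENARIO='aio_lxc'     # or aio_metal for the mode without containers
  scripts/bootstrap-aio.sh

  # Then run the main playbooks
  cd playbooks
  openstack-ansible setup-hosts.yml
  openstack-ansible setup-infrastructure.yml
  openstack-ansible setup-openstack.yml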

It is difficult to have openstack-ansible running inside Cloud VPS as it is today for a number of reasons:

  • Regarding disk storage size, the aio setup requires more space than our available flavors provide. A separate, dedicated cinder volume needs to be attached to the VM for it to pass the size pre-checks (see the sketch after this list).
  • One of the aio playbooks renames the VM hostname to aio1. If we instead name the VM aio1 at creation time, the playbook still renames the domain, i.e. from whatever.eqiad1.wikimedia.cloud to aio1.openstack.local. This effectively breaks the puppet connection, which is bad for a number of reasons.
  • The aio setup tries to SSH to itself using the above FQDN (rather than localhost), which is not allowed by our puppet-managed SSH setup.
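
For the disk-size issue, the workaround was along these lines (size and names are illustrative):

  # Create and attach a scratch cinder volume so the aio size pre-checks pass
  openstack volume create --size 100 osa-aio-scratch
  openstack server add volume osa-aio-evaluation-vm osa-aio-scratch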

A patch was created to introduce a convenience script for evaluating inside Cloud VPS, but as of this writing it hasn't been merged, as it is not enough to get openstack up and running: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895789

openstack-ansible evaluation: code patching

By default, openstack-ansible will pull source code from upstream git repositories.

For operators to patch in their own code, there are several override mechanisms, the main one being selecting a different source code repository to deploy from.

See https://docs.openstack.org/openstack-ansible/latest/user/source-overrides/index.html
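
As a sketch, such an override goes into /etc/openstack_deploy/user_variables.yml and points a given service at a different repository and branch (the variable names follow the pattern used by the upstream roles, e.g. for nova; the fork URL and branch are hypothetical):

  # Point the nova role at a forked repository/branch (illustrative values)
  printf '%s\n' \
      'nova_git_repo: https://gerrit.wikimedia.org/r/cloud/openstack/nova' \
      'nova_git_install_branch: wmcs/patched' \
      >> /etc/openstack_deploy/user_variables.yml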

kayobe

See: Kayobe https://opendev.org/openstack/kayobe

why not adopting kayobe

This approach is mostly a combination of kolla-ansible and bifrost: deploy a standalone ironic, then manage the cloud with kolla-ansible.

The kolla-ansible part is indeed very useful in our context. But since we're mostly happy with our bare metal hardware provisioning via puppet and other WMF SRE workflows, this option as a whole adds little value for the WMCS team. See also why not adopting bifrost on this very page.

tripleo

See: TripleO https://opendev.org/openstack/tripleo-ansible https://opendev.org/openstack/tripleo-quickstart

why not adopting tripleo

We were about to research into this option when it was announced that the project would be discontinued, see https://lists.openstack.org/pipermail/openstack-discuss/2023-February/032083.html

Rumors are that Kubernetes is now the preferred option for running the undercloud; see the kubernetes openstack-helm section below.

bifrost

See: Bifrost https://opendev.org/openstack/bifrost

Basically a stand-alone ironic deployment, based on ansible, to manage baremetal hardware provisioning.

why not adopting bifrost

Since we're mostly happy with our bare metal hardware provisioning via puppet and other WMF SRE workflows, this option adds little value for the WMCS team.

kubernetes openstack-helm

See: openstack-helm https://opendev.org/openstack/openstack-helm/

It is worth noting that some companies, like Red Hat, are seriously moving in the direction of running openstack on kubernetes. See for example: Youtube: Modernizing your Red Hat OpenStack Platform’s operational.

why not adopting kubernetes openstack-helm

Kubernetes adds a lot of flexibility to manage workloads, automation, worker node lifecycle and even multiple Openstack deployments in a single k8s cluster. However, it also involves managing the lifecycle of kubernetes itself, which can be challenging for a number of reasons.

Our context is rather simple, see section above, and the balance between complexity (of the k8s layer) and benefits (additional scale, automation, etc) of this option suggests that we should go for something simpler.

If we were a full-scale service provider, with many regions, deployments and DCs dedicated to openstack, then things would probably balance in favor of this option.

Also, if we had access to a full, third-party-managed kubernetes cluster, thus removing the need to manage the lifecycle of the undercloud on our own, this could be re-evaluated.

why adopting kubernetes openstack-helm

There are a number of reasons that may make the price of having kubernetes worth paying.

  • Using openstack-helm could open the door to deploying multiple openstack instances in different namespaces, allowing some kind of Openstack-as-a-Service offering.
  • We already have puppet code to install a kubernetes cluster to baremetal (kubeadm, which is what Toolforge uses).
  • We could leverage lessons learned from Toolforge lima-kilo and develop a local development cloud from day 0 based on kubernetes. Imagine running Cloud VPS on your laptop for testing / devel purposes.
  • This would be a cloud-native architecture, and in that regard it should be better suited for the next 10 years of industry evolution

puppet-openstack

See: https://docs.openstack.org/puppet-openstack-guide/latest/index.html

A collection of upstream-supported puppet modules for deploying and maintaining openstack via puppet.

why not adopting puppet-openstack

The WMF-SRE operations/puppet.git tree is available to us, so this puppet-openstack option is not something to discard lightly.

However, openstack is a complex system with many services, and the puppet integration is also complex. We would need to introduce a lot of dependencies and code into the WMF-SRE operations/puppet.git repository. It also relies on deb packages for installation, which we suspect have little community usage.

The documentation is scarce with no signs of getting better, and there is very low activity in the documentation repository; see https://opendev.org/openstack/puppet-openstack-guide/commits/branch/master

Overall, this option doesn't feel like it would move us towards the future and the strongest community trends.


airship

See: https://www.airshipit.org/

A set of tools to deploy Openstack into kubernetes.

Why not adopting airship

The community has been very quiet for a little while now. There are mainly maintenance activities going on for the components that are running in production at a few companies, carried out by only a handful of people.

Probably not the level of activity/adoption/support we're looking for.

Source: private email from Open Infrastructure Foundation itself.

starlingx

See: https://www.starlingx.io/

Software to automatically deploy Kubernetes onto containers/VMs/baremetal and then inject an Openstack cloud on top of it, integrating both day-1 and day-2 operation workflows. It has a focus on distributed clouds (edge computing, IoT, etc).

why not adopting starlingx

The project is aimed at deployers of hundreds of clouds in a distributed fashion. While it provides extreme automation and simplification of the deployment process, we don't feel the scale of the project matches our needs; in that sense it is an over-engineered solution for our use case.

See also

TODO.