Portal:Cloud VPS/Admin/Openstack upgrade

We currently deploy OpenStack using Puppet and Debian Packages. Horizon is deployed from source, so the Horizon upgrade process is completely different.

OpenStack components

We are not running all the existing OpenStack components. These are the components that we installed for OpenStack version Zed:

cinder
glance
horizon
magnum
nova
trove
barbican
designate
heat
keystone
neutron
placement

This list might change in future upgrades.

Coupling and dependencies

In theory, different OpenStack components (e.g. Nova, Glance, Designate) can run different API versions without issue, while the different services within a component (e.g. nova-conductor, nova-api, nova-compute, nova-scheduler) need to be in sync.

In practice, we run most of our services together on the same cloudcontrol nodes. So everything on the cloudcontrols needs to be upgraded together to avoid .deb dependency disasters.

Staging the upgrade in puppet

The puppet code for deploying OpenStack is split up based on OpenStack version. You can get an easy view of what this looks like by running

~/puppet$ find modules/openstack/ -name "*<current version>*"

Every file that appears in that list will need to be duplicated and modified to support the target release version. Fortunately, most OpenStack components don't change much from version to version. The 'files' and 'templates' subdirs can just be duplicated; the manifests will need to be copied and then modified via search/replace to reflect the new version name.

Once all the files have been created and edited, commit them as a single giant patch, indicating that they are direct copies of the previous version. Example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/874906 (Note: in this example the files were committed in a stack of 13 patches, it's probably fine to create a single giant patch instead).

Then... start reading release notes for the coming version. Files and manifests will need to be altered according to any upgrade or deprecation notes, and additional, subsequent patches can be committed based on these changes. All of these changes (including the initial copy commit) can be safely merged since they apply to a version that we aren't currently running.

Upgrade the testing deployment (codfw1-dev)

Make a patch that alters the running version in the Testing deployment. That patch will be very small and look something like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/876020. You can merge that patch straight away as it will only affect the testing deployment, but please write a message in IRC (#wikimedia-cloud-admin) to notify other admins that the testing deployment might break. Then follow the steps in the section "Upgrade steps" below.

Upgrade the main deployment (eqiad1)

Once you are satisfied everything is working in the testing deployment, repeat all the above steps for the main deployment (eqiad1).

Upgrades to the main deployment should happen during a scheduled and announced maintenance window.

Upgrade the cloudservices* hosts first, with a puppet patch like this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/854096. Don't merge this last patch until you're ready to proceed with the upgrade and have scheduled a maintenance window. When you are ready, disable puppet on the cloudservices* hosts, merge the patch and run the upgrade cookbook (see Upgrading cloudservices nodes below).

Then upgrade all the other hosts with a patch like this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856648.

Before merging the patch, disable puppet on all affected hosts. A Cumin query like this one should work, but verify the list of hosts that is returned:

fnegri@cloudcumin1001:~$ sudo cumin 'P{C:openstack::serverpackages::zed::bookworm} and A:eqiad' 'disable-puppet Txxxxxx'

Merge it and run the upgrade cookbooks on all cloudcontrols, cloudnets, cloudvirts as detailed below.

Upgrade steps (common to both deployments)

These steps are common to both deployments, make sure to specify the correct one where a command includes the deployment name (e.g. --deployment eqiad1).

Upgrading Openstack API services

Preparing for an upgrade

During the upgrade window various API calls will fail and/or produce partial results; it's best to avoid user interaction with the APIs by disabling Horizon:

$ cookbook wmcs.openstack.cloudweb.set_maintenance --deployment (codfw1dev|eqiad1) --task-id Txxxxxx

API version upgrades will not interrupt running VMs, with one exception: When the cloudnet nodes are upgraded there will be a brief network outage during service failover.

Once the hiera version setting is applied, Puppet will update all the configuration files but will NOT perform the actual upgrade of the packages. That is done by running a cookbook and varies slightly for each node type (cloudservices, cloudcontrols, cloudnets, cloudvirts). More details for each type are provided below.

The upgrade cookbook will backup OpenStack databases before upgrade, so there's no need to run advance backups.

Upgrading cloudservices nodes

The cookbook should do everything necessary, without any impact to end users:

cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudservicesXXXX.eqiad.wmnet --task-id Txxxxxx

No actual openstack services run on cloudservices nodes, so the upgrade should be fairly trivial, only upgrading client packages.

The cookbook will upgrade a node, doing these general steps:

Backup OpenStack databases
Upgrade Debian packages
Upgrade OpenStack database schemas
Apply Puppet
Reboot

Upgrading cloudcontrol nodes

Upgrading a cloudcontrol node can be done with the cloudcontrol.upgrade_openstack_node cookbook:

cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudcontrolXXXX.wikimedia.org --task-id Txxxxxx

Once the first cloudcontrol node has been upgraded, all services running on other nodes will begin to fail due to database version incompatibility. Don't stop until you've upgraded everything else!

Because the cloudcontrol nodes also run Galera which relies on a two-node quorum, this must be done one cloudcontrol at a time. After each node has finished its reboot, log in and confirm that Galera is back in sync.

andrew@cloudcontrol1005:~$ sudo mariadb
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 42
Server version: 10.5.15-MariaDB-1:10.5.15+maria~bullseye-log mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
1 row in set (0.001 sec)

Upgrading cloudnet nodes

We have a suite of network tests that can be used to confirm that the Neutron network is working properly before and after upgrade:

cookbook wmcs.openstack.network.tests --cluster-name (codfw1dev|eqiad1) --task-id Txxxxxx

You can upgrade cloudnet nodes with the upgrade_openstack_node cookbook:

cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudnetXXXX.eqiad.wmnet --task-id Txxxxxx

It's important to do them in the right order to minimize network downtime. Start with the standby host, as determined with neutron l3-agent-list-hosting-router:

# openstack network agent list --agent-type l3 --router  cloudinstances2b-gw --long
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| ID                                   | Agent Type | Host         | Availability Zone | Alive | State | Binary           | HA State |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | L3 agent   | cloudnet1006 | nova              | :-)   | UP    | neutron-l3-agent | standby  |
| 6a88c860-29fb-4a85-8aea-6a8877c2e035 | L3 agent   | cloudnet1005 | nova              | :-)   | UP    | neutron-l3-agent | active   |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+

Once the standby node is upgraded, confirm (by re-running the above command) that it is up and ready to accept traffic; it should show :-) under 'alive' and an ha_state of 'standby' again. If it's ready, run the cookbook on the active note. Failover should be almost immediate.

Re-enable Horizon

After the control and network nodes are upgraded, it's safe to re-enable Horizon. Cloudvirt nodes should still function with the old openstack version as long as it's only one release behind.

$ cookbook wmcs.openstack.cloudweb.unset_maintenance --deployment (codfw1dev|eqiad1) --task-id Txxxxxx

Upgrading cloudvirt nodes

If nova-compute has a version mismatch with nova-conductor, it will continue to respond to queries but refuse to schedule new VMs. To catch up, upgrade with the cloudvirt.live_upgrade_openstack cookbook:

cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack --fqdn-to-upgrade cloudvirtXXXX.eqiad.wmnet --task-id Txxxxxx

Upgrading a cloudvirt does not require a reboot.

After upgrading the first cloudvirt, check if things are working as expected:

you should be able to SSH to a VM running in that cloudvirt (use openstack server list --all-projects --host cloudvirtXXXX from a cloudcontrol to find a list of running VMs)
there should be no errors in the Nova logs in the upgraded cloudvirt (journalctl -u nova-compute.service)
the cloudvirt should be listed as up if you run openstack compute service list from a cloudcontrol host

Once you have upgraded a couple cloudvirts and things seem to be working fine, you can use a simple for loop to upgrade a few ones in a row without user intervention:

# run this from cloudcuminXXXX in a screen/tmux session
for i in {0..9}; do sudo cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack --fqdn-to-upgrade cloudvirt105$i.eqiad.wmnet --task-id T348843; done

TODO: modify the cookbook so that it can upgrade more than one cloudvirt at a time.

Finally

Double-check the network tests to make sure things are in a decent state:

cookbook wmcs.openstack.network.tests --cluster-name (codfw1dev|eqiad1) --task-id Txxxxxx

Re-enable Horizon, and restart the fullstack test suite:

root@cloudcontrol1005:~# systemctl restart nova-fullstack.service

Then, watch the fullstack tests for a cycle or two to confirm that all is well.

Please note that nova-fullstack.service is only installed in one cloudcontrol host.

Upgrading Horizon

Any given Horizon release is typically compatible with a variety of different OpenStack API versions, so the Horizon release version doesn't necessarily track the API release version. It's also easy to skip ahead versions in Horizon, so typically it leapfrogs ahead 2 or 3 releases and then waits for the API versions to catch up.

Staging the upgrade in puppet

There are relatively few version-specific files in puppet for Horizon. To get a list, try

$ find modules/openstack/ -name "*horizon*"

Duplicate each file, renaming to match the desired install version (example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/852998). There should be no manifest changes needed.

Finally, you'll need a version change patch, like this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856663

Staging the upgrade in git

We install Horizon from source; most of the source is hosted locally in order to track diffs against upstream. The top-level repo is 'horizon/deploy', found here: https://gerrit.wikimedia.org/r/admin/repos/openstack/horizon/deploy

All other necessary components are stored in submodules. Most of those submodules are duplicates of corresponding upstream packages hosted at review.opendev.org/. For upgrades, make a new branch in the 'deploy' repo and then rebase each submodule on the appropriate upstream branch, resolving conflicts as you go.

Ultimately for version X you should have a new upstream X branch for each submodule, and an upstream X branch for the 'repo' module that contains references to the proper version of each submodule. Once all that's put together, you need to rebuild your wheel subdir by running the 'make_wheels.sh' script. Be sure to build wheels on the same OS version that you plan to deploy on. Once the new wheels are built, commit them to a new X branch and update the deploy branch to refer to that wheels branch.

Now, you are ready to deploy.

Upgrading labweb nodes

[todo]