Portal:Cloud VPS/Admin/Openstack upgrade

From Wikitech

We currently deploy OpenStack using Puppet and Debian packages. Horizon is deployed from source, so the Horizon upgrade process is completely different.


Upgrading OpenStack API services

Coupling and dependencies

In theory, different OpenStack projects (e.g. Nova, Glance, Designate) can run different API versions without issue, while the different services within a project (e.g. nova-conductor, nova-api, nova-compute, nova-scheduler) need to be in sync.

In practice, we run most of our services together on the same cloudcontrol nodes, so everything on the cloudcontrols needs to be upgraded together to avoid .deb dependency disasters. Designate, which runs on the cloudservices nodes, can be upgraded separately from the other services and/or run a different version.

Staging the upgrade in puppet

The puppet code for deploying OpenStack is split up based on OpenStack version. To see what this looks like, run:

~/puppet$ find modules/openstack/ -name "*<current version>*"

Every file that appears in that list will need to be duplicated and modified to support the target release version. Fortunately, most OpenStack projects don't change much from version to version. The 'files' and 'templates' subdirs can just be duplicated; the manifests will need to be copied and then modified via search/replace to reflect the new version name.
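
The copy-and-rename step can be scripted. The following is a minimal sketch, not an existing tool: the function name stage_release and the release names in the usage line are made up for illustration, and it assumes GNU find and sed plus manifests ending in .pp. Review the result by hand before committing.

```shell
# Sketch only: copy every path under ROOT whose name contains OLD to a
# sibling path named for NEW. "zed" and "antelope" are example release names.
# Usage: stage_release modules/openstack zed antelope
stage_release() {
    root=$1; old=$2; new=$3
    list=$(mktemp)
    # Record the top-most matches first so the copies made below are never
    # picked up by the same find run.
    find "$root" -name "*${old}*" -prune > "$list"
    while IFS= read -r src; do
        dst="$(dirname "$src")/$(basename "$src" | sed "s/${old}/${new}/g")"
        cp -r "$src" "$dst"
        # Rename nested paths that still carry the old name, deepest first.
        find "$dst" -depth -name "*${old}*" | while IFS= read -r p; do
            mv "$p" "$(dirname "$p")/$(basename "$p" | sed "s/${old}/${new}/g")"
        done
        # Only manifests get the search/replace; files and templates stay verbatim.
        find "$dst" -type f -name '*.pp' -exec sed -i "s/${old}/${new}/g" {} +
    done < "$list"
    rm -f "$list"
}
```

Files and templates are copied untouched on purpose, matching the advice above; anything in them that names the release still needs a manual look.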

Once all the files have been created and edited, commit them as a single giant patch, indicating that they are direct copies of the previous version.

Next, read the release notes for the target version. Files and manifests will need to be altered according to any upgrade or deprecation notes, and these changes can be committed as additional, subsequent patches. All of these changes (including the initial copy commit) can be safely merged since they apply to a version that we aren't currently running.

Finally, make a patch that actually alters the running version. That patch will be very small (example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856648). Don't merge this last patch until you have scheduled a maintenance window and are ready to proceed with all of the following steps.

Preparing for an upgrade

Upgrades should happen during a scheduled and announced maintenance window. During the upgrade window various API calls will fail and/or produce partial results; it's best to avoid user interaction with the APIs by disabling Horizon:

$ cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.cloudweb.set_maintenance --deployment eqiad1

API version upgrades will not interrupt running VMs, with one exception: When the cloudnet nodes are upgraded there will be a brief network outage during service failover.

The upgrade cookbook will back up the OpenStack databases before upgrading, so there's no need to run backups in advance.

Upgrading cloudcontrol nodes

Once the hiera patch to change the version string is merged, upgrading a cloudcontrol node can be done with the cloudcontrol.upgrade_openstack_node cookbook:

cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudcontrolXXXX.wikimedia.org

The cookbook will upgrade a node, doing these general steps:

  1. Back up OpenStack databases
  2. Upgrade Debian packages
  3. Upgrade OpenStack database schemas
  4. Apply Puppet
  5. Reboot

Once the first cloudcontrol node has been upgraded, all services running on other nodes will begin to fail due to database version incompatibility. Don't stop until you've upgraded everything else!

Because the cloudcontrol nodes also run Galera, which relies on a two-node quorum, this must be done one cloudcontrol at a time. After each node has finished its reboot, log in and confirm that Galera is back in sync:

andrew@cloudcontrol1005:~$ sudo mysql -u root
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 42
Server version: 10.5.15-MariaDB-1:10.5.15+maria~bullseye-log mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
1 row in set (0.001 sec)
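
The same check can be scripted so that the next cloudcontrol is not touched until the current one has rejoined the cluster. This is a sketch only: the function names galera_in_sync and wait_for_galera and the 10-second polling interval are assumptions, not part of the cookbook.

```shell
# Reads "name<TAB>value" lines on stdin, as printed by `mysql -N -B`, and
# succeeds only if the node reports Synced and wsrep_ready is ON.
galera_in_sync() {
    awk -F '\t' '
        $1 == "wsrep_local_state_comment" && $2 == "Synced" { synced = 1 }
        $1 == "wsrep_ready" && $2 == "ON" { ready = 1 }
        END { exit !(synced && ready) }
    '
}

# Poll the local MariaDB until it is back in sync (interval is arbitrary).
wait_for_galera() {
    until sudo mysql -u root -N -B -e "SHOW STATUS LIKE 'wsrep%'" | galera_in_sync; do
        sleep 10
    done
}
```

Run wait_for_galera on the freshly rebooted cloudcontrol; it returns once both status values match the output shown above.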

Upgrading cloudnet nodes

We have a suite of network tests that can be used to confirm that the Neutron network is working properly before and after upgrade.

You can upgrade cloudnet nodes with the same upgrade_openstack_node cookbook, but it's important to do them in the right order to minimize network downtime. Start with the standby host, as determined with neutron l3-agent-list-hosting-router:

# neutron l3-agent-list-hosting-router cloudinstances2b-gw
neutron CLI is deprecated and will be removed in the Z cycle. Use openstack CLI instead.
+--------------------------------------+--------------+----------------+-------+----------+
| id                                   | host         | admin_state_up | alive | ha_state |
+--------------------------------------+--------------+----------------+-------+----------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | cloudnet1006 | True           | :-)   | standby  |
| 6a88c860-29fb-4a85-8aea-6a8877c2e035 | cloudnet1005 | True           | :-)   | active   |
+--------------------------------------+--------------+----------------+-------+----------+

Once the standby node is upgraded, confirm (by re-running the above command) that it is up and ready to accept traffic; it should show :-) under 'alive' and an ha_state of 'standby' again. If it's ready, run the cookbook on the active node. Failover should be almost immediate.
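
To avoid misreading the table under pressure, the host for a given ha_state can be pulled out with a small filter. This is a sketch that assumes the column layout shown above; the function name cloudnet_with_state is hypothetical.

```shell
# Reads the table printed by `neutron l3-agent-list-hosting-router` on stdin
# and prints the host(s) whose ha_state equals the first argument.
cloudnet_with_state() {
    awk -F '|' -v want="$1" '
        NF >= 6 {
            # Fields: $2 id, $3 host, $4 admin_state_up, $5 alive, $6 ha_state
            host = $3; state = $6
            gsub(/ /, "", host); gsub(/ /, "", state)
            if (state == want) print host
        }
    '
}
```

For example, `neutron l3-agent-list-hosting-router cloudinstances2b-gw | cloudnet_with_state standby` prints the node to upgrade first. Border lines and the deprecation warning contain no '|' separators, so they are ignored.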

Re-enable Horizon

After the control and network nodes are upgraded, it's safe to re-enable Horizon. Cloudvirt nodes should still function with the old OpenStack version as long as it's only one release behind.

$ cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.cloudweb.unset_maintenance --deployment eqiad1

Upgrading cloudvirt nodes

If nova-compute has a version mismatch with nova-conductor, it will continue to respond to queries but refuse to schedule new VMs. To catch up, upgrade with the cloudvirt.live_upgrade_openstack cookbook:

cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.cloudvirt.live_upgrade_openstack --fqdn-to-upgrade cloudvirtXXXX.eqiad.wmnet

Upgrading a cloudvirt does not require a reboot.

TODO: automate upgrading all cloudvirts with one cookbook so the operator doesn't have to copy/paste 30 times
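
Until such a cookbook exists, a plain shell loop can stand in. This is a rough sketch: the helper names run and upgrade_cloudvirts, the DRY_RUN variable, and the idea of keeping the host list in a file are all assumptions. It stops at the first failure rather than carrying on.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Upgrade every cloudvirt listed (one FQDN per line) in the file given as $1.
upgrade_cloudvirts() {
    hosts_file=$1
    # The list is read on fd 3 so the cookbook keeps the terminal on stdin.
    while IFS= read -r fqdn <&3; do
        [ -n "$fqdn" ] || continue
        run cookbook -c ~/.config/spicerack/cookbook_config.yaml \
            wmcs.openstack.cloudvirt.live_upgrade_openstack \
            --fqdn-to-upgrade "$fqdn" \
            || { echo "upgrade failed on $fqdn; stopping" >&2; return 1; }
    done 3< "$hosts_file"
}
```

Set DRY_RUN=1 first to print the commands and check the host list, then unset it and run `upgrade_cloudvirts cloudvirts.txt` for real.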

Upgrading cloudbackup nodes

The same cookbook that upgrades cloudcontrol nodes can be used on cloudbackup nodes as well. Determine the hostnames with:

# openstack volume service list | grep backup
| cinder-backup    | cloudbackup2002      | nova | enabled | up    | 2022-11-15T03:58:14.000000 |

Upgrading these hosts will probably break whatever backup is currently in progress; missing a day should be OK as long as the service is healthy before and after.

Upgrading cloudservices nodes

The same cookbook that upgrades cloudcontrol nodes can be used on cloudservices nodes as well. Once the hiera version setting is applied, the cookbook should do everything necessary:

cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudservicesXXXX.wikimedia.org

Finally

Double-check the network tests to make sure things are in a decent state. Re-enable Horizon, and restart the fullstack test suite:

root@cloudcontrol1005:~# systemctl restart nova-fullstack.service

Then, watch the fullstack tests for a cycle or two to confirm that all is well.

Upgrading Horizon

Any given Horizon release is typically compatible with a variety of different OpenStack API versions, so the Horizon release version doesn't necessarily track the API release version. It's also easy to skip ahead versions in Horizon, so typically it leapfrogs ahead 2 or 3 releases and then waits for the API versions to catch up.


Staging the upgrade in puppet

There are relatively few version-specific files in puppet for Horizon. To get a list, run:

$ find modules/openstack/ -name "*horizon*"

Duplicate each file, renaming to match the desired install version (example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/852998). There should be no manifest changes needed.

Finally, you'll need a version change patch, like this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856663

Staging the upgrade in git

We install Horizon from source; most of the source is hosted locally in order to track diffs against upstream. The top-level repo is 'horizon/deploy', found here: https://gerrit.wikimedia.org/r/admin/repos/openstack/horizon/deploy

All other necessary components are stored in submodules. Most of those submodules are duplicates of corresponding upstream packages hosted at review.opendev.org. For upgrades, make a new branch in the 'deploy' repo and then rebase each submodule on the appropriate upstream branch, resolving conflicts as you go.

Ultimately for version X you should have a new upstream X branch for each submodule, and an upstream X branch for the 'repo' module that contains references to the proper version of each submodule. Once all that's put together, you need to rebuild your wheel subdir by running the 'make_wheels.sh' script. Be sure to build wheels on the same OS version that you plan to deploy on. Once the new wheels are built, commit them to a new X branch and update the deploy branch to refer to that wheels branch.
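
The per-submodule rebase might look something like the commands printed by the sketch below. Every convention in it is an assumption to check against the actual repos: a git remote named 'upstream' pointing at opendev, upstream branches named stable/<release>, and local branches named after the release. The function name rebase_plan is made up. It only prints commands; nothing is executed.

```shell
# Print the git commands for carrying local patches in one submodule from
# release OLD to release NEW. Assumed conventions (verify before use):
#   - remote "upstream" tracks the opendev repository
#   - upstream branches are named stable/<release>
#   - local branches are named <release>
rebase_plan() {
    path=$1; old=$2; new=$3
    cat <<EOF
git -C $path fetch upstream
git -C $path checkout -b $new $old
git -C $path rebase --onto upstream/stable/$new upstream/stable/$old
EOF
}
```

Using `rebase --onto` replays only the commits that are on the local branch but not on the old upstream branch, which should keep the conflict set down to our own patches.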

Now, you are ready to deploy.

Upgrading labweb nodes