
Portal:Cloud VPS/Admin/Procedures and operations

From Wikitech

This page describes some standard admin procedures and operations for our Cloud VPS deployments.

Manual routing failover

In the old nova-network days, a very long procedure was required to manually fail over from a dead or under-maintenance network node (typically cloudnetXXXX).

Nowadays it is much simpler. This procedure assumes you want to move the active service from one node to the other:

At the time of writing, it is not known which method causes less network downtime.
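The failover steps themselves are not reproduced here. As a hedged starting point, and assuming the cloudnet nodes run Neutron HA routers with one active and one standby L3 agent per router (an assumption, not stated on this page), the sketch below prints which network node currently hosts the active instance of a router. You can run it before and after moving the service. The function name active_l3_agent_host and the OPENSTACK override variable are hypothetical.

```shell
# Sketch (assumption: routers are Neutron HA routers with an active/standby
# L3 agent per router). OPENSTACK is a hypothetical override variable that
# defaults to the wrapper used elsewhere on this page.
: "${OPENSTACK:=sudo wmcs-openstack}"

# Print the host of the L3 agent whose HA state is "active" for a router.
active_l3_agent_host() {
    router="$1"
    # --long adds the "HA State" column when listing agents by router.
    $OPENSTACK network agent list --router "$router" --long \
        -f value -c Host -c "HA State" |
        awk '$2 == "active" { print $1 }'
}
```

For example, active_l3_agent_host <router-name-or-id> should print a different cloudnet host once the failover has completed.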

Remove hypervisor

Follow this procedure to remove a virtualization server (typically cloudvirtXXXX or labvirtXXXX).

  • Remove or shut down the node.
  • openstack hypervisor list will still show it.
  • nova service-list will show it as down once it has been taken away:

| 9 | nova-compute | labtestvirt2003 | nova | disabled | down | 2017-12-18T20:52:59.000000 | AUTO: Connection to libvirt lost: 0 |

  • nova service-delete 9 removes it, where 9 is the service ID shown by nova service-list.
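When nova service-list prints many rows, the ID can be extracted from the table instead of being read by eye. This is a minimal sketch that assumes the column order of the example row above (Id, Binary, Host, Zone, Status, State, ...). The function name find_down_compute_service_id is hypothetical. As a guard against deleting a live hypervisor, it only prints an ID when the nova-compute service for that host is reported as down.

```shell
# Sketch: print the ID of a *down* nova-compute service for a given host,
# parsed from the table printed by `nova service-list` (read on stdin).
# Assumed columns: | Id | Binary | Host | Zone | Status | State | ...
# The function name is hypothetical.
find_down_compute_service_id() {
    host="$1"
    awk -F'|' -v host="$host" '
        {
            # Trim the padding around every cell.
            for (i = 1; i <= NF; i++) {
                gsub(/^ +/, "", $i)
                gsub(/ +$/, "", $i)
            }
        }
        # $1 is the empty text before the first "|", so Id is $2.
        $3 == "nova-compute" && $4 == host && $7 == "down" { print $2 }'
}
```

For example, run id=$(nova service-list | find_down_compute_service_id labtestvirt2003). Check that $id is non-empty and is the value you expect, and only then run nova service-delete "$id".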

VM/Hypervisor pinning

To run a specific VM on a specific hypervisor, pass the --availability-zone host:<hypervisor> option when creating the instance, as in the following example:

user@cloudcontrol1005:~$ sudo wmcs-openstack server create --os-project-id testlabs --image debian-10.0-buster --flavor g2.cores1.ram2.disk20 --nic net-id=lan-flat-cloudinstances2b --property description='test VM' --availability-zone host:cloudvirt1022 mytestvm
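After creation, you can confirm that the scheduler honoured the pin. This is a minimal sketch that assumes admin credentials through the same sudo wmcs-openstack wrapper. It reads the OS-EXT-SRV-ATTR:host field that server show displays to admins. The OPENSTACK override variable and the function name vm_is_on_host are hypothetical.

```shell
# Sketch: succeed only if the VM runs on the expected hypervisor.
# OPENSTACK is a hypothetical override variable for the CLI wrapper.
: "${OPENSTACK:=sudo wmcs-openstack}"

vm_is_on_host() {
    vm="$1"
    expected="$2"
    # -f value -c prints just the field value, without the table borders.
    actual=$($OPENSTACK server show -f value -c OS-EXT-SRV-ATTR:host "$vm") || return 1
    [ "$actual" = "$expected" ]
}
```

For example, vm_is_on_host <vm-id> cloudvirt1022 && echo pinned. Passing the VM ID printed by server create is safer than passing the name, because the VM belongs to another project (testlabs).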

Canary VM instance in every hypervisor

Each hypervisor should have a canary VM instance running.

The command to create/maintain them is:

user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1

To operate on just a couple of hypervisors:

user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1 --hostname-list cloudvirt1234 cloudvirt1235

Updating openstack database password

OpenStack uses many databases, and updating a password requires several steps.


The nova databases (nova_eqiad1, nova_api_eqiad1 and nova_cell0_eqiad1) usually share the same password.

  • in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update the profile::openstack::eqiad1::nova::db_pass hiera key in hieradata/eqiad/profile/openstack/eqiad1/nova.yaml.
  • in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update class passwords::openstack::nova in modules/passwords/manifests/init.pp.
  • in the OpenStack database (Galera, running on the cloudcontrol nodes), update the grants with something like:
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<password>';
  • repeat the grants for every cloudcontrol server IPv4 and IPv6 address.
  • update the cell mapping database connection strings (yes, inside the database itself) on any cloudcontrol server:
$ mysql nova_api_eqiad1
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_eqiad1' where id=4;
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_cell0_eqiad1' where id=1;
  • run puppet everywhere (on cloudcontrol servers, etc.) so the new password is added to the config files.
  • if puppet does not restart the affected services, restart them by hand (systemctl restart nova-api, etc.).
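The step of repeating the grants for every cloudcontrol address can be scripted. This is a minimal sketch that assumes the three nova database names used in the grants above. The function name gen_nova_grants is hypothetical, and the addresses are whatever you pass in. It only prints SQL, so you can review the output before feeding it to mysql. Single quotes in the password are escaped. A password containing backslashes would need extra escaping.

```shell
# Sketch: print GRANT statements for the nova databases, for every
# cloudcontrol address given as an argument plus the '%' wildcard host.
# Usage (hypothetical helper): gen_nova_grants NEW_PASSWORD ADDR [ADDR...]
gen_nova_grants() {
    # Double any single quote so it is valid inside an SQL string literal.
    pw=$(printf '%s' "$1" | sed "s/'/''/g")
    shift
    for db in nova_api_eqiad1 nova_eqiad1 nova_cell0_eqiad1; do
        for addr in "$@" '%'; do
            printf "GRANT ALL PRIVILEGES ON %s.* TO 'nova'@'%s' IDENTIFIED BY '%s';\n" \
                "$db" "$addr" "$pw"
        done
    done
}
```

For example, run gen_nova_grants "$NEW_PASS" <ipv4> <ipv6> ... for every cloudcontrol address, review the output, then pipe it into mysql on a cloudcontrol node. Afterwards, nova-manage cell_v2 list_cells --verbose displays the cell database connection strings, which is one way to double-check the cell_mappings update.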


TODO: add information.

Rotating or revoking keystone fernet tokens

Should you need to rotate the keystone fernet keys or revoke all keystone fernet tokens, follow this procedure. Generating a new key set invalidates every existing fernet token.

  • on all cloudcontrol nodes:
rm -rf /etc/keystone/fernet-keys
  • on one cloudcontrol node:
keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
  • on each other cloudcontrol node:
rsync -a --delete rsync://<fqdn_of_the_host_where_you_ran_fernet_setup>/keystonefernetkeys/* /etc/keystone/fernet-keys/
  • on labweb/cloudweb hosts:
service memcached restart
service apache2 restart
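After the rsync, every cloudcontrol node should hold an identical key set. One way to confirm this is to compute a single digest of the key directory on each node and compare the values. This is an extra check, not part of the procedure above. The function name fernet_keys_digest is hypothetical, and the sketch relies on sha256sum from GNU coreutils.

```shell
# Sketch: print one digest summarising all fernet keys in a directory
# (default: /etc/keystone/fernet-keys). Hypothetical helper name.
fernet_keys_digest() {
    dir="${1:-/etc/keystone/fernet-keys}"
    # Hash every key file (file names are included in the listing);
    # fail if the directory is missing or empty.
    listing=$(cd "$dir" && sha256sum -- *) || return 1
    # Hash the listing itself to get a single comparable value.
    printf '%s\n' "$listing" | sha256sum | cut -d' ' -f1
}
```

Run it on every cloudcontrol node. It may need root, since the key directory is normally restricted to the keystone user. All nodes should print the same value.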

Fixing nova VM wrong state

There are a number of state-related fields for any given Nova VM. These fields can become wrong, corrupted or out of sync with reality for a number of reasons. A VM in a wrong state may prevent other workflows from running. A typical example is that a hypervisor cannot be drained while it hosts a VM in a wrong state.

Example of a VM stuck with OS-EXT-STS:task_state set to shelving, which prevents any operation on the VM:

user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | shelving                                                                          |
| OS-EXT-STS:vm_state                 | stopped                                                                           |
| status                              | SHUTOFF                                                                           |

To get out of this deadlock, you can:

  • force set the VM state to error
  • then, force set the VM state to active
  • try booting / rebooting to get to a correct ACTIVE or SHUTOFF state:


user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state error b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | error                                                                             |
| status                              | ERROR                                                                             |
user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state active b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | active                                                                            |
| status                              | ACTIVE                                                                            |
user@cloudcontrol1007:~ $ sudo wmcs-openstack server reboot --hard b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | rebooting_hard                                                                    |
| OS-EXT-STS:vm_state                 | active                                                                            |
| OS-SRV-USG:launched_at              | 2022-11-21T15:14:34.000000                                                        |
| status                              | HARD_REBOOT                                                                       |
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | rebooting_hard                                                                    |
| OS-EXT-STS:vm_state                 | active                                                                            |
| status                              | HARD_REBOOT                                                                       |
user@cloudcontrol1007:~ $ echo # wait a few moments, and finally
user@cloudcontrol1007:~ $ sudo wmcs-openstack server stop b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
| Field                               | Value                                                                             |
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | stopped                                                                           |
| status                              | SHUTOFF                                                                           |
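The error-then-active reset and the "wait a few moments" step in the session above can be wrapped in shell functions. This is a minimal sketch that assumes the same sudo wmcs-openstack wrapper. The OPENSTACK override variable and both function names are hypothetical. The final reboot or stop is deliberately left manual, because whether the VM should end up ACTIVE or SHUTOFF depends on the case.

```shell
# Sketch: helpers for un-sticking a VM. OPENSTACK is a hypothetical
# override variable for the CLI wrapper.
: "${OPENSTACK:=sudo wmcs-openstack}"

# Force the VM to error, then to active; this clears a stuck task_state.
reset_vm_state() {
    vm="$1"
    $OPENSTACK server set --state error "$vm" || return 1
    $OPENSTACK server set --state active "$vm" || return 1
}

# Poll until task_state is None; give up after TRIES polls DELAY seconds apart.
# Usage: wait_task_state_none VM [TRIES] [DELAY]
wait_task_state_none() {
    vm="$1"; tries="${2:-30}"; delay="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        state=$($OPENSTACK server show -f value -c OS-EXT-STS:task_state "$vm") || return 1
        [ "$state" = "None" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, run reset_vm_state <vm-id>, issue the reboot or stop by hand, and then use wait_task_state_none <vm-id> to wait for the task to clear before running server show again.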

cloudvirt reboot

This procedure describes a safe cloudvirt hypervisor reboot without downtime.

TODO: expand what this means:

  • check the list of running VMs
  • run the cookbook wmcs.openstack.cloudvirt.safe_reboot
  • after the reboot, verify that nova sees the hypervisor as up and running
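The first and last checks in the list above can be sketched as follows. The sketch assumes admin credentials through the sudo wmcs-openstack wrapper. The OPENSTACK override variable and the function names are hypothetical. The cookbook itself is not wrapped, because its arguments are not documented here.

```shell
# Sketch: pre/post-reboot checks for a hypervisor. OPENSTACK is a
# hypothetical override variable for the CLI wrapper.
: "${OPENSTACK:=sudo wmcs-openstack}"

# List the VMs on a hypervisor (ID, name, status), across all projects.
list_vms_on_hypervisor() {
    $OPENSTACK server list --all-projects --host "$1" -f value -c ID -c Name -c Status
}

# Succeed only if nova reports the host's nova-compute service as up.
compute_service_is_up() {
    out=$($OPENSTACK compute service list --service nova-compute --host "$1" \
        -f value -c State) || return 1
    [ "$out" = "up" ]
}
```

For example, save the output of list_vms_on_hypervisor cloudvirtXXXX before running the cookbook, then run compute_service_is_up cloudvirtXXXX afterwards and compare the VM list again.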

See also