Portal:Cloud VPS/Admin/Procedures and operations

From Wikitech

This page describes some standard admin procedures and operations for our Cloud VPS deployments.

Manual routing failover

In the old nova-network days, a very long procedure was required to manually fail over from a dead or under-maintenance network node (typically cloudnetXXXX).

Nowadays it is much simpler. This procedure assumes you want to move the active service from one node to the other:

Alternatively, you can use other neutron commands to manage agents.

At the time of writing, it is not known which method causes less network downtime.

Remove hypervisor

Follow this procedure to remove a virtualization server (typically cloudvirtXXXX|labvirtXXXX).

  • Remove or shut down the node
  • openstack hypervisor list will still show it
  • nova service-list will show it as down once it has been taken away:

| 9 | nova-compute | labtestvirt2003 | nova | disabled | down | 2017-12-18T20:52:59.000000 | AUTO: Connection to libvirt lost: 0 |

  • nova service-delete 9 will remove it, where the number is the id shown by nova service-list
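
The id lookup can be scripted. The following is a minimal sketch, assuming the column order of the example row above (id, binary, host, zone, status, state). The function name down_compute_ids is hypothetical. It only reads text on standard input and deletes nothing:

```shell
#!/bin/bash
# Hypothetical helper: print "<id> <host>" for every nova-compute service
# whose State column is "down". Reads `nova service-list` output on stdin.
# Assumed column order: Id, Binary, Host, Zone, Status, State, ...
down_compute_ids() {
    awk -F'|' '
        {
            # Strip surrounding whitespace from every field.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        # $1 is empty because every table row starts with "|".
        $3 == "nova-compute" && $7 == "down" { print $2, $4 }
    '
}
```

Possible usage: nova service-list | down_compute_ids, then review the list before passing any id to nova service-delete.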

VM/Hypervisor pinning

To run a specific VM on a specific hypervisor, pass the --availability-zone option at instance creation time, as in the following example:

user@cloudcontrol1005:~$ sudo wmcs-openstack server create --os-project-id testlabs --image debian-10.0-buster --flavor g2.cores1.ram2.disk20 --nic net-id=lan-flat-cloudinstances2b --property description='test VM' --availability-zone host:cloudvirt1022 mytestvm

Canary VM instance in every hypervisor

Each hypervisor should have a canary VM instance running.

The command to create/maintain them is:

user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1

To operate on just a couple of hypervisors:

user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1 --hostname-list cloudvirt1234 cloudvirt1235

Updating openstack database password

OpenStack uses many databases, and updating a database password requires several steps.

nova

We usually use the same password for the different nova databases: nova_eqiad1, nova_api_eqiad1 and nova_cell0_eqiad1.

  • in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update the profile::openstack::eqiad1::nova::db_pass hiera key in hieradata/eqiad/profile/openstack/eqiad1/nova.yaml.
  • in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update class passwords::openstack::nova in modules/passwords/manifests/init.pp.
  • in the openstack database (galera running in cloudcontrol nodes), update grants, something like:
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
  • repeat the grants for every cloudcontrol server IPv4 and IPv6 address.
  • update cell mapping database connection string (yes, inside the database itself) in any cloudcontrol server:
$ mysql nova_api_eqiad1;
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_eqiad1' where id=4;
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_cell0_eqiad1' where id=1;
  • run puppet everywhere (in cloudcontrol servers etc) so the new password is added to the config files.
  • if puppet is not restarting the affected services, restart them by hand (systemctl restart nova-api, etc)
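
Repeating the grants for every address is error-prone by hand. The following is a minimal sketch that only prints the statements for review and does not connect to the database. The function name nova_grants and the DB_PASS environment variable are assumptions made for illustration. A password containing a single quote would need escaping first:

```shell
#!/bin/bash
# Hypothetical helper: print the GRANT statements for every nova database
# and every client host given as an argument. Nothing is executed; the
# output is meant to be reviewed and then pasted into a mysql session.
# The password is read from the DB_PASS variable (an assumption).
nova_grants() {
    local db host
    for db in nova_api_eqiad1 nova_eqiad1 nova_cell0_eqiad1; do
        for host in "$@"; do
            printf "GRANT ALL PRIVILEGES ON %s.* TO 'nova'@'%s' IDENTIFIED BY '%s';\n" \
                "$db" "$host" "${DB_PASS:?set DB_PASS first}"
        done
    done
}
```

Possible usage, with 192.0.2.10 standing in for a real cloudcontrol address: DB_PASS=... nova_grants 192.0.2.10 '%'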

neutron

TODO: add information.

glance

TODO: add information.

designate

TODO: add information.

keystone

TODO: add information.

Rotating or revoking keystone fernet tokens

Should you need to rotate or revoke all keystone fernet tokens, follow this procedure:

  • on all cloudcontrol nodes:
rm -rf /etc/keystone/fernet-keys
  • on one cloudcontrol node:
keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
  • on each other cloudcontrol node:
rsync -a --delete rsync://<fqdn_of_the_host_where_you_ran_fernet_setup>/keystonefernetkeys/* /etc/keystone/fernet-keys/
  • on labweb/cloudweb hosts:
service memcached restart
service apache2 restart
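
Keystone can only validate tokens on every node if all nodes hold the same keys, so it may be worth comparing the key directories after the rsync step. The following is a minimal sketch, assuming sha256sum is available and key file names contain no whitespace. The function name dir_fingerprint is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print one checksum that summarises every file in a
# directory (relative names and contents). Identical output on two hosts
# means the directories hold identical files.
dir_fingerprint() {
    (
        cd "$1" &&
        find . -type f -print | LC_ALL=C sort | xargs sha256sum | sha256sum | cut -d' ' -f1
    )
}
```

Possible usage: run dir_fingerprint /etc/keystone/fernet-keys as root on each cloudcontrol node and compare the printed values.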

Fixing nova VM wrong state

There are a number of state-related fields for any given Nova VM. These fields can become wrong, corrupted or disconnected from reality for a number of reasons. If a VM gets into a wrong state, it may prevent other workflows from running. A common example is that a hypervisor cannot be drained while it hosts a VM in a wrong state.

Example of a wrong OS-EXT-STS:task_state (shelving), which prevents the VM from being operated at all:

user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]                                                   
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | shelving                                                                          |
| OS-EXT-STS:vm_state                 | stopped                                                                           |
[..]                                                   
| status                              | SHUTOFF                                                                           |
[..]                                                   
+-------------------------------------+-----------------------------------------------------------------------------------+

To get out of this deadlock you may:

  • force set the VM state to error
  • then, force set the VM state to active
  • try booting / rebooting to get to a correct ACTIVE or SHUTOFF state

Example:

user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state error b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | error                                                                             |
[..]
| status                              | ERROR                                                                             |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state active b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | active                                                                            |
[..]
| status                              | ACTIVE                                                                            |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server reboot --hard b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | rebooting_hard                                                                    |
| OS-EXT-STS:vm_state                 | active                                                                            |
| OS-SRV-USG:launched_at              | 2022-11-21T15:14:34.000000                                                        |
[..]
| status                              | HARD_REBOOT                                                                       |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]                                                   
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | rebooting_hard                                                                    |
| OS-EXT-STS:vm_state                 | active                                                                            |
[..]                                                   
| status                              | HARD_REBOOT                                                                       |
[..]                                                   
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ echo # wait a few moments, and finally
user@cloudcontrol1007:~ $ sudo wmcs-openstack server stop b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field                               | Value                                                                             |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state              | Shutdown                                                                          |
| OS-EXT-STS:task_state               | None                                                                              |
| OS-EXT-STS:vm_state                 | stopped                                                                           |
[..]
| status                              | SHUTOFF                                                                           |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
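
Checking the state fields between steps can be scripted rather than read by eye. The following is a minimal sketch that parses the table format shown above from standard input. The function names server_field and task_is_idle are hypothetical, and the sketch only inspects text and changes no state:

```shell
#!/bin/bash
# Hypothetical helper: print the value of one field from the table that
# `server show` prints. Reads the table on stdin; field name is $1.
server_field() {
    awk -F'|' -v want="$1" '
        NF >= 3 {
            name = $2; value = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == want) { print value; exit }
        }
    '
}

# Hypothetical helper: succeed only when no task is pending, that is when
# OS-EXT-STS:task_state is "None" as in the settled outputs above.
task_is_idle() {
    [ "$(server_field 'OS-EXT-STS:task_state')" = "None" ]
}
```

Possible usage: sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539 | task_is_idle && echo "no pending task"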

cloudvirt reboot

This procedure describes a safe cloudvirt hypervisor reboot without downtime.

TODO: expand what this means:

  • check the list of running VMs
  • run the cookbook wmcs.openstack.cloudvirt.safe_reboot
  • after the reboot, verify that nova sees the hypervisor as up and running
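
The last verification step could be checked from the nova service-list output. The following is a minimal sketch, assuming the legacy table layout with columns Id, Binary, Host, Zone, Status, State. The function name compute_is_up is hypothetical and it reads the table on standard input:

```shell
#!/bin/bash
# Hypothetical helper: succeed when the nova-compute service on host $1 is
# reported as enabled and up. Reads `nova service-list` output on stdin.
# Assumed column order: Id, Binary, Host, Zone, Status, State, ...
compute_is_up() {
    awk -F'|' -v host="$1" '
        {
            # Strip surrounding whitespace from every field.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        $3 == "nova-compute" && $4 == host {
            found = 1
            ok = ($6 == "enabled" && $7 == "up")
        }
        # Exit 0 only if the host was listed and is enabled and up.
        END { exit !(found && ok) }
    '
}
```

Possible usage: nova service-list | compute_is_up cloudvirtXXXX && echo "hypervisor is up". A host that is missing from the list counts as not up.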

See also