Portal:Cloud VPS/Admin/Procedures and operations
See also: Portal:Cloud_VPS/Admin/Maintenance
This page describes some standard admin procedures and operations for our Cloud VPS deployments.
Manual routing failover
In the old nova-network days, a very long procedure was required to manually fail over from a dead or under-maintenance network node (typically cloudnetXXXX).
Nowadays it is much simpler. This procedure assumes you want to move the active service from one node to the other:
Examples of neutron operations:
root@cloudcontrol1007:~# openstack router list
+--------------------------------------+---------------------+--------+-------+---------+-------------+------+
| ID | Name | Status | State | Project | Distributed | HA |
+--------------------------------------+---------------------+--------+-------+---------+-------------+------+
| d93771ba-2711-4f88-804a-8df6fd03978a | cloudinstances2b-gw | ACTIVE | UP | admin | False | True |
+--------------------------------------+---------------------+--------+-------+---------+-------------+------+
root@cloudcontrol1007:~# openstack network agent list --agent-type l3 --router d93771ba-2711-4f88-804a-8df6fd03978a --long
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary | HA State |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | L3 agent | cloudnet1006 | nova | :-) | UP | neutron-l3-agent | standby |
| 6a88c860-29fb-4a85-8aea-6a8877c2e035 | L3 agent | cloudnet1005 | nova | :-) | UP | neutron-l3-agent | active |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
user@cloudnet1005:~ $ sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service
root@cloudcontrol1007:~# openstack network agent list --agent-type l3 --router d93771ba-2711-4f88-804a-8df6fd03978a --long
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary | HA State |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | L3 agent | cloudnet1006 | nova | :-) | UP | neutron-l3-agent | active |
| 6a88c860-29fb-4a85-8aea-6a8877c2e035 | L3 agent | cloudnet1005 | nova | :-) | UP | neutron-l3-agent | standby |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
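To fail back (or once maintenance is done), start the agents again on the original node; it should rejoin as standby and can be made active the same way. A minimal sketch, mirroring the stop command above:
user@cloudnet1005:~ $ sudo systemctl start neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service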
At the time of this writing, it is not known which method produces less impact in terms of network downtime.
Remove hypervisor
Follow this procedure to remove a virtualization server (typically cloudvirtXXXX|labvirtXXXX).
- Remove or shut down the node.
- openstack hypervisor list will still show it. nova service-list will show it as down once it's taken away:
| 9 | nova-compute | labtestvirt2003 | nova | disabled | down | 2017-12-18T20:52:59.000000 | AUTO: Connection to libvirt lost: 0 |
- Run nova service-delete 9 to remove it, where the number is the id from nova service-list.
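On deployments where the legacy nova CLI is no longer available, the unified openstack client offers equivalent commands. A sketch (the service id comes from the list output, as above):
root@cloudcontrol1007:~# openstack compute service list --service nova-compute
root@cloudcontrol1007:~# openstack compute service delete 9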
VM/Hypervisor pinning
To run a specific VM on a specific hypervisor, pass the --availability-zone option at instance creation time, as in the following example:
user@cloudcontrol1005:~$ sudo wmcs-openstack server create --os-project-id testlabs --image debian-10.0-buster --flavor g2.cores1.ram2.disk20 --nic net-id=lan-flat-cloudinstances2b --property description='test VM' --availability-zone host:cloudvirt1022 mytestvm
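To verify the VM actually landed on the requested hypervisor, the admin-only OS-EXT-SRV-ATTR fields can be checked. A sketch, using the mytestvm instance created above:
user@cloudcontrol1005:~$ sudo wmcs-openstack server show mytestvm -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:hypervisor_hostname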
Canary VM instance in every hypervisor
Each hypervisor should have a canary VM instance running.
The command to create/maintain them is:
user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1
To operate on just a couple of hypervisors:
user@laptop:~$ cookbook wmcs.openstack.cloudvirt.lib.ensure_canary --deployment eqiad1 --hostname-list cloudvirt1234 cloudvirt1235
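To check that a given hypervisor actually hosts its canary, listing the VMs on that host should show it. A sketch (cloudvirt1234 is an illustrative hostname):
user@cloudcontrol1007:~ $ sudo wmcs-openstack server list --all-projects --host cloudvirt1234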
Updating openstack database password
OpenStack uses many databases, and updating a password requires several steps.
nova
We usually have the same password for the different nova databases, nova_eqiad1 and nova_api_eqiad1.
- in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update the profile::openstack::eqiad1::nova::db_pass hiera key in hieradata/eqiad/profile/openstack/eqiad1/nova.yaml.
- in the puppet private repo (in puppetmaster1001.eqiad.wmnet), update class passwords::openstack::nova in modules/passwords/manifests/init.pp.
- in the openstack database (galera running in cloudcontrol nodes), update grants, something like:
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_api_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'208.80.153.x' IDENTIFIED BY '<%= @db_pass %>';
GRANT ALL PRIVILEGES ON nova_cell0_eqiad1.* TO 'nova'@'%' IDENTIFIED BY '<%= @db_pass %>';
- repeat grants for every cloudcontrol server IP and IPv6 address.
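To double-check the resulting grants, a standard MariaDB query can be used, for example:
SELECT user, host FROM mysql.user WHERE user = 'nova';
SHOW GRANTS FOR 'nova'@'%';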
- update cell mapping database connection string (yes, inside the database itself) in any cloudcontrol server:
$ mysql nova_api_eqiad1
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_eqiad1' where id=4;
[nova_api_eqiad1]> update cell_mappings set database_connection='mysql://nova:<password>@openstack.eqiad1.wikimediacloud.org/nova_cell0_eqiad1' where id=1;
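To confirm the connection strings were updated, query the same table in that session (a sketch; column names per the nova_api schema):
[nova_api_eqiad1]> select id, name, database_connection from cell_mappings;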
- run puppet everywhere (in cloudcontrol servers etc) so the new password is added to the config files.
- if puppet is not restarting the affected services, restart them by hand (systemctl restart nova-api, etc.)
neutron
TODO: add information.
glance
TODO: add information.
designate
TODO: add information.
keystone
TODO: add information.
Rotating or revoking keystone fernet tokens
Should you need to rotate or revoke all keystone fernet tokens, follow this procedure:
- on all cloudcontrol nodes:
rm -rf /etc/keystone/fernet-keys
- on one cloudcontrol node:
keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
- on each other cloudcontrol node:
rsync -a --delete rsync://<fqdn_of_the_host_where_you_ran_fernet_setup>/keystonefernetkeys/* /etc/keystone/fernet-keys/
- on labweb/cloudweb hosts:
service memcached restart
service apache2 restart
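To verify that all cloudcontrol nodes ended up with identical keys after the rsync, comparing checksums across hosts is a quick sanity check:
sha256sum /etc/keystone/fernet-keys/*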
Fixing nova VM wrong state
There are a number of state-related fields for any given Nova VM. These fields can become wrong, corrupted, or disconnected from reality for a number of reasons. If a VM gets into a wrong state, it may prevent other workflows from running; a classic example is that a hypervisor cannot be drained if it hosts a VM in a wrong state.
Example of a wrong OS-EXT-STS:task_state (shelving) which will prevent the VM from being operated on at all:
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | shelving |
| OS-EXT-STS:vm_state | stopped |
[..]
| status | SHUTOFF |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
To get out of this deadlock you may:
- force set the VM state to error
- then, force set the VM state to active
- try booting / rebooting to get to a correct ACTIVE or SHUTOFF state:
Example:
user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state error b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | error |
[..]
| status | ERROR |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server set --state active b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
[..]
| status | ACTIVE |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server reboot --hard b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | rebooting_hard |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2022-11-21T15:14:34.000000 |
[..]
| status | HARD_REBOOT |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | rebooting_hard |
| OS-EXT-STS:vm_state | active |
[..]
| status | HARD_REBOOT |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
user@cloudcontrol1007:~ $ echo # wait a few moments, and finally
user@cloudcontrol1007:~ $ sudo wmcs-openstack server stop b5597836-8691-4d66-897a-3fac56cbc539
user@cloudcontrol1007:~ $ sudo wmcs-openstack server show b5597836-8691-4d66-897a-3fac56cbc539
+-------------------------------------+-----------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------------------------+
[..]
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | stopped |
[..]
| status | SHUTOFF |
[..]
+-------------------------------------+-----------------------------------------------------------------------------------+
cloudvirt reboot
This procedure describes a safe cloudvirt hypervisor reboot without downtime.
TODO: expand what this means:
- check the list of running VMs
- run the cookbook
wmcs.openstack.cloudvirt.safe_reboot
- after the reboot, verify that nova sees the hypervisor as up and running
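A sketch of the pre- and post-reboot checks (cloudvirt1022 is an illustrative hostname):
user@cloudcontrol1007:~ $ sudo wmcs-openstack server list --all-projects --host cloudvirt1022
user@cloudcontrol1007:~ $ sudo wmcs-openstack hypervisor list | grep cloudvirt1022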