This page needs a refresh. See Wikimedia_Cloud_Services_team/Alerts for urgent issues, or Portal:Cloud_VPS/Admin/Runbooks for specific alert procedures

This page is for troubleshooting urgent issues. Routine maintenance tasks are documented on the VPS_Maintenance page.

Networking failures

Specific issues regarding base networking.

The network is deployed following the model/architecture described in the Neutron main article.

neutron agents alive/dead

We detected that in our Mitaka-based deployments there could be some issues with the RabbitMQ setup that results in dropped messages.

One of the most visible effects is neutron agents (l3, linuxbridge, metadata, etc) go dead/alive:

root@cloudcontrol1007:~# watch openstack network agent list
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host               | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| 0132a925-68a1-4efc-a495-f5e064dd2742 | Linux bridge agent | cloudvirt1049      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 10445b2d-a083-43d8-b04f-0ca347ed00b4 | Linux bridge agent | cloudvirt1053      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 165a5fa5-9e27-433d-911e-1bba97cf9a93 | Linux bridge agent | cloudvirt1064      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 198c4e26-a835-4f31-863e-dcea5b462ddf | Linux bridge agent | cloudvirt1051      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 1a3ec082-7221-4002-82d4-464b52c2634c | Linux bridge agent | cloudvirtlocal1002 | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 1c394148-02d0-4a06-892c-f5cec29ef0b0 | Linux bridge agent | cloudvirt1037      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 1dca6733-29df-465e-bca7-a69dec7b07e6 | Linux bridge agent | cloudvirt1067      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 20854226-8d18-4d1f-8da3-a68a8fd4dc9f | Linux bridge agent | cloudvirt1033      | None              | :-)   | UP    | neutron-linuxbridge-agent |
| 276593cf-b604-4a1a-9a89-6ad12d5aff02 | Linux bridge agent | cloudvirt1054      | None              | :-)   | UP    | neutron-linuxbridge-agent |
[...]

A restart of the RabbitMQ service in the control server should help fix things:

root@cloudcontrol1003:~# systemctl stop rabbitmq-server.service ; systemctl start rabbitmq-server.service
[...]

duplicated packets in the network

This is very likely caused because there are 2 neutron l3-agents running in active mode.

Symptoms are usually:

asymmetric routing
DUP! pings
unreachable bastions and/or instances

Can be easily checked running this command:

root@cloudcontrol1007:~# openstack network agent list --agent-type l3 --router d93771ba-2711-4f88-804a-8df6fd03978a --long
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| ID                                   | Agent Type | Host         | Availability Zone | Alive | State | Binary           | HA State |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+
| 3f54b3c2-503f-4667-8263-859a259b3b21 | L3 agent   | cloudnet1006 | nova              | :-)   | UP    | neutron-l3-agent | standby  |
| 6a88c860-29fb-4a85-8aea-6a8877c2e035 | L3 agent   | cloudnet1005 | nova              | :-)   | UP    | neutron-l3-agent | active   |
+--------------------------------------+------------+--------------+-------------------+-------+-------+------------------+----------+

If two of them are active, then you have to disable one, following the procedured described in details in the procedure for manual routing failover, but summarized here (try them in order until you get to a working state):

root@cloudcontrol1003:~# neutron agent-update <l3-agent uuid> --admin-state-down

user@cloudnet1004:~$ sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service

```
user@cloudnet1004:~$ sudo reboot
```

network down for VMs in a cloudvirt (the agents case)

We observed that from time to time, neutron agents can be in a crazy loop of going up/down, resyncing with the server and then going down again.

Symptoms are usually:

one or more cloudvirt lost connectivity for the VMs running on it, in cycles (i.e, network is working, and next minute is not).
networking in CloudVPS is inconsistent in general.
neutron agents going alive/dead randomly

Unfortunately, we don't know a solution for this other than restart stuff and hope it comes alive well. We observed some issues in message queues (rabbitmq), which points towards dropped or unreplied messages. To counter this, we tunned heartbeat messages and timeouts trying to get the smarter version of neutron we can. This is probably a bug in the Mitaka suite and fixed in future versions.

See Neutron agents alive/dead.

network down for VMs in a cloudvirt (the VLAN case)

The other common case is issues with the physical switches in the rack. Each cloudvirt/cloudnet box need a concrete VLAN/trunking configuration in each interface to work. Also, switches need to be trunked between them, so VLANs are connected between switches.

Symptoms are usually:

there has been some work recently on switches or physical equipment near this server
there has been a rename/install/reimage related to this server
if you do a tcpdump in the affected NIC (usually eth1), you will see packets flowing in outbound direction only

Please talk to DC-ops or Traffict them to re-check switching configuration, as described in the main Neutron article.

VM cannot communicate with other VMs in different cloudvirt (the ARP table case)

This only happened once so far but it's documented here just in case it happens again (phab:T209166).

If a VM can communicate with everything except for a few other VMs running in a different cloudvirt, that could mean the physical switch that is connecting the cloudvirts might be confused about where to forward packets. This is a pure L2 networking issue.

To confirm this is really the case, make a list of all the other VMs that this VM cannot talk to. Narrow it down to the cloudvirt(s) that are experiencing this issue.

Next, use tcpdump at each layer to confirm that packets are leaving one VMs and not making their way to the destination VM (maybe while ping is running between both VMs):

Source VM: tcpdump -n -i eth0 icmp or arp
Source cloudvirt bridge: tcpdump -n -e -i br* '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
Source cloudvirt eth1.1105 ("wire"): tcpdump -n -e -i eth1.1105 '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'

And then on the destination cloudvirt:

Destination cloudvirt eth1.1105 ("wire"): tcpdump -n -e -i eth1.1105 '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
Destination cloudvirt bridge: tcpdump -n -e -i br* '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
Destination VM: tcpdump -n -i eth0 icmp or arp

Hopefully, the tcpdump rules will prevent too many packets from flying through the screen and you'll be able to tell where things are breaking.

You can also run tcpdump on the tap-* interface that belongs to each VM (use virsh dumpxml to discover the instance/tap/instance_name combination).

If you see that packets are leaving the source VM, arriving on the destination VM, being replied by the destination VM but never back it back to the source VM; that could be the issue.

In that case, contact the networking staff and ask them to remove the source/destination MAC addresses from the switch's ARP table (even if they look correct). For example, <asw2-b-eqiad> clear ethernet-switching table fa:16:3e:91:a2:93

Network communication between cloudvirts involves:

Each VM being a QEMU process and having a tap-* to talk to the outside world
This tap-* interface being added to a bridge on the cloudvirt
The VLAN 1105 interface being part of the bridge as well (so packets can leave the cloudvirt and go to other cloudvirts)
The switch between cloudvirts
OpenStack security groups (that get translated to iptables rules)

Notice it doesn't involve the cloudnet hosts because communication is internal to the cluster.

nf_conntrack table full in cloudnet nodes

The nf_conntrack table has a couple of parameters that controls how many memory network traffic can use in the server. Is a potential remote DoS factor, so this value should be properly monitored and adjusted.

If there is an icinga alert Check nf_conntrack usage in neutron netns, specifically nf_conntrack usage over 80% in netns qrouter-xxxxxxxxx, this means the neutron virtual router might be dropping network packets.

First, check the icinga alert is true by doing the following:

run the check script by hand in the affected cloudnet server, and see if this is a check script failure. An OK debug output is showed as example:

root@cloudnet1003:~# /usr/local/sbin/nrpe-neutron-conntrack -d
DEBUG: magic usage value is 80% (hardcoded)
DEBUG: detected netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a
DEBUG: detected netns qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4
DEBUG: qrouter-d93771ba-2711-4f88-804a-8df6fd03978a net.netfilter.nf_conntrack_count 159997
DEBUG: qrouter-d93771ba-2711-4f88-804a-8df6fd03978a net.netfilter.nf_conntrack_max 1048576
DEBUG: qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4 net.netfilter.nf_conntrack_count 0
DEBUG: qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4 net.netfilter.nf_conntrack_max 1048576
DEBUG: 15.258502960205078% usage in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a
DEBUG: not evaluating netns qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4
OK: everything is apparently fine

check prometheus/grafana to see if this was something transient or if this is for real. Usually: https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=1m but can also check Portal:Cloud_VPS/Admin/Monitoring for other potential dashboards.

check values by hand. If count is close to max, then we have problems:

root@cloudnet1003:~# ip netns list
qrouter-d93771ba-2711-4f88-804a-8df6fd03978a (id: 1)
qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4 (id: 0)
root@cloudnet1003:~# ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 1048576
root@cloudnet1003:~# ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 167515

The values are managed via puppet, but for quick first aid, you can update the max by hand:

root@cloudnet1003:~# ip netns list
qrouter-d93771ba-2711-4f88-804a-8df6fd03978a (id: 1)
qdhcp-7425e328-560c-4f00-8e99-706f3fb90bb4 (id: 0)
root@cloudnet1003:~# ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a sysctl -w net.netfilter.nf_conntrack_max=10485770
net.netfilter.nf_conntrack_max = 10485770
root@cloudnet1003:~# ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 10485770

once the quick first aid was done, you can check why this is happening.
- Are we under high network load? If so, consider increasing the max value. Example patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/655407
- Were the values correctly set? We have experienced some issues in the past when settings the values via puppet. See phabricator T257552 - cloudnet: prometheus node exporter stopped collecting nf_conntrack_entries metric for reference.

Galera

Galera related troubleshooting.

Galera won't start up

Galera runs an instance on each of the cloudcontrol nodes. If any one of them restarts it should be able to rejoin the existing cluster without trouble. If the whole cluster is down, galera will refuse to start up again to avoid split-brain. This manifests as mysqld not starting, and a log message like:

root@cloudcontrol2001-dev:~# journalctl -u mariadb -f
[..]
WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
[..]

To restart the cluster from scratch (from the primary node, generally the lowest-number cloudcontrol), first tell Galera that it's ok to create a new cluster by editing /srv/sqldata/grastate.dat (it was /var/lib/mysql/grastate.dat on Buster):

[..]
safe_to_bootstrap: 1

Then, bootstrap the new cluster:

root@cloudcontrol2001-dev:~# galera_new_cluster
[..]
root@cloudcontrol2001-dev:~# systemctl start mariadb

Finally, make sure that we can't re-bootstrap on top of the now-running cluster by changing /srv/sqldata/grastate.dat back to how it was:

[..]
safe_to_bootstrap: 0

You can now check that mariadb starts correctly in the whole cluster:

user@cumin2001:~ $ sudo cumin --force -x A:cloudcontrol-codfw1dev 'systemctl status mariadb'
3 hosts will be targeted:
cloudcontrol[2001,2003-2004]-dev.wikimedia.org
===== NODE GROUP =====                           
[..]

Galera backup recovery

In case of complete catastrophe, the DBs hosted on Galera are backed up daily by Bacula. Backups can be recovered following the Bacula#Restore_(aka_Panic_mode) guide. Restoring will provide you with a file.sql.gz dump of each database.

Backups are run on each node. If everything has been going well, the backups should be almost entirely redundant; when restoring choose the most-recently-backed-up host. If, on the other hand, we've suffered a split-brain, you will need to make a choice about which node to choose as the winner, i.e, the backup from which node to restore.

Once the mysql service is running in at least one server, follow the example commands to restore a given database:

root@cloudcontrol2001-dev:~# gunzip /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql.gz
root@cloudcontrol2003-dev:~# mysql -u root -e "CREATE DATABASE IF NOT EXISTS glance;"
root@cloudcontrol2001-dev:~# mysql -u root glance < /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql
root@cloudcontrol2003-dev:~# mysql -u root -e "USE glance; SHOW TABLES;"
+----------------------------------+
| Tables_in_glance                 |
+----------------------------------+
| alembic_version                  |
| image_locations                  |
| image_members                    |
| image_properties                 |
| image_tags                       |
| images                           |
| metadef_namespace_resource_types |
| metadef_namespaces               |
| metadef_objects                  |
| metadef_properties               |
| metadef_resource_types           |
| metadef_tags                     |
| migrate_version                  |
| task_info                        |
| tasks                            |
+----------------------------------+

Nova-fullstack

Nova-fullstack is a testing agent that runs on one of the nova nodes, currently cloudcontrol1003. It periodically creates a new VM, waits for it to launch and run puppet, then verifies the dns record and tests ssh access.

If a VM launches but is for any reason unreachable after the specified timeout, fullstack gives up and leaves the VM behind for future research. If a large number (currently 6) VMs are leaked like this, an alert will fire. Icinga will raise a warning if a smaller number of leaks are detected. Note that these leaked instances might be the result of an ongoing failure, or they might be evidence of a rare/intermittent failure that has accumulated leaks over many weeks or months. In the latter case it's standard practice to just delete the leaked VMs to clear the warning, as often these leaks are a result of normal maintenance.

Nova-fullstack uses the 'admin-monitoring' openstack project. To see the list of leaked VMs:

root@cloudcontrol1003:~# source /usr/local/bin/observerenv.sh
root@cloudcontrol1003:~# OS_PROJECT_ID=admin-monitoring openstack server list
+--------------------------------------+-----------------------+--------+----------------------------------------------------+
| ID                                   | Name                  | Status | Networks                                           |
+--------------------------------------+-----------------------+--------+----------------------------------------------------+
| 5a4eb9eb-0977-4b16-b0a3-b86ca7027771 | fullstackd-1561665242 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.42              |
| 26d3b3cf-7392-455e-8afc-5085762b5045 | fullstackd-1561661295 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.37, 172.16.3.38 |
+--------------------------------------+-----------------------+--------+----------------------------------------------------+

Nova-fullstack also logs to syslog.

nova instance creation test

run:

python3 /usr/local/sbin/nova-fullstack

from reporting node to see error

keyfile cannot be read

@cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 654, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "verify_args", "labels": {"test_hostname": "test-create-20220829232357"}, "message": "keyfile  cannot be read", "process.name": "MainProcess", "process.thread.id": 1143647, "process.thread.name": "MainThread", "timestamp": "2022-08-29T23:23:57.697577"}

TODO: fill in what this means and how to repair it. Unclear on what keyfile.

Supporting component failures

Issues with other components in Cloud VPS and in the deeps of openstack.

instance DNS failure

Fail-over

There are two designate/pdns nodes: labservices1001 and labservices1002. The active node is determined in Hiera by a few settings:

labs_certmanager_hostname: <primary designate host, generally labservices1001>
labs_designate_hostname: <primary designate host>
labs_designate_hostname_secondary: <other designate host, generally labservices1002>

In order to switch to a new primary designate host, change the $labs_designate_hostname and $labs_certmanager_hostname settings. That's not enough, though! Powerdns will reject dns updates from the new server due to it not being the master, which will result in syslog messages like this:

    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for eqiad.wmflabs from 208.80.155.117 which is not a master

To change this, change the master in the pdns database:

$ ssh m5-master.eqiad.wmnet
$ sudo su -
# mysql pdns
MariaDB MISC m5 localhost pdns > select * from domains;
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    | id | name               | master              | last_check | type  | notified_serial | account        | designate_id  
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    |  1 | eqiad.wmflabs      | 208.80.155.117:5354 | 1448252102 | SLAVE |            NULL | noauth-project | 114f1333c2c144
    |  2 | 68.10.in-addr.arpa | 208.80.155.117:5354 | 1448252099 | SLAVE |            NULL | noauth-project | 8d114f3c815b46
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=1;
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=2;

Typically the dns server labs-ns2.wikimedia.org is associated with the primary designate server, and labs-ns3.wikimedia.org with the secondary. You will need to make appropriate hiera changes to modify those as well.

User workflow issues

Issues that affects common usage/workflows by end users.

incorrect quota violations

Nova is not great at tracking resource usage in projects, and sometimes sometimes will fail to reset usage values when resources are freed. That can mean that nova will sometimes report a quota violation and refuse to allocate new resources even though you can see perfectly well that the project isn't actually over quota.

The symptom is typically that new instance creation fails without warning or explanation. The quota error shows up in 'openstack server show <failed instance>':

Failed to allocate the network(s) with error Maximum number of fixed ips exceeded, not rescheduling

Here's an example of what that looks like in the database:

root@MISC m5[nova]> select * from quota_usages where project_id='contintcloud';
+---------------------+---------------------+------------+------+--------------+-----------------+--------+----------+---------------+---------+-----------------+
| created_at          | updated_at          | deleted_at | id   | project_id   | resource        | in_use | reserved | until_refresh | deleted | user_id         |
+---------------------+---------------------+------------+------+--------------+-----------------+--------+----------+---------------+---------+-----------------+
| 2015-01-08 14:48:45 | 2015-01-08 14:48:45 | NULL       | 1218 | contintcloud | security_groups |      0 |        0 |          NULL |       0 | yuvipanda       |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1385 | contintcloud | instances       |     11 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1386 | contintcloud | ram             |  45056 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1387 | contintcloud | cores           |     22 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:33 | 2018-03-03 13:21:36 | NULL       | 1388 | contintcloud | fixed_ips       |     16 |      193 |          NULL |       0 | NULL            |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1389 | contintcloud | instances       |      0 |        0 |          NULL |       0 | hashar          |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1390 | contintcloud | ram             |      0 |        0 |          NULL |       0 | hashar          |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1391 | contintcloud | cores           |      0 |        0 |          NULL |       0 | hashar          |
| 2015-06-17 20:40:55 | 2015-06-17 20:40:55 | NULL       | 1525 | contintcloud | floating_ips    |      0 |        0 |          NULL |       0 | NULL            |
| 2015-10-20 14:23:11 | 2016-08-10 23:59:24 | NULL       | 1812 | contintcloud | security_groups |      0 |        0 |          NULL |       0 | hashar          |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2504 | contintcloud | instances       |      0 |        0 |          NULL |       0 | novaadmin       |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2505 | contintcloud | ram             |      0 |        0 |          NULL |       0 | novaadmin       |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2506 | contintcloud | cores           |      0 |        0 |          NULL |       0 | novaadmin       |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2590 | contintcloud | instances       |      0 |        0 |          NULL |       0 | andrew          |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2591 | contintcloud | ram             |      0 |        0 |          NULL |       0 | andrew          |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2592 | contintcloud | cores           |      0 |        0 |          NULL |       0 | andrew          |
+---------------------+---------------------+------------+------+--------------+-----------------+--------+----------+---------------+---------+-----------------+

If you're sure that this quota violation is in error, it's possible to force nova to re-calculate the usages by setting the value to -1 in the database:

root@MISC m5[nova]> update quota_usages set reserved='-1' where project_id='contintcloud';

or

root@MISC m5[nova]> update quota_usages set in_use='-1' where project_id='contintcloud';

The 'reserved' and 'in_use' values are dynamically generated so it's safe to set them this way; a value of -1 will force a recalculation the next time a quota is checked.

Disable VM scheduling

If we run out of virt capacity or VM creation is causing problems, we can prevent any new VMs from being created by emptying the scheduler pool in hiera:

diff --git a/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml b/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
index c68e789..1394efd 100644
--- a/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
+++ b/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
@@ -35,10 +35,4 @@ profile::openstack::eqiad1::nova::physical_interface_mappings:
 # cloudvirtanXXXX: reserved for gigantic cloud-analytics worker nodes
 #
 #
-profile::openstack::eqiad1::nova::scheduler_pool:
-  - cloudvirt1013
-  - cloudvirt1025
-  - cloudvirt1026
-  - cloudvirt1027
-  - cloudvirt1028
-  - cloudvirt1029
+profile::openstack::eqiad1::nova::scheduler_pool: []

Instance Troubleshooting

Some notes on troubleshooting VMs in Cloud VPS.

Reset state of an instance

You might have to do this if the actual state of the instance doesn't seem to correspond to reality (it says REBOOT or SHUTOFF when it isn't, or vice versa), or if nova isn't responding to any commands at all about a particular instance.

wmcs-openstack server set --state active <uuid>

This changes the state of the instance with the uuid to 'ACTIVE', and hopefull fixes things (or blows up a baby panda, unsure!)

Block Migration

Because we don't use shared storage for instance volumes, true live-migration is not available. Block migration works pretty well, though -- it causes a brief (minute or two) interruption to an instance but does not register as a reboot, and most running services should survive a block migration without any complaint.

This is useful for rebalancing when a compute node is overloaded, or for evacuating instances from a failing node.

On the nova controller (e.g. virt1000):

   source /root/novaenv.sh
   nova live-migration --block-migrate <instanceid> <targethost>

You can check the status of a migrating instance with 'nova show <instanceid>'. Its status will show as 'migrating' until the migration is complete.

NOTE: There is one serious bug in the block-migrate feature in Havana. The migrate process attempts to check quotas on the target node, but ignores overprovision ratios. That means that the nova scheduler will frequently fill a host to the point where it can no longer accept live migrations. Because of this bug it will probably be necessary to keep two empty compute nodes in order to support complete evacuation of any one node.

Recompress a live-migrated instance

In Nova icehouse (and possibly later versions) a block migrate removes the copy-on-write elements of the instance, causing it to take up vastly more space on the new host. The instance can be recompressed if you stop it first (at which point you might as well have used wmcs-cold-migrate in the first place.) Here's an example of recompressing:

 andrew@labvirt1002:~$ sudo su -
 root@labvirt1002:~# cd /var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Make sure that the instance is STOPPED with 'nova stop'
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# mv disk disk.bak
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# qemu-img convert -f qcow2 -O qcow2 disk.bak disk
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Restart instance, make sure it is working.
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# rm disk.bak

Fixing an instance that won't reboot

Occasionally an instance my fail to reboot. You can usually solve this by using reboot via nova, but occasionally that fails as well. You can force a reboot by "destroying" the instance then telling nova to reboot the instance. This causes nova to "create" the instance. Of course, "destroy" and "create" really just kill the kvm process and start the process. You should not "delete" or "terminate" the instance.

To force reboot the instance, do the following:

Figure out which host the instance is running on
Destroy the instance (<instance-id> can be found via virsh list):
virsh destroy <instance-id>
If you see an error like below, then you'll need to restart the libvirt-bin process, then try the destroy
Timed out during operation: cannot acquire state change lock
Tell nova to reboot the instance via "reboot"

Root console access

Most debian VMs have a root console that can be accessed as root on a cloudvirt.

First, determine the host and nova ID of the VM in question. This is available via OpenStack Browser or from 'openstack server show'. Log on to the cloudvirt that hosts the instance, and then:

$ sudo su -
# virsh console <instance-id>
 Connected to domain i-000045b2
 Escape character is ^]

 root@consoletests:~# pwd
 /root

BE AWARE that this console is the same console that is logged to the horizon 'console' tab. Anything you type there will be visible to any project member. Be careful not to type visible passwords or other secure data on the console.

Older VMs have their console on serial1:

$ sudo su -
# virsh console --devname serial1 <instance-id>
 Connected to domain i-000045b2
 Escape character is ^]

 root@consoletests:~# pwd
 /root

Note that the console does not start up until after the firstboot script of a VM has completed; tty setup happens after that.

Use CTRL + 5 to exit the console.

Mounting an instance disk

guestfish

Try this first! The 'guestfish' tool (part of libguestfs-tools) is now installed on all cloudvirts. The following procedure with ceph defaults to read-only access. To make it read-write switch --ro to --rw (which should only even work with the instance shut off.) To mount an instance's filesystem that is stored on ceph:

guestfish with ceph

$ guestfish -d i-0001216c --ro -i

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

Operating system: 10.6
/dev/sda2 mounted on /
/dev/vd/second-local-disk mounted on /srv

><fs> cat /etc/resolv.conf
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain tools.eqiad.wmflabs
search tools.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.154.143
nameserver 208.80.154.24
options timeout:1 ndots:1

For read-only access to an instance's filesystem stored locally on a cloudvirt:

$ cd /var/lib/nova/instances/<instanceid>
$ guestfish --ro -i -a ./disk
libguestfs: warning: current user is not a member of the KVM group (group ID 121). This user cannot access /dev/kvm, so libguestfs may run very slowly. It is recommended that you 'chmod 0666 /dev/kvm' or add the current user to the KVM group (you might need to log out and log in again).

Welcome to guestfish, the guest filesystem shell for editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell

Operating system: 9.5
/dev/sda3 mounted on /

><fs> cat /etc/resolv.conf 
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain testlabs.eqiad.wmflabs
search testlabs.eqiad.wmflabs eqiad.wmflabs 
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2

><fs>

For read-write access, first stop the running vm and then as root:

# cd /var/lib/nova/instances/<instanceid>
# guestfish --rw -i -a ./disk

Welcome to guestfish, the guest filesystem shell forediting virtual machine filesystems and disk images.

Type: 'help' for help on commands
     'man' to read the manual
     'quit' to quit the shell

Operating system: 9.7
/dev/sda3 mounted on /

><fs> vi /etc/resolv.conf

guestfish alternatives

This uses nbd and qemu-nbd which is part of the qemu-utils package.

Make sure the nbd kernel module is loaded:

$ sudo modprobe -r nbd
$ sudo modprobe nbd max_part=16

/dev/nbd* should now be present

Mounting a flat file disk with qemu-nbd and accessing a relevant partition

1. Ensure the instance is not running. Otherwise, you may corrupt the disk

   nova stop <instance-id>

2. Change to the instance directory.

   Usually this is /var/lib/nova/instances/<instance-id> or /srv/<instance-id>

3. Connect the disk to the nbd device. Consider using the --read-only flag for read-only.

   qemu-nbd [ --read-only] -c /dev/nbd0 disk

This will create an nbd process for accessing the disk:

   root     29725     1  0 14:00 ?        00:00:00 qemu-nbd -c /dev/nbd0 /srv/b784faf3-9de2-4c4e-9df8-c8e2925bfab9/disk

4. Inspect the disk for partitions. partx -l /dev/nbd0

    1:        34-     2048 (     2015 sectors,      1 MB)
    2:      2049-  1048576 (  1046528 sectors,    535 MB)
    3:   1048577- 40894430 ( 39845854 sectors,  20401 MB)
    4:  40894464- 41940991 (  1046528 sectors,    535 MB)

In this case, partition 3 is the root device.

5. Mount the required partition. mount /dev/nbd0p3 /mnt/instances/

   mount -l | grep nbd
   /dev/nbd0p3 on /mnt/instances type ext4 (ro

   /dev/nbd<device>p<partition number> will directly mount a partition within a disk.

6. Inspect or modify contents of mount

7. Unmount the device. umount /mnt/instances

8. Detach the nbd device (this will terminate the nbd process). qemu-nbd -d /dev/nbd0

Other file mounting scenarios

If a disk an an ext3/4 file:
- qemu-nbd -c /dev/nbd[0-9] <disk>
- mount /dev/nbd[0-9] <mountpoint>
To attach only a certain partition to the nbd device:
- qemu-nbd --partition=<partition-number> -c /dev/nbd[0-9] <disk>
- mount /dev/nbd[0-9] <mountpoint>
If the disk is an LVM volume:
- qemu-nbd -c /dev/nbd[0-9] <disk>
- vgscan
- vgchange -ay
- mount /dev/<volume-group>/<logical-volume> <mountpoint>
If the disk is a new bootstrap_vz build:
- qemu-nbd -c /dev/nbd[0-9] <disk>
- mount /dev/nbd[0-9]p3 /tmp/mnt
If the disk does not have separate partitions for /boot and /
- fdisk -l /dev/nbd0
- Device Boot Start End Sectors Size Id Type
- /dev/nbd0p1 * 2048 41943039 41940992 20G 83 Linux
- (Assuming 512 byte sectors)
- 512 x 2048 (boot start offset) = 1048576
- mount -o offset=1048576 /dev/nbd0 /mnt/instances/
- (otherwise file system headers will not be present at byte 1 and
- mount will fail to mount thinking file system is bogus)

When finished, you should unmount the disk, then disconnect the volume:

If the disk is not an LVM volume:
- umount <mountpoint>
- qemu-nbd -d /dev/nbd[0-9]
If the disk is an LVM volume:
- umount <mountpoint>
- vgchange -an <volume-group>
- qemu-nbd -d /dev/nbd[0-9]

guestmount method

This is another method:

Locate the VM disk:

cloudvirt1020:~ $ for i in $(sudo virsh list --all | grep i.* | awk -F' ' '{print $2}') ; do echo -n "$i " ; sudo virsh dumpxml $i | grep nova:name ; done
i-0000025d       <nova:name>puppettestingui</nova:name>
i-00000262       <nova:name>aptproxy2</nova:name>
i-00000263       <nova:name>t2</nova:name>
i-00000264       <nova:name>t3</nova:name>

Once you know the internal instance name (i-xxxxx), locate the disk file

cloudvirt1020:~ $ sudo virsh dumpxml i-00000264 | grep "source file" | grep disk
      <source file='/var/lib/nova/instances/09865310-b440-4dc7-99ab-fb5f35be04fb/disk'/>

Shutdown the machine (from inside, from horizon, using virsh or whatever)
Copy the disk file to your home

cloudvirt1020:~ $ cp /var/lib/nova/instances/09865310-b440-4dc7-99ab-fb5f35be04fb/disk t3-disk.qcow2

Create a destination directory and mount the disk!

cloudvirt1020:~ $ mkdir mnt ; sudo guestmount -a t3-disk.qcow2 -m /dev/sda3 -o allow_other --rw mnt

The -o allow_other option is needed so other users can access the filesystem (e.g. Nagios will error out if it can't stat a mounted filesystem so this allows check_disk plugin to work correctly).

You can now read/write the instance disk in the mount point
When done, umount, copy back the instance disk and start the instance!

(Re)Setting root password on a mounted disk

   passwd --root </mnt/path> root

Fix VM disk corruption (fsck)

First, you'll need the mount the instance's disk. After doing so, you can simply run an fsck against it.

Some examples:

n case the VM disk requires an external fsck.

method 1: guestfish

This method works with ceph with no modifications.

SSH to the hypervisor running the instance.
Get hypervsor instance name (i-XXXXXXX)
Run following commands:

root@cloudvirt10XX:~# guestfish -d i-00001de2

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell
><fs> run
><fs> fsck ext4 /dev/sda3
0x4

May need to run in several different disks.

method 2: qemu-nbd

SSH to the hypervisor running the instance.
Get nova instance uid (for example bad6cee2-2748-4e14-9dcc-3ccdc28f279e)
Enable required kernel modules:

root@cloudvirt10XX:~# modprobe -r nbd
root@cloudvirt10XX:~# modprobe nbd max_part=16

Run following commands:

root@cloudvirt10XX:~# qemu-nbd --connect=/dev/nbd0 /var/lib/nova/instances/bad6cee2-2748-4e14-9dcc-3ccdc28f279e/disk
root@cloudvirt10XX:~# fdisk -l /dev/nbd0
root@cloudvirt10XX:~# fsck.ext4 /dev/nbd0p3 
root@cloudvirt10XX:~# qemu-nbd --disconnect /dev/nbd0

May need to run in several different disks. If needed, several nbd devices can be used as long as they don't share the number (i.e, /dev/nbd0, /dev/nbd1, etc).

If running this in a Jessie hypervisor for a Stretch VM disk, you may need a newer version of 'e2fsprogs'.

Also, if you want fsck to don't ask questions, you need additional arguments:

root@cloudvirt10XX:~# fsck.ext4 -fy /dev/nbd0p3

Trace a vnet device to an instance

VNET=<vnet-device>
for vm in $(virsh list | grep running | awk '{print $2}')
  do virsh dumpxml $vm|grep -q "$VNET" && echo $vm
done

Get the live virsh config for an instance

virsh dumpxml <instance id>

Get a screenshot of the instance's "screen"

virsh screenshot <instance id>

Send a keypress to the instance's "keyboard"

virsh send-key <instance id> <keycode>

Where keycode is the linux keycode. Most useful is "28" which is an ENTER.

A list of keycodes can be fetched from http://libvirt.org/git/?p=libvirt.git;a=blob_plain;f=src/util/keymaps.csv

Deleting an orphaned build request/VM stuck in BUILD status

Sometimes if rabbitmq loses a message (got restarted or any other issue), an instance might get stuck in "BUILD"-"scheduling" status, in that case the instance will only show up when listing the servers from the cli, like:

root@cloudcontrol1003:~# nova --os-project-id admin-monitoring list
+--------------------------------------+---------------------------+--------+------------+-------------+----------+
| ID                                   | Name                      | Status | Task State | Power State | Networks |
+--------------------------------------+---------------------------+--------+------------+-------------+----------+
| d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  | scheduling | NOSTATE     |          |
+--------------------------------------+---------------------------+--------+------------+-------------+----------+

To clean it up, you have to remove the build request from the api database (NOTE: This might vary from Openstack version to version, for this one found the tables by grepping a backup file), on Ussuri, you can drop the entry from the two tables "request_specs" and "build_requests", from one of the control nodes:

root@cloudcontrol1003:~# mysql -u root nova_api_eqiad1

mysql:root@localhost [nova_api_eqiad1]> begin;
Query OK, 0 rows affected (0.000 sec)

mysql:root@localhost [nova_api_eqiad1]> delete from request_specs where instance_uuid='d603b2e0-7b8b-462f-b74d-c782c2d34fea';
Query OK, 1 row affected (0.015 sec)

mysql:root@localhost [nova_api_eqiad1]> delete from build_requests where instance_uuid='d603b2e0-7b8b-462f-b74d-c782c2d34fea';
Query OK, 1 row affected (0.001 sec)

mysql:root@localhost [nova_api_eqiad1]> commit;
Query OK, 0 rows affected (0.161 sec)

mysql:root@localhost [nova_api_eqiad1]> Bye

Stuck migration/drain: generic case

Most of the times when a VM gets stuck but there's no errors, you can re-try to migrate it by reseting it's state to active in the original host:

nova reset-state --active <VMID>

And then retrying the drain from the cloudcontrol node:

root@cloudcontrol1003:~# wmcs-drain-hypervisor <HOST_TO_DRAIN>

Stuck migration/drain: certificate issues

If a migration fails and nova-compute logs contain certificate issues, it's related to the libvirtd certificates. See phab:T355067 as a past example.

To troubleshoot you may want to use virsh:

taavi@cloudvirt1060 ~ $ sudo virsh --connect qemu://cloudvirt1046.eqiad.wmnet/system?pkipath=/var/lib/nova

Stuck migration/drain: 'Networking client is experiencing an unauthorized exception.' / VM qemu process started on many/different host than openstack reports

Sometimes during a migration that got stuck, you might see the above log kibana, or when re-trying the migration, it will show up when showing the VM info (note the id and instance_name):

root@cloudvirt1021:~# openstack server show d946b010-3f17-49bb-aba0-4af66bac57a8
...
| OS-EXT-SRV-ATTR:host                | cloudvirt1021                                                                                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1021.eqiad.wmnet                                                                                                   |
| OS-EXT-SRV-ATTR:instance_name       | i-00002caf                                                                                                                  |
...
| id                                  | d946b010-3f17-49bb-aba0-4af66bac57a8                                                                                        |
...
| fault                               | {'code': 400, 'created': '2021-05-03T09:38:13Z', 'message': 'Networking client is experiencing an unauthorized exception.'} |
...

And you might see that the qemu process is either started on another host, or started on the current and another host:

dcaro@cumin1001:~$ sudo cumin 'cloudvirt1*' 'ps aux | grep i-00002caf'  # this is the instance_name
...
(1) cloudvirt1018.eqiad.wmnet
----- OUTPUT of 'ps aux | grep i-00002caf' -----
libvirt+ 12237  147  1.5 10374072 8433860 ?    Sl   09:37 417:21 /usr/bin/qemu-system-x86_64 -name guest=i-00002caf,...
...

In this case, the VM says it's on cloudvirt1021, but the process is running on cloudvirt1018.

NOTE: The following solution will stop the VM, so make sure to depool the VM from the service it runs on (if any, for example if it's an sgeexec node, use this).

Then you go to the original host (the one openstack says the VM is in, in this case cloudvirt1021) and stop the VM:

nova stop d946b010-3f17-49bb-aba0-4af66bac57a8  # this is the vm id

Go to the node where the qemu process is actually running (in our case coludvirt1018), and destroy the domain with virsh:

virsh destroy i-00002caf  # this is the instance_name

Then start the VM again on the original host:

nova start d946b010-3f17-49bb-aba0-4af66bac57a8  # this is the vm id

And once all the VM is started, you can retry the drain from the cloudcontrol node:

root@cloudcontrol1003:~# wmcs-drain-hypervisor cloudvirt1021

No valid host was found. There are not enough hosts available.

This error means that for some constraint/resource reasons it was unable to move that VM, some possibilities:

There's a server group that prevents the VM from moving (ex. strong affinity):

To check this, you can list the server groups in the project, for example for citelearn project:

aborrero@cloudcontrol1005:~ 2s 2 $ sudo wmcs-openstack server group list --os-project-id citelearn
+--------------------------------------+------------+----------+
| ID                                   | Name       | Policies |
+--------------------------------------+------------+----------+
| 5de0d9fd-963c-4b82-9cea-b099b2c1cce1 | Citelearn1 | affinity |
+--------------------------------------+------------+----------+

We see there that there's one affinity group, that's suspicious (they are not allowed currently, see [T276963]), to see the servers in the group:

aborrero@cloudcontrol1005:~$ sudo wmcs-openstack server group show 5de0d9fd-963c-4b82-9cea-b099b2c1cce1 -f yaml
 id: 5de0d9fd-963c-4b82-9cea-b099b2c1cce1
 members: b2539070-4992-438b-9082-8f86078a987a, cdc48393-4ff6-4653-9366-cab57a16c61c,
  d4ae59ff-f526-4bdf-80bd-7eff8d2b1fc8, dd2ba995-d26b-4318-b819-d5bf2f9df919, f7ed61f7-8b48-4a93-8ab9-a2dd9d3a4002
 name: Citelearn1
 policies: affinity

And there you see the servers that, as being in an affinity group, they will have to run on the same host. In this case, we removed the group as we don't allow strong affinity groups (for these same reason):

aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack server group delete 5de0d9fd-963c-4b82-9cea-b099b2c1cce1
aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack server group list --os-project-id citelearn  # empty result

And now retrying will succeed. TODO: If you find more issues, please extend this doc.