Portal:Cloud VPS/Admin/Troubleshooting


This page is for troubleshooting urgent issues. Routine maintenance tasks are documented on the VPS_Maintenance page.

Networking failures

Specific issues regarding base networking.

neutron based deployments

The network is deployed following the model/architecture described in the Neutron main article.

neutron agents alive/dead

We have seen that in our Mitaka-based deployments there can be issues with the RabbitMQ setup that result in dropped messages.

One of the most visible effects is neutron agents (l3, linuxbridge, metadata, etc.) flapping between dead and alive:

root@cloudcontrol1003:~# watch neutron agent-list
| id                                   | agent_type         | host            | availability_zone | alive | admin_state_up | binary                    |
| 0b2f519f-a5ab-4188-82bf-01431810d55a | DHCP agent         | cloudnet1003    | nova              | xxx   | True           | neutron-dhcp-agent        |
| 1071c198-ed57-4b5a-9439-30e66a31aa69 | Linux bridge agent | cloudvirtan1005 |                   | :-)   | True           | neutron-linuxbridge-agent |
| 28ac0947-f263-4655-98fe-f868325678ae | Linux bridge agent | cloudvirt1015   |                   | :-)   | True           | neutron-linuxbridge-agent |
| 2eeef198-8af7-4e5d-bd73-e14a2a8d2404 | Linux bridge agent | cloudvirtan1004 |                   | :-)   | True           | neutron-linuxbridge-agent |
| 3388792d-560d-4bfe-9054-addf1c239f4a | Linux bridge agent | cloudvirt1027   |                   | :-)   | True           | neutron-linuxbridge-agent |
| 468aef2a-8eb6-4382-abba-bc284efd9fa5 | DHCP agent         | cloudnet1004    | nova              | xxx   | True           | neutron-dhcp-agent        |
| 49b85656-d67b-44c3-ac71-e8c75b849783 | Linux bridge agent | cloudvirt1029   |                   | :-)   | True           | neutron-linuxbridge-agent |
| 4be214c8-76ef-40f8-9d5d-4c344d213311 | L3 agent           | cloudnet1003    | nova              | :-)   | True           | neutron-l3-agent          |
| 5b2a8c8b-3b13-4607-b0bd-460d507f5de1 | Linux bridge agent | cloudvirt1024   |                   | xxx   | True           | neutron-linuxbridge-agent |
| 65f9d324-5126-4336-8f52-001cd0c9fdd1 | Linux bridge agent | cloudvirt1016   |                   | :-)   | True           | neutron-linuxbridge-agent |
| 6dafa3f3-9aeb-47b6-9535-e0932abe4435 | Linux bridge agent | cloudvirt1014   |                   | :-)   | True           | neutron-linuxbridge-agent |

A restart of the RabbitMQ service on the control server should help fix things:

root@cloudcontrol1003:~# systemctl stop rabbitmq-server.service ; systemctl start rabbitmq-server.service
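After the restart, the agent table can be checked programmatically instead of by eye. A minimal sketch (the helper name and sample rows are illustrative, not part of the real tooling):

```shell
# Sketch: count dead neutron agents ("xxx" in the alive column) so the
# RabbitMQ restart can be verified in a loop instead of by watching the
# table. The helper name and sample rows are illustrative.
count_dead_agents() {
    # neutron agent-list renders dead agents as "xxx" and live ones as ":-)"
    grep -c '| xxx ' || true
}

# In production you would pipe `neutron agent-list` in directly.
sample='| 0b2f519f | DHCP agent | cloudnet1003 | nova | xxx | True |
| 1071c198 | Linux bridge agent | cloudvirtan1005 |  | :-) | True |
| 468aef2a | DHCP agent | cloudnet1004 | nova | xxx | True |'
dead=$(printf '%s\n' "$sample" | count_dead_agents)
echo "dead agents: $dead"
```

A count that stays above zero a few minutes after the restart means RabbitMQ is still dropping messages.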

duplicated packets in the network

This is very likely caused by 2 neutron l3-agents running in active mode at the same time.

Symptoms are usually:

  • asymmetric routing
  • DUP! pings
  • unreachable bastions and/or instances

This can be easily checked by running this command:

root@cloudcontrol1003:~# neutron l3-agent-list-hosting-router cloudinstances2b-gw
| id                                   | host         | admin_state_up | alive | ha_state |
| 8af5d8a1-2e29-40e6-baf0-3cd79a7ac77b | cloudnet1003 | True           | :-)   | active   |
| 970df1d1-505d-47a4-8d35-1b13c0dfe098 | cloudnet1004 | True           | :-)   | standby  |

If two of them are active, you have to disable one, following the procedure described in detail in the manual routing failover procedure, summarized here (try the steps in order until you get to a working state):

  • root@cloudcontrol1003:~# neutron agent-update <l3-agent uuid> --admin-state-down
  • user@cloudnet1004:~$ sudo systemctl stop neutron-metadata-agent.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-linuxbridge-agent.service
  • user@cloudnet1004:~$ sudo reboot
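The active/active condition above can be detected with a small script. This is a sketch against captured `neutron l3-agent-list-hosting-router` output; the helper name is made up:

```shell
# Sketch: warn when more than one l3 agent reports ha_state "active".
# The function name and sample rows are illustrative; in production you
# would pipe `neutron l3-agent-list-hosting-router cloudinstances2b-gw` in.
count_active_l3() {
    grep -c '| active ' || true
}

sample='| 8af5d8a1 | cloudnet1003 | True | :-) | active   |
| 970df1d1 | cloudnet1004 | True | :-) | standby  |'
active=$(printf '%s\n' "$sample" | count_active_l3)
if [ "$active" -gt 1 ]; then
    echo "WARNING: $active active l3 agents -- run the failover procedure"
else
    echo "OK: single active l3 agent"
fi
```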

network down for VMs in a cloudvirt (the agents case)

We have observed that, from time to time, neutron agents can get stuck in a loop of going down, resyncing with the server, coming back up, and then going down again.

Symptoms are usually:

  • one or more cloudvirts lose connectivity for the VMs running on them, in cycles (i.e., the network works one minute and not the next)
  • networking in Cloud VPS is inconsistent in general
  • neutron agents going alive/dead at random

Unfortunately, we don't know a solution for this other than restarting things and hoping they come back healthy. We observed some issues in the message queues (RabbitMQ), which points towards dropped or unreplied messages. To counter this, we tuned heartbeat messages and timeouts, trying to make neutron behave as robustly as we can. This is probably a bug in the Mitaka suite, fixed in later versions.

See Neutron agents alive/dead.

network down for VMs in a cloudvirt (the VLAN case)

The other common case is issues with the physical switches in the rack. Each cloudvirt/cloudnet box needs a specific VLAN/trunking configuration on each interface to work. The switches also need to be trunked to each other, so VLANs are connected between switches.

Symptoms are usually:

  • there has been some work recently on switches or physical equipment near this server
  • there has been a rename/install/reimage related to this server
  • if you run a tcpdump on the affected NIC (usually eth1), you will see packets flowing in the outbound direction only

Please ask DC-ops or Traffic to re-check the switching configuration, as described in the main Neutron article.

VM cannot communicate with other VMs in different cloudvirt (the ARP table case)

This only happened once so far but it's documented here just in case it happens again (phab:T209166).

If a VM can communicate with everything except for a few other VMs running on a different cloudvirt, the physical switch connecting the cloudvirts might be confused about where to forward packets. This is a pure L2 networking issue.

To confirm this is really the case, make a list of all the other VMs that this VM cannot talk to. Narrow it down to the cloudvirt(s) that are experiencing this issue.

Next, use tcpdump at each layer to confirm that packets are leaving one VM but not making their way to the destination VM (e.g. while a ping is running between the two VMs):

  • Source VM: tcpdump -n -i eth0 icmp or arp
  • Source cloudvirt bridge: tcpdump -n -e -i br* '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
  • Source cloudvirt eth1.1105 ("wire"): tcpdump -n -e -i eth1.1105 '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'

And then on the destination cloudvirt:

  • Destination cloudvirt eth1.1105 ("wire"): tcpdump -n -e -i eth1.1105 '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
  • Destination cloudvirt bridge: tcpdump -n -e -i br* '(ether host $SOURCE_VM_MAC_ADDR or ether host $DEST_VM_MAC_ADDR) and (icmp or arp)'
  • Destination VM: tcpdump -n -i eth0 icmp or arp

Hopefully, the tcpdump filters will prevent too many packets from flying through the screen and you'll be able to tell where things are breaking.

You can also run tcpdump on the tap-* interface that belongs to each VM (use virsh dumpxml to discover the instance/tap/instance_name combination).
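To find the tap interface for a given VM without reading through the whole XML, the `virsh dumpxml` output can be scraped. A sketch, using a trimmed, hypothetical XML fragment standing in for real dumpxml output:

```shell
# Sketch: extract the tap interface name for a VM from `virsh dumpxml`
# output so it can be handed straight to tcpdump. The XML fragment below
# is a trimmed, hypothetical example of what dumpxml prints.
tap_for_instance() {
    # libvirt renders the VIF as <target dev='tap...'/> inside <interface>
    sed -n "s/.*<target dev='\(tap[^']*\)'.*/\1/p"
}

xml="<interface type='bridge'>
  <target dev='tapdeadbeef-12'/>
</interface>"
tap=$(printf '%s\n' "$xml" | tap_for_instance)
echo "$tap"
# then: tcpdump -n -e -i "$tap" 'icmp or arp'
```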

If you see that packets are leaving the source VM, arriving at the destination VM, and being replied to by the destination VM, but the replies never make it back to the source VM, this could be the issue.

In that case, contact the networking staff and ask them to remove the source/destination MAC addresses from the switch's ARP table (even if they look correct). For example, <asw2-b-eqiad> clear ethernet-switching table fa:16:3e:91:a2:93

Network communication between cloudvirts involves:

  • Each VM being a QEMU process and having a tap-* to talk to the outside world
  • This tap-* interface being added to a bridge on the cloudvirt
  • The VLAN 1105 interface being part of the bridge as well (so packets can leave the cloudvirt and go to other cloudvirts)
  • The switch between cloudvirts
  • OpenStack security groups (that get translated to iptables rules)

Notice it doesn't involve the cloudnet hosts because communication is internal to the cluster.

nova-network based deployments

The network node is either labnet1001 or labnet1002, running the nova-network service. Only one is active at a time; the other is an inactive failover. Before taking any of these steps, check site.pp in puppet to see which node is active -- role::nova::network is commented out on the standby node.


Symptoms are usually:

  • ssh connections to instances display an unexpected host-key warning
  • In some cases when the nova-network service is down, traffic bound for a Cloud VPS instance instead hits the network node directly. In this case, ssh will try to log you in to the node itself rather than the instance behind the node. This gets you a host-key (or user key) failure.
  • All Cloud VPS instances unreachable
  • Web services running on multiple instances fail at the same time

Things to try first:

  • Restart nova-network on the active network node (this works surprisingly often)
 service nova-network restart
  • Check iptables and try to figure out what's happening


If the active network node is completely dead, you'll need to switch the network service to the backup node. Note that this switch-over WILL cause network downtime for Cloud VPS, so outside of an emergency don't do this without scheduling a window in advance.

This switchover requires you to muck about in the nova database. At the moment, this database is hosted on m5-master, aka db1073. You can access the database like so:

 $ sudo su -
 # mysql nova --skip-ssl
  • disable alerting for pretty much everything about Cloud VPS hosts
  • stop puppet on both network nodes (old and new)
  • merge a puppet patch adding labs::openstack::nova::network to the new network host and removing it from the old network host, and updating hiera settings about the active network node.
  • change the network record (today, newhostname was 'labnet1002')
  • This is probably how new floating IPs know what to set their host as.
   MariaDB MISC m5 localhost nova > select * from networks\G
   # note network record id, in this case it is '2'
   MariaDB MISC m5 localhost nova > update networks set host = '<newhostname>' where id=2;
  • reassign floating IPs to the new network host (again, today newhostname was 'labnet1002')
  • This is how a given network node knows to set up natting for each floating ip
   MariaDB MISC m5 localhost nova > update floating_ips set host = '<newhostname>' where host = '<oldhostname>';
  • release on the old network host
  • Shut down the active br1102 bridge interface (OpenStack will not migrate the IP while it is in use)
  • ifconfig br1102 down
  • Enable puppet, run puppet, (re)start nova-network on the new network host
   $ sudo service nova-network restart
  • Enable puppet, run puppet on the old network host
  • verify that the new network host has grabbed the gateway IP
  • ip addr show and verify the gateway IP has migrated
  • verify that floating IPs have moved over to the new host
   $ sudo iptables -t nat -L -n
  • change routing so that floating IPs are routed to the new host
  • On both cr1 and cr2 (Note: the next-hop should reflect the active node -- this example shows labnet1002):
  • delete routing-options static route next-hop
  • delete routing-options static route next-hop
  • set routing-options static route next-hop
  • set routing-options static route next-hop
  • restart and then stop nova-network on the old node, just so it knows it's not responsible anymore
   $ sudo service nova-network restart
   $ sudo service nova-network stop
  • restart keystone on labcontrol1001. I don't know why, but it died while we did all this.
   $ sudo service keystone restart
  • Presuming the new network host includes the nova api, add the new network host to the keystone endpoints and remove the failed host
openstack endpoint create --region eqiad compute public http://<newhostname>.eqiad.wmnet:8774/v2.1
openstack endpoint create --region eqiad compute internal http://<newhostname>.eqiad.wmnet:8774/v2.1
openstack endpoint create --region eqiad compute admin http://<newhostname>.eqiad.wmnet:8774/v2.1
openstack endpoint delete <old endpoints>
  • re-enable alerting


Galera

Galera-related troubleshooting.

Galera won't start up

Galera runs an instance on each of the cloudcontrol nodes. If any one of them restarts it should be able to rejoin the existing cluster without trouble. If the whole cluster is down, galera will refuse to start up again to avoid split-brain. This manifests as mysqld not starting, and a log message like:

root@cloudcontrol2001-dev:~# journalctl -u mariadb -f
WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

To restart the cluster from scratch (from the primary node, generally the lowest-number cloudcontrol), first tell Galera that it's ok to create a new cluster by editing /var/lib/mysql/grastate.dat:

safe_to_bootstrap: 1

Then, bootstrap the new cluster:

root@cloudcontrol2001-dev:~# galera_new_cluster
root@cloudcontrol2001-dev:~# systemctl start mariadb

Finally, make sure that we can't re-bootstrap on top of the now-running cluster by changing /var/lib/mysql/grastate.dat back to how it was:

safe_to_bootstrap: 0

You can now check that mariadb starts correctly in the whole cluster:

user@cumin2001:~ $ sudo cumin --force -x A:cloudcontrol-codfw1dev 'systemctl status mariadb'
3 hosts will be targeted:
===== NODE GROUP =====                           
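The bootstrap dance above can be scripted; the sed toggle of safe_to_bootstrap is the only non-obvious part. A sketch (the helper name is made up, and the cluster commands are left commented so nothing runs by accident):

```shell
# Sketch of the bootstrap procedure above; the sed toggle is the only
# non-obvious part. The helper name is made up, and the cluster commands
# are commented out so nothing runs by accident.
GRASTATE=/var/lib/mysql/grastate.dat   # path from the procedure above

mark_bootstrap() {   # $1 = 0 or 1
    sed -i "s/^safe_to_bootstrap:.*/safe_to_bootstrap: $1/" "$GRASTATE"
}

# mark_bootstrap 1
# galera_new_cluster
# systemctl start mariadb
# mark_bootstrap 0
```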

Galera backup recovery

In case of complete catastrophe, the DBs hosted on Galera are backed up daily by Bacula. Backups can be recovered following the Bacula#Restore_(aka_Panic_mode) guide. Restoring will provide you with a file.sql.gz dump of each database.

Backups are run on each node. If everything has been going well, the backups should be almost entirely redundant; when restoring, choose the most recently backed-up host. If, on the other hand, we've suffered a split-brain, you will need to make a choice about which node to treat as the winner, i.e., which node's backup to restore.

Once the mysql service is running on at least one server, follow these example commands to restore a given database:

root@cloudcontrol2001-dev:~# gunzip /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql.gz
root@cloudcontrol2003-dev:~# mysql -u root -e "CREATE DATABASE IF NOT EXISTS glance;"
root@cloudcontrol2001-dev:~# mysql -u root glance < /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql
root@cloudcontrol2003-dev:~# mysql -u root -e "USE glance; SHOW TABLES;"
| Tables_in_glance                 |
| alembic_version                  |
| image_locations                  |
| image_members                    |
| image_properties                 |
| image_tags                       |
| images                           |
| metadef_namespace_resource_types |
| metadef_namespaces               |
| metadef_objects                  |
| metadef_properties               |
| metadef_resource_types           |
| metadef_tags                     |
| migrate_version                  |
| task_info                        |
| tasks                            |
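When several databases need restoring, the per-database commands above can be looped. A sketch, assuming dump filenames follow the <db>-YYYYMMDDHHMM.sql.gz pattern shown above (the helper name is made up, and the restore loop is left commented):

```shell
# Sketch: restore every dump from the Bacula restore directory used in
# the example above. The db name is everything before the trailing
# -YYYYMMDDHHMM timestamp; the helper name is made up.
dbname_from_dump() {   # glance-202007030408.sql.gz -> glance
    basename "$1" | sed 's/-[0-9]\{12\}\.sql\.gz$//'
}

dbname_from_dump /var/tmp/bacula-restores/srv/backups/glance-202007030408.sql.gz

# for f in /var/tmp/bacula-restores/srv/backups/*.sql.gz; do
#     db=$(dbname_from_dump "$f")
#     mysql -u root -e "CREATE DATABASE IF NOT EXISTS \`$db\`;"
#     gunzip -c "$f" | mysql -u root "$db"
# done
```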


Nova-fullstack

Nova-fullstack is a testing agent that runs on one of the nova nodes, currently cloudcontrol1003. It periodically creates a new VM, waits for it to launch and run puppet, then verifies the DNS record and tests ssh access.

If a VM launches but is unreachable for any reason after the specified timeout, fullstack gives up and leaves the VM behind for future research. If a large number of VMs (currently 6) are leaked like this, an alert will fire. Icinga will raise a warning if a smaller number of leaks is detected. Note that these leaked instances might be the result of an ongoing failure, or they might be evidence of a rare/intermittent failure that has accumulated leaks over many weeks or months. In the latter case it's standard practice to just delete the leaked VMs to clear the warning, as these leaks are often the result of normal maintenance.

Nova-fullstack uses the 'admin-monitoring' openstack project. To see the list of leaked VMs:

root@cloudcontrol1003:~# source /usr/local/bin/observerenv.sh
root@cloudcontrol1003:~# OS_PROJECT_ID=admin-monitoring openstack server list
| ID                                   | Name                  | Status | Networks                                           |
| 5a4eb9eb-0977-4b16-b0a3-b86ca7027771 | fullstackd-1561665242 | ACTIVE | lan-flat-cloudinstances2b=              |
| 26d3b3cf-7392-455e-8afc-5085762b5045 | fullstackd-1561661295 | ACTIVE | lan-flat-cloudinstances2b=, |

Nova-fullstack also logs to syslog.
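The epoch timestamp embedded in the fullstackd-<epoch> names makes it easy to tell how old a leaked VM is without querying nova. A sketch (the helper name is made up):

```shell
# Sketch: the fullstackd-<epoch> names embed the creation time, so a
# leaked VM's age can be computed without asking nova. Helper name made up.
leak_age_days() {   # $1 = VM name like fullstackd-1561665242
    created=${1#fullstackd-}
    now=$(date +%s)
    echo $(( (now - created) / 86400 ))
}

echo "fullstackd-1561665242 is $(leak_age_days fullstackd-1561665242) days old"
```

A VM that is weeks old is more likely an accumulated leak from routine maintenance than evidence of an ongoing failure.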

Supporting component failures

Issues with other components in Cloud VPS and in the depths of OpenStack.

instance DNS failure


There are two designate/pdns nodes: labservices1001 and labservices1002. The active node is determined in Hiera by a few settings:

labs_certmanager_hostname: <primary designate host, generally labservices1001>
labs_designate_hostname: <primary designate host>
labs_designate_hostname_secondary: <other designate host, generally labservices1002>

In order to switch to a new primary designate host, change the $labs_designate_hostname and $labs_certmanager_hostname settings. That's not enough, though! PowerDNS will reject DNS updates from the new server because it is not the master, which will result in syslog messages like this:

    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for eqiad.wmflabs from which is not a master

To change this, change the master in the pdns database:

$ ssh m5-master.eqiad.wmnet
$ sudo su -
# mysql pdns
MariaDB MISC m5 localhost pdns > select * from domains;
    | id | name               | master              | last_check | type  | notified_serial | account        | designate_id  
    |  1 | eqiad.wmflabs      | | 1448252102 | SLAVE |            NULL | noauth-project | 114f1333c2c144
    |  2 | 68.10.in-addr.arpa | | 1448252099 | SLAVE |            NULL | noauth-project | 8d114f3c815b46
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=1;
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=2;

Typically the dns server labs-ns2.wikimedia.org is associated with the primary designate server, and labs-ns3.wikimedia.org with the secondary. You will need to make appropriate hiera changes to modify those as well.
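To avoid typos when running the UPDATE statements above on m5-master, they can be generated from the new host's IP. A sketch (the helper name and example IP are illustrative):

```shell
# Sketch: generate the two UPDATE statements from the procedure above
# for a given new primary designate host IP, so they can be pasted into
# the mysql shell. Helper name and example IP are made up.
designate_master_sql() {   # $1 = IP of new primary designate host
    for id in 1 2; do
        echo "update domains set master=\"$1:5354\" where id=$id;"
    done
}

designate_master_sql 10.0.0.1   # hypothetical IP, for illustration
```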

User workflow issues

Issues that affect common usage/workflows by end users.

incorrect quota violations

Nova is not great at tracking resource usage in projects, and sometimes fails to reset usage values when resources are freed. That can mean that nova will sometimes report a quota violation and refuse to allocate new resources even though you can see perfectly well that the project isn't actually over quota.

The symptom is typically that new instance creation fails without warning or explanation. The quota error shows up in 'openstack server show <failed instance>':

Failed to allocate the network(s) with error Maximum number of fixed ips exceeded, not rescheduling

Here's an example of what that looks like in the database:

root@MISC m5[nova]> select * from quota_usages where project_id='contintcloud';
| created_at          | updated_at          | deleted_at | id   | project_id   | resource        | in_use | reserved | until_refresh | deleted | user_id         |
| 2015-01-08 14:48:45 | 2015-01-08 14:48:45 | NULL       | 1218 | contintcloud | security_groups |      0 |        0 |          NULL |       0 | yuvipanda       |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1385 | contintcloud | instances       |     11 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1386 | contintcloud | ram             |  45056 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:10 | 2018-03-03 13:21:48 | NULL       | 1387 | contintcloud | cores           |     22 |        0 |          NULL |       0 | nodepoolmanager |
| 2015-04-21 09:26:33 | 2018-03-03 13:21:36 | NULL       | 1388 | contintcloud | fixed_ips       |     16 |      193 |          NULL |       0 | NULL            |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1389 | contintcloud | instances       |      0 |        0 |          NULL |       0 | hashar          |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1390 | contintcloud | ram             |      0 |        0 |          NULL |       0 | hashar          |
| 2015-04-21 10:51:38 | 2017-09-05 10:25:14 | NULL       | 1391 | contintcloud | cores           |      0 |        0 |          NULL |       0 | hashar          |
| 2015-06-17 20:40:55 | 2015-06-17 20:40:55 | NULL       | 1525 | contintcloud | floating_ips    |      0 |        0 |          NULL |       0 | NULL            |
| 2015-10-20 14:23:11 | 2016-08-10 23:59:24 | NULL       | 1812 | contintcloud | security_groups |      0 |        0 |          NULL |       0 | hashar          |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2504 | contintcloud | instances       |      0 |        0 |          NULL |       0 | novaadmin       |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2505 | contintcloud | ram             |      0 |        0 |          NULL |       0 | novaadmin       |
| 2016-12-01 14:01:50 | 2016-12-01 14:45:20 | NULL       | 2506 | contintcloud | cores           |      0 |        0 |          NULL |       0 | novaadmin       |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2590 | contintcloud | instances       |      0 |        0 |          NULL |       0 | andrew          |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2591 | contintcloud | ram             |      0 |        0 |          NULL |       0 | andrew          |
| 2017-02-19 01:30:55 | 2017-02-19 01:37:29 | NULL       | 2592 | contintcloud | cores           |      0 |        0 |          NULL |       0 | andrew          |

If you're sure that this quota violation is in error, it's possible to force nova to re-calculate the usages by setting the value to -1 in the database:

root@MISC m5[nova]> update quota_usages set reserved='-1' where project_id='contintcloud';


root@MISC m5[nova]> update quota_usages set in_use='-1' where project_id='contintcloud';

The 'reserved' and 'in_use' values are dynamically generated so it's safe to set them this way; a value of -1 will force a recalculation the next time a quota is checked.
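The two UPDATE statements can likewise be generated per project to avoid typos. A sketch (the helper name is made up):

```shell
# Sketch: generate the quota-reset statements for a project. A value of
# -1 forces nova to recalculate usage on the next quota check, per the
# note above. The helper name is made up.
quota_reset_sql() {   # $1 = project id
    echo "update quota_usages set reserved='-1' where project_id='$1';"
    echo "update quota_usages set in_use='-1' where project_id='$1';"
}

quota_reset_sql contintcloud
```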

Disable VM scheduling

If we run out of virt capacity or VM creation is causing problems, we can prevent any new VMs from being created by emptying the scheduler pool in hiera:

diff --git a/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml b/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
index c68e789..1394efd 100644
--- a/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
+++ b/hieradata/eqiad/profile/openstack/eqiad1/nova.yaml
@@ -35,10 +35,4 @@ profile::openstack::eqiad1::nova::physical_interface_mappings:
 # cloudvirtanXXXX: reserved for gigantic cloud-analytics worker nodes
-profile::openstack::eqiad1::nova::scheduler_pool:
-  - cloudvirt1013
-  - cloudvirt1025
-  - cloudvirt1026
-  - cloudvirt1027
-  - cloudvirt1028
-  - cloudvirt1029
+profile::openstack::eqiad1::nova::scheduler_pool: []

Instance Troubleshooting

Some notes on troubleshooting VMs in Cloud VPS.

Reset state of an instance

You might have to do this if the recorded state of the instance doesn't correspond to reality (it says REBOOT or SHUTOFF when it isn't, or vice versa), or if nova isn't responding to any commands at all about a particular instance.

nova reset-state --active <uuid>

This changes the state of the instance with that uuid to 'ACTIVE', and hopefully fixes things (or blows up a baby panda, unsure!)

Block Migration

Because we don't use shared storage for instance volumes, true live-migration is not available. Block migration works pretty well, though -- it causes a brief (minute or two) interruption to an instance but does not register as a reboot, and most running services should survive a block migration without any complaint.

This is useful for rebalancing when a compute node is overloaded, or for evacuating instances from a failing node.

On the nova controller (e.g. virt1000):

   source /root/novaenv.sh
   nova live-migration --block-migrate <instanceid> <targethost>

You can check the status of a migrating instance with 'nova show <instanceid>'. Its status will show as 'migrating' until the migration is complete.
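To wait for a migration to finish instead of re-running 'nova show' by hand, the status field can be scraped from its table output. A sketch against a captured sample line (the helper name is made up, and the polling loop is left commented):

```shell
# Sketch: scrape the status field from `nova show` table output so a
# loop can wait for the migration to finish. The sample line stands in
# for real output; the helper name is made up.
migration_status() {
    awk -F'|' '$2 ~ /^ *status *$/ {gsub(/ /, "", $3); print $3}'
}

sample='| status                | migrating                |'
st=$(printf '%s\n' "$sample" | migration_status)
echo "$st"

# while [ "$(nova show "$instanceid" | migration_status)" = "migrating" ]; do
#     sleep 10
# done
```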

NOTE: There is one serious bug in the block-migrate feature in Havana. The migrate process attempts to check quotas on the target node, but ignores overprovision ratios. That means that the nova scheduler will frequently fill a host to the point where it can no longer accept live migrations. Because of this bug it will probably be necessary to keep two empty compute nodes in order to support complete evacuation of any one node.

Recompress a live-migrated instance

In Nova icehouse (and possibly later versions) a block migrate removes the copy-on-write elements of the instance, causing it to take up vastly more space on the new host. The instance can be recompressed if you stop it first (at which point you might as well have used wmcs-cold-migrate in the first place.) Here's an example of recompressing:

 andrew@labvirt1002:~$ sudo su -
 root@labvirt1002:~# cd /var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Make sure that the instance is STOPPED with 'nova stop'
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# mv disk disk.bak
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# qemu-img convert -f qcow2 -O qcow2 disk.bak disk
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Restart instance, make sure it is working.
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# rm disk.bak

Fixing an instance that won't reboot

Occasionally an instance may fail to reboot. You can usually solve this by rebooting it via nova, but occasionally that fails as well. You can force a reboot by "destroying" the instance and then telling nova to reboot it, which causes nova to "create" the instance again. Of course, "destroy" and "create" really just kill the kvm process and start it again. You should not "delete" or "terminate" the instance.

To force reboot the instance, do the following:

  1. Figure out which host the instance is running on
  2. Destroy the instance (<instance-id> can be found via virsh list):
    virsh destroy <instance-id>
  3. If you see an error like the one below, you'll need to restart the libvirt-bin process, then retry the destroy:
    Timed out during operation: cannot acquire state change lock
  4. Tell nova to reboot the instance via "reboot"

Root console access

Most Debian VMs have a root console that can be accessed as root on a cloudvirt.

First, determine the host and nova ID of the VM in question. This is available via OpenStack Browser or from 'openstack server show'. Log on to the cloudvirt that hosts the instance, and then:

$ sudo su -
# virsh console <instance-id>
 Connected to domain i-000045b2
 Escape character is ^]

 root@consoletests:~# pwd

BE AWARE that this console is the same console that is logged to the horizon 'console' tab. Anything you type there will be visible to any project member. Be careful not to type visible passwords or other secure data on the console.

Older VMs have their console on serial1:

$ sudo su -
# virsh console --devname serial1 <instance-id>
 Connected to domain i-000045b2
 Escape character is ^]

 root@consoletests:~# pwd

Note that the console does not start up until after the firstboot script of a VM has completed; tty setup happens after that.

Use CTRL + 5 to exit the console.

Mounting an instance disk


Try this first! The 'guestfish' tool (part of libguestfs-tools) is now installed on all cloudvirts.

For read-only access to an instance's filesystem on a cloudvirt:

$ cd /var/lib/nova/instances/<instanceid>
$ guestfish --ro -i -a ./disk
libguestfs: warning: current user is not a member of the KVM group (group ID 121). This user cannot access /dev/kvm, so libguestfs may run very slowly. It is recommended that you 'chmod 0666 /dev/kvm' or add the current user to the KVM group (you might need to log out and log in again).

Welcome to guestfish, the guest filesystem shell for editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell

Operating system: 9.5
/dev/sda3 mounted on /

><fs> cat /etc/resolv.conf 
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain testlabs.eqiad.wmflabs
search testlabs.eqiad.wmflabs eqiad.wmflabs 
options timeout:2 ndots:2


For read-write access, first stop the running vm and then as root:

# cd /var/lib/nova/instances/<instanceid>
# guestfish --rw -i -a ./disk

Welcome to guestfish, the guest filesystem shell for editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
     'man' to read the manual
     'quit' to quit the shell

Operating system: 9.7
/dev/sda3 mounted on /

><fs> vi /etc/resolv.conf

guestfish alternatives

This uses nbd and qemu-nbd which is part of the qemu-utils package.

Make sure the nbd kernel module is loaded:

$ sudo modprobe -r nbd
$ sudo modprobe nbd max_part=16

/dev/nbd* should now be present

Mounting a flat file disk with qemu-nbd and accessing a relevant partition

1. Ensure the instance is not running. Otherwise, you may corrupt the disk

   nova stop <instance-id>

2. Change to the instance directory.

   Usually this is /var/lib/nova/instances/<instance-id> or /srv/<instance-id>

3. Connect the disk to the nbd device. Consider using the --read-only flag for read-only.

   qemu-nbd [--read-only] -c /dev/nbd0 disk

This will create an nbd process for accessing the disk:

   root     29725     1  0 14:00 ?        00:00:00 qemu-nbd -c /dev/nbd0 /srv/b784faf3-9de2-4c4e-9df8-c8e2925bfab9/disk

4. Inspect the disk for partitions. partx -l /dev/nbd0

    1:        34-     2048 (     2015 sectors,      1 MB)
    2:      2049-  1048576 (  1046528 sectors,    535 MB)
    3:   1048577- 40894430 ( 39845854 sectors,  20401 MB)
    4:  40894464- 41940991 (  1046528 sectors,    535 MB)

In this case, partition 3 is the root device.

5. Mount the required partition. mount /dev/nbd0p3 /mnt/instances/

   mount -l | grep nbd
   /dev/nbd0p3 on /mnt/instances type ext4 (ro
   /dev/nbd<device>p<partition number> will directly mount a partition within a disk.

6. Inspect or modify contents of mount

7. Unmount the device. umount /mnt/instances

8. Detach the nbd device (this will terminate the nbd process). qemu-nbd -d /dev/nbd0

Other file mounting scenarios
  1. If the disk is an ext3/4 file:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9] <mountpoint>
  2. To attach only a certain partition to the nbd device:
    • qemu-nbd --partition=<partition-number> -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9] <mountpoint>
  3. If the disk is an LVM volume:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • vgscan
    • vgchange -ay
    • mount /dev/<volume-group>/<logical-volume> <mountpoint>
  4. If the disk is a new bootstrap_vz build:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9]p3 /tmp/mnt
  5. If the disk does not have separate partitions for /boot and /
    • fdisk -l /dev/nbd0
    • Device Boot Start End Sectors Size Id Type
    • /dev/nbd0p1 * 2048 41943039 41940992 20G 83 Linux
    • (Assuming 512-byte sectors)
    • 512 x 2048 (boot start offset) = 1048576
    • mount -o offset=1048576 /dev/nbd0 /mnt/instances/
    • (Without the offset, the filesystem headers will not be found at the
    • expected position, and mount will fail, assuming the filesystem is bogus)
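The offset arithmetic in case 5 can be checked quickly in the shell; the sector size and start sector below are the example values from the fdisk output above:

```shell
# Compute the byte offset for mount -o offset=... from the partition's
# start sector; 512-byte sectors are assumed (verify with fdisk -l).
SECTOR_SIZE=512
START_SECTOR=2048   # "Start" column from fdisk -l /dev/nbd0
OFFSET=$((SECTOR_SIZE * START_SECTOR))
echo "$OFFSET"      # 1048576
# mount -o offset=$OFFSET /dev/nbd0 /mnt/instances/
```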

When finished, you should unmount the disk, then disconnect the volume:

  1. If the disk is not an LVM volume:
    • umount <mountpoint>
    • qemu-nbd -d /dev/nbd[0-9]
  2. If the disk is an LVM volume:
    • umount <mountpoint>
    • vgchange -an <volume-group>
    • qemu-nbd -d /dev/nbd[0-9]

guestmount method

This is another method:

  • Locate the VM disk:
cloudvirt1020:~ $ for i in $(sudo virsh list --all | grep i.* | awk -F' ' '{print $2}') ; do echo -n "$i " ; sudo virsh dumpxml $i | grep nova:name ; done
i-0000025d       <nova:name>puppettestingui</nova:name>
i-00000262       <nova:name>aptproxy2</nova:name>
i-00000263       <nova:name>t2</nova:name>
i-00000264       <nova:name>t3</nova:name>
  • Once you know the internal instance name (i-xxxxx), locate the disk file
cloudvirt1020:~ $ sudo virsh dumpxml i-00000264 | grep "source file" | grep disk
      <source file='/var/lib/nova/instances/09865310-b440-4dc7-99ab-fb5f35be04fb/disk'/>
  • Shut down the machine (from inside, from Horizon, with virsh, etc.)
  • Copy the disk file to your home directory
cloudvirt1020:~ $ cp /var/lib/nova/instances/09865310-b440-4dc7-99ab-fb5f35be04fb/disk t3-disk.qcow2
  • Create a destination directory and mount the disk!
cloudvirt1020:~ $ mkdir mnt ; sudo guestmount -a t3-disk.qcow2 -m /dev/sda3 -o allow_other --rw mnt

The -o allow_other option is needed so other users can access the filesystem (e.g. Nagios will error out if it can't stat a mounted filesystem so this allows check_disk plugin to work correctly).

  • You can now read/write the instance disk in the mount point
  • When done, umount, copy back the instance disk and start the instance!
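The guestmount steps above can be condensed into a dry-run sketch. The disk path and partition (/dev/sda3) are the examples from the virsh dumpxml output, and guestunmount is libguestfs' counterpart for unmounting; remove the echo prefixes to run the commands for real:

```shell
# Dry-run sketch of the guestmount workflow; run as root without the
# 'echo' prefixes to execute. DISK is the example path from above.
DISK=/var/lib/nova/instances/09865310-b440-4dc7-99ab-fb5f35be04fb/disk
echo cp "$DISK" t3-disk.qcow2
echo guestmount -a t3-disk.qcow2 -m /dev/sda3 -o allow_other --rw mnt
echo guestunmount mnt     # unmount when done, before copying the disk back
echo cp t3-disk.qcow2 "$DISK"
```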

(Re)Setting root password on a mounted disk

   passwd --root </mnt/path> root

Fix VM disk corruption (fsck)

First, you'll need to mount the instance's disk. After doing so, you can simply run an fsck against it.

Some examples, in case the VM disk requires an external fsck:

method 1: guestfish

  • SSH to the hypervisor running the instance.
  • Get the hypervisor instance name (i-XXXXXXX)
  • Run the following commands:
root@cloudvirt10XX:~# guestfish -d i-00001de2

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
      'man' to read the manual
      'quit' to quit the shell
><fs> run
><fs> fsck ext4 /dev/sda3

You may need to run this on several different disks.

method 2: qemu-nbd

  • SSH to the hypervisor running the instance.
  • Get the nova instance UUID (for example bad6cee2-2748-4e14-9dcc-3ccdc28f279e)
  • Enable required kernel modules:
root@cloudvirt10XX:~# modprobe -r nbd
root@cloudvirt10XX:~# modprobe nbd max_part=16
  • Run the following commands:
root@cloudvirt10XX:~# qemu-nbd --connect=/dev/nbd0 /var/lib/nova/instances/bad6cee2-2748-4e14-9dcc-3ccdc28f279e/disk
root@cloudvirt10XX:~# fdisk -l /dev/nbd0
root@cloudvirt10XX:~# fsck.ext4 /dev/nbd0p3 
root@cloudvirt10XX:~# qemu-nbd --disconnect /dev/nbd0

You may need to run this on several different disks. If needed, several nbd devices can be used as long as they don't share a number (i.e., /dev/nbd0, /dev/nbd1, etc.).

If running this on a Jessie hypervisor against a Stretch VM disk, you may need a newer version of 'e2fsprogs'.

Also, if you want fsck to run without asking questions, pass additional arguments:

root@cloudvirt10XX:~# fsck.ext4 -fy /dev/nbd0p3

Trace a vnet device to an instance

VNET=<vnet-device>   # e.g. vnet0, the device you are tracing
for vm in $(virsh list | grep running | awk '{print $2}'); do
  virsh dumpxml $vm | grep -q "$VNET" && echo $vm
done

Get the live virsh config for an instance

virsh dumpxml <instance id>

Get a screenshot of the instance's "screen"

virsh screenshot <instance id>

Send a keypress to the instance's "keyboard"

virsh send-key <instance id> <keycode>

Where keycode is the Linux keycode. The most useful is "28", which is ENTER.

A list of keycodes can be fetched from http://libvirt.org/git/?p=libvirt.git;a=blob_plain;f=src/util/keymaps.csv