Portal:Cloud VPS/Admin/Troubleshooting


This page is for troubleshooting urgent issues. Routine maintenance tasks are documented on the VPS_Maintenance page.

Network node failure

The network node is either labnet1001 or labnet1002, running the nova-network service. Only one is active at a time; the other is an inactive failover. Before taking any of these steps, check site.pp in puppet to see which node is active -- role::nova::network is commented out on the standby node.
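A quick way to confirm which node is active (a sketch, assuming a local checkout of the operations/puppet repository):

 $ grep -n 'role::nova::network' manifests/site.pp
 # the occurrence that is commented out belongs to the standby node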

Symptoms

  • ssh connections to instances display an unexpected host-key warning
  • When the nova-network service is down, traffic bound for a labs instance can hit the network node directly. In that case ssh tries to log you in to the node itself rather than the instance behind it, which gets you a host-key (or user-key) failure.
  • All labs instances unreachable
  • Web services running on multiple instances fail at the same time

Treatments

  • Restart nova-network on the active network node (this works surprisingly often)
 service nova-network restart
  • Check iptables and try to figure out what's happening
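A starting point for the iptables check, run on the active network node (a sketch; the exact rule set depends on the current floating IP assignments):

 $ sudo iptables -t nat -L -n -v | head -50   # NAT rules managed by nova-network
 $ sudo ip addr show                          # confirm the gateway and floating IPs are present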

Fail-over

If the active network node is completely dead, you'll need to switch the network service to the backup node. Note that this switch-over WILL cause network downtime for labs, so outside of an emergency don't do this without scheduling a window in advance.

This switchover requires you to muck about in the nova database. At the moment, this database is hosted on m5-master, aka db1009. You can access the database like so:

 $ sudo su -
 # mysql nova
  • disable alerting for pretty much everything about labs hosts
  • stop puppet on both network nodes (old and new)
  • merge a puppet patch adding labs::openstack::nova::network to the new network host and removing it from the old network host, and updating hiera settings about the active network node.
  • change the network record (today, newhostname was 'labnet1002')
  • This is probably how new floating IPs know what to set their host as.
   MariaDB MISC m5 localhost nova > select * from networks\G
   # note network record id, in this case it is '2'
   MariaDB MISC m5 localhost nova > update networks set host = '<newhostname>' where id=2;
  • reassign floating IPs to the new network host (again, today newhostname was 'labnet1002')
  • This is how a given network node knows to set up natting for each floating ip
   MariaDB MISC m5 localhost nova > update floating_ips set host = '<newhostname>' where host = '<oldhostname>';
  • release 10.68.16.1 on the old network host
  • Shut down the active br01 interface (OpenStack will not migrate the IP while it is in use)
  • ifconfig br1102 down
  • Enable puppet, run puppet, (re)start nova-network on the new network host
   $ sudo service nova-network restart
  • Enable puppet, run puppet on the old network host
  • verify that the new network host has grabbed the gateway IP
  • ip addr show and verify the gateway IP has migrated
  • verify that floating IPs have moved over to the new host
   $ sudo iptables -t nat -L -n
  • change routing so that floating IPs are routed to the new host
  • On both cr1 and cr2 (note that the next-hop should reflect the active node -- this example shows labnet1002):
  • delete routing-options static route 208.80.155.128/25 next-hop 10.64.20.13
  • delete routing-options static route 10.68.16.0/21 next-hop 10.64.20.13
  • set routing-options static route 208.80.155.128/25 next-hop 10.64.20.25
  • set routing-options static route 10.68.16.0/21 next-hop 10.64.20.25
  • restart and then stop nova-network on the old node, just so it knows it's not responsible anymore
   $ sudo service nova-network restart
   $ sudo service nova-network stop
  • restart keystone on labcontrol1001. I don't know why, but it died while we did all this.
   $ sudo service keystone restart
  • Presuming the new network host includes the nova api, add the new network host to the keystone endpoints and remove the failed host
openstack endpoint create --region eqiad compute public http://labnet1001.eqiad.wmnet:8774/v2/\$\(tenant_id\)s
openstack endpoint create --region eqiad compute internal http://labnet1001.eqiad.wmnet:8774/v2/\$\(tenant_id\)s
openstack endpoint create --region eqiad compute admin http://labnet1001.eqiad.wmnet:8774/v2/\$\(tenant_id\)s
openstack endpoint delete <old endpoints>
  • re-enable alerting
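To find the IDs of the old endpoints referenced in the delete step above, list them first (a sketch; filtering on the compute service):

 openstack endpoint list | grep compute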

Instance DNS failure

Fail-over

There are two designate/pdns nodes: labservices1001 and labservices1002. The active node is determined in Hiera by a few settings:

labs_certmanager_hostname: <primary designate host, generally labservices1001>
labs_designate_hostname: <primary designate host>
labs_designate_hostname_secondary: <other designate host, generally labservices1002>

In order to switch to a new primary designate host, change the $labs_designate_hostname and $labs_certmanager_hostname settings. That's not enough, though! PowerDNS will reject DNS updates from the new server because it is not listed as the master, which will result in syslog messages like this:

    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
    Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for eqiad.wmflabs from 208.80.155.117 which is not a master

To change this, change the master in the pdns database:

$ ssh m5-master.eqiad.wmnet
$ sudo su -
# mysql pdns
MariaDB MISC m5 localhost pdns > select * from domains;
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    | id | name               | master              | last_check | type  | notified_serial | account        | designate_id  
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    |  1 | eqiad.wmflabs      | 208.80.155.117:5354 | 1448252102 | SLAVE |            NULL | noauth-project | 114f1333c2c144
    |  2 | 68.10.in-addr.arpa | 208.80.155.117:5354 | 1448252099 | SLAVE |            NULL | noauth-project | 8d114f3c815b46
    +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=1;
    MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=2;

Typically the dns server labs-ns2.wikimedia.org is associated with the primary designate server, and labs-ns3.wikimedia.org with the secondary. You will need to make appropriate hiera changes to modify those as well.
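Once the switch is done, a quick external sanity check (a sketch; the resolver names and zone are the ones mentioned above):

 $ dig +short @labs-ns2.wikimedia.org soa eqiad.wmflabs
 $ dig +short @labs-ns3.wikimedia.org soa eqiad.wmflabs
 # both should answer; the serials should converge once the next update is pushed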

Specific Instance Troubleshooting

Reset state of an instance

You might have to do this if the state nova reports for an instance doesn't correspond to reality (it says REBOOT or SHUTOFF when the instance isn't, or vice versa), or if nova isn't responding to any commands at all about a particular instance.

nova reset-state --active <uuid>

This changes the state of the instance with the given uuid to 'ACTIVE', and hopefully fixes things (or blows up a baby panda, unsure!)
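You can confirm the result afterwards (a sketch using the same nova client):

 nova show <uuid> | grep status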

Block Migration

Because we don't use shared storage for instance volumes, true live-migration is not available. Block migration works pretty well, though -- it causes a brief (minute or two) interruption to an instance but does not register as a reboot, and most running services should survive a block migration without any complaint.

This is useful for rebalancing when a compute node is overloaded, or for evacuating instances from a failing node.

On the nova controller (e.g. virt1000):

   source /root/novaenv.sh
   nova live-migration --block-migrate <instanceid> <targethost>

You can check the status of a migrating instance with 'nova show <instanceid>'. Its status will show as 'migrating' until the migration is complete.

NOTE: There is one serious bug in the block-migrate feature in Havana. The migrate process attempts to check quotas on the target node, but ignores overprovision ratios. That means that the nova scheduler will frequently fill a host to the point where it can no longer accept live migrations. Because of this bug it will probably be necessary to keep two empty compute nodes in order to support complete evacuation of any one node.
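Given that, it is worth eyeballing free capacity on the target host before migrating (a sketch; the exact field names depend on the installed nova client version):

 nova hypervisor-show <targethost> | grep -E 'free|running_vms'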

Recompress a live-migrated instance

In Nova icehouse (and possibly later versions) a block migrate removes the copy-on-write elements of the instance, causing it to take up vastly more space on the new host. The instance can be recompressed if you stop it first (at which point you might as well have used cold-migrate in the first place.) Here's an example of recompressing:

 andrew@labvirt1002:~$ sudo su -
 root@labvirt1002:~# cd /var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Make sure that the instance is STOPPED with 'nova stop'
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# mv disk disk.bak
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# qemu-img convert -f qcow2 -O qcow2 disk.bak disk
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# # Restart instance, make sure it is working.
 root@labvirt1002:/var/lib/nova/instances/c9030a35-4475-4581-a84c-1728d27bcf9b# rm disk.bak
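To confirm that the recompression actually saved space before deleting disk.bak, qemu-img info reports the real on-disk size (worth running between the convert and rm steps above):

 qemu-img info disk.bak | grep 'disk size'
 qemu-img info disk | grep 'disk size'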

Fixing an instance that won't reboot

Occasionally an instance may fail to reboot. You can usually solve this by rebooting via nova, but occasionally that fails as well. You can force a reboot by "destroying" the instance and then telling nova to reboot it, which causes nova to "create" the instance again. Of course, "destroy" and "create" here just mean killing and restarting the kvm process. You should not "delete" or "terminate" the instance.

To force reboot the instance, do the following:

  1. Figure out which host the instance is running on
  2. Destroy the instance (<instance-id> can be found via virsh list):
    virsh destroy <instance-id>
  3. If you see an error like the one below, restart the libvirt-bin process and then retry the destroy:
    Timed out during operation: cannot acquire state change lock
  4. Tell nova to reboot the instance via "reboot"
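Putting it together, a sketch of the whole sequence (the identifiers are illustrative; as noted above, the libvirt id comes from virsh list):

   nova show <instance-id> | grep 'OS-EXT-SRV-ATTR:host'   # find the hosting labvirt (needs admin credentials)
   # then, on that host:
   sudo virsh destroy <libvirt-instance-id>                # kills the kvm process only
   # and finally, from the nova controller:
   nova reboot <instance-id>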

Mounting an instance disk

This uses nbd and qemu-nbd, which is part of the qemu-utils package.

Make sure the nbd kernel module is loaded:

/sbin/lsmod | grep nbd

   nbd                    20480  0

/dev/nbd* should now be present
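If it is not listed, load it first (max_part is an optional module parameter controlling how many partition device nodes get created):

   sudo modprobe nbd max_part=16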

Mounting a flat file disk with qemu-nbd and accessing a relevant partition

1. Ensure the instance is not running. Otherwise, you may corrupt the disk

   nova stop <instance-id>

2. Change to the instance directory.

   Usually this is /var/lib/nova/instances/<instance-id> or /srv/<instance-id>

3. Connect the disk to the nbd device. Consider using the --read-only flag for read-only access.

   qemu-nbd [ --read-only] -c /dev/nbd0 disk

This will create an nbd process for accessing the disk:

   root     29725     1  0 14:00 ?        00:00:00 qemu-nbd -c /dev/nbd0 /srv/b784faf3-9de2-4c4e-9df8-c8e2925bfab9/disk

4. Inspect the disk for partitions. partx -l /dev/nbd0

    1:        34-     2048 (     2015 sectors,      1 MB)
    2:      2049-  1048576 (  1046528 sectors,    535 MB)
    3:   1048577- 40894430 ( 39845854 sectors,  20401 MB)
    4:  40894464- 41940991 (  1046528 sectors,    535 MB)

In this case, partition 3 is the root device.

5. Mount the required partition. mount /dev/nbd0p3 /mnt/instances/

   mount -l | grep nbd
   /dev/nbd0p3 on /mnt/instances type ext4 (ro)
   /dev/nbd<device>p<partition number> will directly mount a partition within a disk.


6. Inspect or modify contents of mount

7. Unmount the device. umount /mnt/instances

8. Detach the nbd device (this will terminate the nbd process). qemu-nbd -d /dev/nbd0

Other file mounting scenarios

  1. If the disk is an ext3/4 file:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9] <mountpoint>
  2. To attach only a certain partition to the nbd device:
    • qemu-nbd --partition=<partition-number> -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9] <mountpoint>
  3. If the disk is an LVM volume:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • vgscan
    • vgchange -ay
    • mount /dev/<volume-group>/<logical-volume> <mountpoint>
  4. If the disk is a new bootstrap_vz build:
    • qemu-nbd -c /dev/nbd[0-9] <disk>
    • mount /dev/nbd[0-9]p3 /tmp/mnt

When finished, you should unmount the disk, then disconnect the volume:

  1. If the disk is not an LVM volume:
    • umount <mountpoint>
    • qemu-nbd -d /dev/nbd[0-9]
  2. If the disk is an LVM volume:
    • umount <mountpoint>
    • vgchange -an <volume-group>
    • qemu-nbd -d /dev/nbd[0-9]

(Re)Setting root password on a mounted disk

   passwd --root </mnt/path> root

Running fsck on an instance's disk

First, connect the instance's disk with qemu-nbd as described above, leaving the partition unmounted. After doing so, you can simply run an fsck against it.
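A sketch, reusing the nbd device and the partition layout from the example above:

   qemu-nbd -c /dev/nbd0 disk     # attach the stopped instance's disk
   fsck -f /dev/nbd0p3            # check the root partition (partition 3 in the earlier example)
   qemu-nbd -d /dev/nbd0          # detach when done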

Trace a vnet device to an instance

VNET=<vnet-device>
for vm in $(virsh list | grep running | awk '{print $2}')
  do virsh dumpxml $vm|grep -q "$VNET" && echo $vm
done

Get the live virsh config for an instance

virsh dumpxml <instance id>

Get a screenshot of the instance's "screen"

virsh screenshot <instance id>

Send a keypress to the instance's "keyboard"

virsh send-key <instance id> <keycode>

Where keycode is the linux keycode. Most useful is "28" which is an ENTER.

A list of keycodes can be fetched from http://libvirt.org/git/?p=libvirt.git;a=blob_plain;f=src/util/keymaps.csv

Get a web-based console and root password

Nova can provide web-based console access to instances using spice-html5. These consoles use a short-lived token (10 minutes, generally) and are somewhat clumsy, but it is possible to log in and look around. Getting a console url and a password looks like this:

andrew@labcontrol1001:~$ sudo su -
root@labcontrol1001:~# source ~/novaenv.sh 
root@labcontrol1001:~# OS_TENANT_NAME=testlabs openstack server list
+--------------------------------------+------------------+--------+---------------------+
| ID                                   | Name             | Status | Networks            |
+--------------------------------------+------------------+--------+---------------------+
| f509c582-da1c-42c2-abfa-7d484d6ba552 | puppet-testing   | ACTIVE | public=10.68.21.124 |
| f1925627-7df2-49c8-98bd-1d9f7631eba3 | create-test-101  | ACTIVE | public=10.68.21.77  |
| c4bc63f8-cbd7-4384-b349-54b115e91a5c | util-abogott     | ACTIVE | public=10.68.21.108 |
| 482282c1-2c1d-4063-bd10-3d1babf9585d | relic-relics     | ACTIVE | public=10.68.17.109 |
+--------------------------------------+------------------+--------+---------------------+
root@labcontrol1001:~# openstack console url show --spice-html5 c4bc63f8-cbd7-4384-b349-54b115e91a5c
+-------+-----------------------------------------------------------------------------------------------+
| Field | Value                                                                                         |
+-------+-----------------------------------------------------------------------------------------------+
| type  | spice-html5                                                                                   |
| url   | https://labspice.wikimedia.org/spice_sec_auto.html?token=<token> |
+-------+-----------------------------------------------------------------------------------------------+
root@labcontrol1001:~# cat /var/local/labs-root-passwords/testlabs
<password>