Portal:Cloud VPS/Admin/Maintenance

This page is for routine maintenance tasks performed by Cloud Services operators. For troubleshooting immediate issues, see the Labs_troubleshooting page.

Building new images

Wikimedia Cloud VPS provides images for Ubuntu and Debian distributions. The images are generated using different tools; the sections below give step-by-step instructions for creating new ones.

Building Ubuntu images

We use vmbuilder to build our custom Ubuntu images. The vmbuilder configuration is in puppet in the labs-vmbuilder module. It can be added to a node using role::labs::vmbuilder. Here's a set of steps to build and import the images:

On vmbuilder-trusty.openstack.eqiad.wmflabs:

puppet agent -tv
cd /srv/vmbuilder
rm -Rf ubuntu-trusty
vmbuilder kvm ubuntu -c /etc/vmbuilder.cfg -d /srv/vmbuilder/ubuntu-trusty -t /srv/vmbuilder/tmp --part=/etc/vmbuilder/files/vmbuilder.partition

Note the name of the tmp file generated; for instance: "Converting /tmp/tmpD0yIQa to qcow2, format /mnt/vmbuilder/ubuntu-trusty/tmpD0yIQa.qcow2"
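Rather than reading the tmp file name off the screen, the path can be pulled out of saved build output. The following is a minimal sketch, assuming the output contains a line shaped like the example above; the function name and the build.log file name are hypothetical and not part of the puppetized tooling.

```shell
# Print the qcow2 path from vmbuilder's
# "Converting <tmp> to qcow2, format <path>.qcow2" line.
# Reads build output on stdin and prints the last matching path.
extract_qcow2_path() {
    sed -n 's/^.*Converting .* to qcow2, format \(.*\.qcow2\).*$/\1/p' | tail -n 1
}

# Hypothetical usage (build.log is an assumed file name):
#   vmbuilder kvm ubuntu <args as above> 2>&1 | tee build.log
#   extract_qcow2_path < build.log
```

If no matching line is present the function prints nothing, so an empty result means the build output should be checked by hand.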

Building a Debian image

We build Debian images using bootstrap-vz. The bootstrap-vz config is puppetized in the class labs_bootstrapvz. On Jessie we use a custom build of the bootstrap-vz package, documented below.

To build a Debian Jessie image, log in to labs-bootstrapvz-jessie:

sudo su -
cd /target # This is where the image will end up when we finish
rm -f *.raw *.qcow2 # Make space for our new build
bootstrap-vz /etc/bootstrap-vz/manifests/labs-jessie.manifest.yaml
qemu-img convert -f raw -O qcow2 ./debian-jessie.raw ./debian-jessie.qcow2
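Before copying the converted file anywhere, it can be worth confirming that it really is a qcow2 file. This is only a rough sanity check, sketched here under the assumption that head and od are available on the build host; the function name is hypothetical. It compares the first four bytes against the qcow2 magic ('Q', 'F', 'I', 0xfb) and says nothing about whether the image boots.

```shell
# Return 0 if the file named by $1 starts with the qcow2 magic bytes
# 'Q' 'F' 'I' 0xfb, and non-zero otherwise.
is_qcow2() {
    # Dump the first four bytes as hex and strip od's spacing.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "514649fb" ]
}

# Hypothetical usage:
#   is_qcow2 ./debian-jessie.qcow2 && echo "looks like qcow2"
```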

You can boot the image locally for testing, like this:

qemu-system-x86_64 ./<new image name>.raw --curses

Unfortunately, qemu's default behavior is to suppress all boot logs, so you'll be looking at a mostly-blank screen for several minutes before getting a login prompt with no working password. Turning on a working login account for test purposes is left as an exercise to the reader -- bootstrap-vz creates one by default (login: root passwd:test) but our config wisely disables it.

Bootstrap-vz uses source files from /etc/bootstrap-vz. These files are puppetized, so you'll want to disable puppet if you change them.

Bootstrap-vz also uses several source files that are standard local config files on the build host. For a complete list of these files, look at the 'file_copy:' section in /etc/bootstrap-vz/manifests/labs-jessie.manifest.yaml

The build process for Stretch is similar; the build system is labs-bootstrapvz-stretch.openstack.eqiad.wmflabs and the manifest to use is named labs-stretch.manifest.yaml.

Build the bootstrap-vz package

Andrew built python-bootstrap-vz_0.9wmf-1_all.deb using stdeb. It was built from the 'development' branch on 2014-12-22 with commit 255f0624b49dbcf6cacccd3b2f1fa7c0cc2bcc8d and the patch below. To reproduce:

diff --git a/setup.py b/setup.py
index f7b97ac..349cfdc 100644
--- a/setup.py
+++ b/setup.py
@@ -22,11 +22,8 @@ setup(name='bootstrap-vz',
       install_requires=['termcolor >= 1.1.0',
                         'fysom >= 1.0.15',
                         'jsonschema >= 2.3.0',
-                        'pyyaml >= 3.10',
                         'boto >= 2.14.0',
                         'docopt >= 0.6.1',
-                        'pyrfc3339 >= 1.0',
-                        'requests >= 2.9.1',
       license='Apache License, Version 2.0',
       description='Bootstrap Debian images for virtualized environments',

  • Alter the version tag in bootstrapvz/__init__.py as needed
  • Install python-stdeb
  • python setup.py --command-packages=stdeb.command bdist_deb
  • ls deb_dist/*.deb

As of 2017-05-06, the .deb provided by the upstream Debian Stretch repo (bootstrap-vz 0.9.10+20170110git-1) seems to work properly on Stretch without a custom build or any additional modifications.

Installing the images

On labcontrol1001:

First, get the new .qcow2 images into /tmp on labcontrol1001.wikimedia.org. One way is to rsync them from the builder VM to your laptop using your labs/labs root key, and then rsync them to labcontrol1001 using your prod key. For example:

rsync -e 'ssh -i ~/.ssh/id_rsa_labs_private' root@vmbuilder-trusty.openstack.eqiad.wmflabs:/srv/vmbuilder/ubuntu-trusty/tmpUUpMh4.qcow2 .
rsync -e 'ssh -i ~/.ssh/id_rsa_prod_new' tmpUUpMh4.qcow2 madhuvishy@labcontrol1001.wikimedia.org:/tmp/

Then, also on labcontrol1001...

sudo su -
source ~/novaenv.sh 
cd /tmp
openstack image create --file ubuntu-trusty.qcow2 --disk-format "qcow2" --container-format "ovf" --public "ubuntu-14.04-trusty (testing)"
openstack image create --file debian-jessie.qcow2 --disk-format "qcow2" --container-format "ovf" --public "debian-8.1-jessie (testing)"
# Test the images by booting instances in labs; if they don't work, delete
# the instances, then delete the images (using glance delete), then
# restart the process 
glance index
# find image ids
glance image-update --name "ubuntu-14.04-trusty (deprecated <date>)" <old-trusty-id> --purge-props
glance image-update --name "ubuntu-14.04-trusty" <new-trusty-id> --property show=true --property default=true
glance image-update --name "ubuntu-12.04-precise (deprecated <date>)" <old-precise-id> --purge-props
glance image-update --name "ubuntu-12.04-precise" <new-precise-id> --property show=true

Note the properties used in the glance image-update commands above. If default=true, the image will be the default selection in the instance creation interface; purging properties removes the 'default' state.

We used to use the 'show=true' property to ensure that an image appeared on wikitech. Now we use the image state instead: only images with state=active appear in the GUI (both on wikitech and in Horizon). To deactivate your obsolete image:

source /etc/novaenv.sh 
openstack token issue
curl -X POST -H 'X-Auth-Token: <token id>' http://labcontrol1001.wikimedia.org:9292/v2/images/<image id>/actions/deactivate

To reactivate an image (because it was deactivated in error, or in order to permit a migration):

source /etc/novaenv.sh 
openstack token issue
curl -X POST -H 'X-Auth-Token: <token id>' http://labcontrol1001.wikimedia.org:9292/v2/images/<image id>/actions/reactivate
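The deactivate and reactivate calls differ only in the last path component, so they can be wrapped in one helper. The sketch below is illustrative rather than an installed tool: the function names and the GLANCE_URL variable are hypothetical, and it assumes that novaenv.sh has already been sourced and that 'openstack token issue -f value -c id' prints just the token id.

```shell
# Base URL of the Glance API, taken from the curl commands above.
GLANCE_URL="http://labcontrol1001.wikimedia.org:9292"

# Build the v2 action URL. $1 = image id, $2 = deactivate or reactivate.
image_action_url() {
    case "$2" in
        deactivate|reactivate) ;;
        *) echo "unknown action: $2" >&2; return 1 ;;
    esac
    printf '%s/v2/images/%s/actions/%s\n' "$GLANCE_URL" "$1" "$2"
}

# Issue a token and POST the action to Glance.
image_action() {
    url=$(image_action_url "$1" "$2") || return 1
    token=$(openstack token issue -f value -c id) || return 1
    curl -sS -X POST -H "X-Auth-Token: $token" "$url"
}

# Hypothetical usage:
#   image_action <image id> deactivate
```

Validating the action name before issuing a token keeps a typo from turning into a confusing HTTP error.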

Deleting a project

Project deletion tends to leave orphaned resources lying about. Eventually this should all be handled by Designate hooks, but until then:

  1. Make sure there are no instances in the project. This can be done in Horizon or via the commandline:
    $ OS_TENANT_NAME=<project> openstack server list
    $ OS_TENANT_NAME=<project> openstack server delete <instance id>
  2. Make sure there are no dns zones allocated to the project. This can be done in Horizon or via the commandline:
    $ OS_TENANT_NAME=<project> openstack zone list
    $ OS_TENANT_NAME=<project> openstack zone delete <zone id>
  3. Delete any proxies the project may have via Horizon
  4. Finally, delete the project using Manage Projects -> Delete on Wikitech. That should clean up any related project-specific LDAP records.
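The first two checks can be combined into a quick pre-flight test before deleting anything. This is a sketch only: the function names are hypothetical, it assumes novaenv.sh has been sourced, and it assumes that 'openstack ... list -f value' prints one row per line. The tenant name to query is passed as the argument.

```shell
# Count the non-empty lines on stdin.
count_nonempty() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Report how many instances and DNS zones a tenant still owns.
# $1 = tenant name. Returns 0 only when both counts are zero.
project_leftovers() {
    servers=$(OS_TENANT_NAME="$1" openstack server list -f value | count_nonempty)
    zones=$(OS_TENANT_NAME="$1" openstack zone list -f value | count_nonempty)
    echo "servers=$servers zones=$zones"
    [ "$servers" -eq 0 ] && [ "$zones" -eq 0 ]
}

# Hypothetical usage:
#   project_leftovers <project> && echo "nothing left to clean up"
```

Proxies are not covered here, since the list above handles them through Horizon.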

Labvirt reboot checklist

  1. Notify users on cloud-announce -- one week in advance if possible
  2. 'schedule downtime for this host and all services' in icinga
  3. 'schedule downtime for this host and all services' for checker.tools.wmflabs.org in icinga
  4. If VMs will be affected:
    1. Collect a list of nodes and their current state on the labvirt in question: 'nova list --all-tenants --host <hostname>'
    2. Disable puppet and stop the 'shinken' service on shinken-01.eqiad.wmflabs -- this isn't subtle but will keep alerts to a minimum
    3. depool all affected tool exec nodes
    4. failover tools nodes as needed https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Failover
    5. failover nova-proxy as needed
    6. Kubernetes nodes should generally be fine as long as only one labvirt is rebooted at a time
  5. Reboot host
  6. Wait for host to reboot, verify ssh access still works
  7. If VMs were affected
    1. once Nova has caught up to the change, all hosted VMs should switch to an 'off' state -- wait until that happens so we know that Nova is up to speed (again, check this with 'nova list --all-tenants --host <hostname>')
    2. refer back to your list of nodes and pre-reboot state from earlier; restart all VMs that were previously running, waiting 5-10 seconds after each restart to avoid flooding the Nova control plane (probably by making a script out of the output from 'nova list')
    3. repool all affected exec nodes
    4. Re-enable puppet on shinken-01.eqiad.wmflabs (puppet will restart shinken.)
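The restart step can be scripted from the 'nova list' output saved before the reboot. The sketch below assumes that output is the usual table with the instance ID in the first column and a status column containing ACTIVE for running VMs; the function names and the saved file name are hypothetical. It prints commands instead of running them, so the result can be reviewed first.

```shell
# Print the IDs of ACTIVE instances from saved 'nova list' table
# output on stdin. Header and border rows are skipped because their
# first column is not a 36-character UUID.
active_ids() {
    awk -F'|' '
        {
            id = $2; gsub(/ /, "", id)
            active = 0
            for (i = 3; i <= NF; i++) {
                f = $i; gsub(/^ +| +$/, "", f)
                if (f == "ACTIVE") active = 1
            }
            if (active && length(id) == 36) print id
        }'
}

# Emit one 'nova start' per previously running instance, with a pause
# after each one to avoid flooding the Nova control plane.
restart_commands() {
    active_ids | while read -r id; do
        echo "nova start $id"
        echo "sleep 10"
    done
}

# Hypothetical usage (pre-reboot.txt is an assumed file name):
#   nova list --all-tenants --host <hostname> > pre-reboot.txt
#   restart_commands < pre-reboot.txt > restart.sh   # review, then: sh restart.sh
```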

Openstack Upgrade test plan

Upgrading OpenStack mostly consists of updating config files, changing openstack::version in hiera and then running puppet a bunch of times. In theory each individual OpenStack service is compatible with the n+1 and n-1 versions, so the components don't have to be upgraded in a particular order.

That said, we have a test cluster, so it's best to run a test upgrade there before rolling things out for prod. Here are things to test:

  • Keystone/Ldap
    • Openstack service list
    • Openstack endpoint list
    • Create new account via wikitech
      • Set up 2fa for new account
      • Verify new user can log in on wikitech
    • Create new project via wikitech
      • Set keystone cmdline auth to new user
      • Verify new user has no access to new project
    • Keystone commandline roles
      • Assign a role to the new account
      • Remove role from new account
    • Wikitech project management
      • Add new user to a project
      • Promote user to projectadmin
      • Verify new user can log in on Horizon
      • Verify new user can view instance page for new project
      • Demote user to normal project member
  • Nova
    • Instance creation
      • verify dns entry created
      • Verify ldap record created
      • ssh access
      • check puppet run output
    • Assignment/Removal of floating IPs
    • Security groups
      • Remove instance from ssh security group, verify ssh is blocked
      • Replace instance in ssh security group, verify ssh works again
      • Add/remove source group and verify that networking between existing and new instances in the same project changes accordingly
    • Instance deletion
      • Verify dns entry cleaned up
      • Verify ldap record removed
  • Glance
    • Openstack image list
    • Create new image
    • Test instance creation with new image
  • Designate
    • Assign/remove dns entry from Horizon
  • Dynamic Proxy
    • Create/delete proxy

Maintenance scripts


Puppet installs novaenv.sh on the openstack controller. To run nova and glance shell commands without having to add a thousand args to the commandline, source it as root:

$ source /root/novaenv.sh

Or, from a non-root account with sudo access:

$ source <(sudo cat /root/novaenv.sh)


The cold-migrate tool will shut down an instance, copy it to the specified target host, and boot it on the new host.

$ nova list --all-tenants --host <source>
$ /root/cold-migrate <args> 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 <destination>

Puppet installs cold-migrate.sh in /root on the nova controller. This can take quite a while, so run this in a 'screen' session.


The imagestats script can be run periodically to list which images are currently in use. It can also answer the question 'what instances use image xxx?' As obsolete images are abandoned, they can be deleted from glance to save disk space.

Puppet installs imagestats in /root/novastats on the nova controller.


Novastats.py is a simple python library which (among other things) creates a dictionary of instance data. It's useful for writing simple one-off scripts during cluster management.

Puppet installs novastats.py in /root/novastats.py on the nova controller. You'll need to source novaenv.sh before using any of its functions.

Novastats /should/ use python openstack libraries to talk to nova, but it doesn't. Rather, it uses the openstack commandline tools.