Portal:Cloud VPS/Admin/VM images

Cloud VPS uses special VM images that contain all the customizations required for our environment.

Builders

Images for Debian Buster (and subsequent releases) are based on official Debian upstream builds. These are built on a cloudcontrol node.

Because upstream Stretch bundles an old version of cloud-init, we build custom base images (with a modern cloud-init) for Stretch. Those are built with bootstrap-vz on cloud-bootstrapvz-stretch.openstack.eqiad.wmflabs.

Building Pre-puppetized Images with wmcs-image-create

Our cloud should work with any upstream Debian base image that includes a cloud-init version later than 18.4. The Debian project provides ready-made images prepared for use with OpenStack.

For faster startup and better storage performance, we typically use a pre-puppetized snapshot based on an upstream image. The wmcs-image-create script automatically generates and installs a pre-puppetized image. The script is invoked with a URL pointing to the upstream image to use as a starting point. It boots a temporary VM using that image, puppetizes that VM, and creates a new image based on the puppetized VM.

root@cloudcontrol1003:~# wmcs-image-create --image-url https://cloud.debian.org/images/cloud/bullseye/20221219-1234/debian-11-genericcloud-amd64-20221219-1234.tar.xz --new-image-name "debian-11 example build" --project-owner "testlabs"
INFO:wmcs-image-create:Downloading upstream image...
INFO:wmcs-image-create:Running command:
    ('wget', 'https://cloud.debian.org/images/cloud/bullseye/20221219-1234/debian-11-genericcloud-amd64-20221219-1234.tar.xz', '-O', PosixPath('/usr/local/sbin/wmcs-image-createvlj09kf2/upstreamimage'))
    options: {}
--2023-12-11 21:27:44--  https://cloud.debian.org/images/cloud/bullseye/20221219-1234/debian-11-genericcloud-amd64-20221219-1234.tar.xz
Resolving webproxy (webproxy)... 2620:0:861:3:208:80:154:74, 208.80.154.74
Connecting to webproxy (webproxy)|2620:0:861:3:208:80:154:74|:8080... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://gemmei.ftp.acc.umu.se/images/cloud/bullseye/20221219-1234/debian-11-genericcloud-amd64-20221219-1234.tar.xz [following]
--2023-12-11 21:27:45--  https://gemmei.ftp.acc.umu.se/images/cloud/bullseye/20221219-1234/debian-11-genericcloud-amd64-20221219-1234.tar.xz
Connecting to webproxy (webproxy)|2620:0:861:3:208:80:154:74|:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 159730492 (152M) [application/x-xz]
Saving to: ‘/usr/local/sbin/wmcs-image-createvlj09kf2/upstreamimage’

/usr/local/sbin/wmcs-image-createikq36 100%[===========================================================================>]   2.00G  64.8MB/s    in 33s     

2021-03-25 19:55:16 (61.5 MB/s) - ‘/usr/local/sbin/wmcs-image-createikq36pde/upstreamimage’ saved [2147483648/2147483648]
Loading the upstream image into glance
Launching a vm with the new image
Created temporary VM dc7ad6ed-b7d0-4631-b034-630b66ea0450
Sleeping 5 minutes while we wait for the VM to start up...
Waiting one minutes for VM to puppetize
Stopping the VM
Taking a snapshot of the stopped instance
Creating snapshot ec347068-949f-42c9-9ded-000c165b3b50
Waiting for snapshot to finish saving...
Grabbing handle to snapshot data
Downloading snapshot to /usr/local/sbin/wmcs-image-createikq36pde/snapshot.img
Making snapshot file sparse
Creating final image 15cd26bc-15a5-4b8e-b0bb-cc1bd6c5ca24
Setting image ownership and visibility
Cleaning up intermediate VM
Cleaning up VM snapshot
Cleaning up upstream image
Finished creating new image: 15cd26bc-15a5-4b8e-b0bb-cc1bd6c5ca24

The total run should take 10-15 minutes. Once the new image has been tested, it needs to be renamed and marked public so that it's available to all projects:

# note IDs of outgoing and incoming image
root@cloudcontrol1003:~#  openstack image list
# Rename and deactivate the former image
root@cloudcontrol1003:~#  openstack image set --name "debian-12.0-bookworm (deprecated yyyy-mm-dd)" --deactivate <former active image id>
# Publish the new image
root@cloudcontrol1003:~#  openstack image set --name "debian-12.0-bookworm" --public <new image build id>
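
The "(deprecated yyyy-mm-dd)" suffix in the rename step above can be generated rather than typed by hand. The following is a minimal sketch, not part of any existing tooling: deprecated_name is a hypothetical helper, and using the UTC date is an assumption.

```shell
# Hypothetical helper: build the "<name> (deprecated yyyy-mm-dd)" string.
# $1 = current image name
# $2 = optional date; defaults to today's date in UTC (an assumption)
deprecated_name() {
    name=$1
    stamp=${2:-$(date -u +%Y-%m-%d)}
    printf '%s (deprecated %s)\n' "$name" "$stamp"
}

# Possible use with the rename command above (the ID is a placeholder):
# openstack image set --name "$(deprecated_name debian-12.0-bookworm)" --deactivate <former active image id>
```

Passing the date explicitly as the second argument keeps the name reproducible if the rename has to be repeated on a later day.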

Legacy Build Process for Debian Stretch

Building with bootstrap-vz

  • Log in to cloud-bootstrapvz-stretch.openstack.eqiad.wmflabs
  • Build and convert image
sudo su -
cd /target
bootstrap-vz --pause-on-error  /etc/bootstrap-vz/manifests/labs-stretch.manifest.yaml
qemu-img convert -f raw -O qcow2 debian-stretch-amd64-$(date "+%Y%m%d").raw debian-stretch-amd64-$(date "+%Y%m%d").qcow2
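
The commands above evaluate $(date ...) separately for each file name, so a build that runs past midnight would end up referring to two different names. One possible way to avoid this, sketched here with hypothetical variable names, is to compute the date stamp once and reuse it:

```shell
# Compute the date stamp once so every later command refers to the same file,
# even if the build or a later transfer runs past midnight.
stamp=$(date "+%Y%m%d")
raw="debian-stretch-amd64-${stamp}.raw"
qcow2="debian-stretch-amd64-${stamp}.qcow2"

# The conversion from the step above, unchanged apart from the variables:
# qemu-img convert -f raw -O qcow2 "$raw" "$qcow2"
echo "raw image:   $raw"
echo "qcow2 image: $qcow2"
```

The same variables can then be reused in any later command on that host that names the image file.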

Local bootstrap-vz for stretch

The version of bootstrap-vz built for Stretch doesn't support setting mount options in the manifest. Since we want 'discard' set, the local install of bootstrap-vz on cloud-bootstrapvz-stretch.openstack.eqiad1.wikimedia.cloud carries a local patch that adds this option in bootstrapvz/common/tasks/filesystem.py:

190c190
<             mount_opts = ['defaults']
---
>             mount_opts = ['discard', 'defaults']

How To Test

You can boot an image locally for testing, like this:

sudo qemu-system-x86_64 -nographic -serial mon:stdio -enable-kvm image_name.raw

If the command above does not work, you can try the following command instead (beware that boot logs will be suppressed):

qemu-system-x86_64 image_name.raw --curses

Having a working login account for test purposes is left as an exercise to the reader. bootstrap-vz creates one by default (login:root / passwd:test) but our config wisely disables it.

How To Deploy

Images are deployed to the OpenStack Glance service.

  • Copy the .qcow2 file to /tmp on the cloudcontrol1003.wikimedia.org server

Since the file has to cross the Cloud VPS / Production boundary, you can copy it from the builder server to your laptop (using your Cloud Services root key) and then from your laptop to cloudcontrol1003 (using your production key):

rsync --progress -v -e ssh root@cloud-bootstrapvz-stretch.openstack.eqiad.wmflabs:/target/debian-stretch-amd64-$(date "+%Y%m%d").qcow2 .
rsync --progress -v -e ssh debian-stretch-amd64-$(date "+%Y%m%d").qcow2 cloudcontrol1003.wikimedia.org:/tmp/

Alternatively, you can open a temporary HTTP server to make this transfer:

cloud-bootstrapvz-stretch:~ $ cd /target
cloud-bootstrapvz-stretch:~ $ python3 -m http.server 80

cloudcontrol1003:/tmp$ wget http://185.15.56.45/debian-stretch-amd64-$(date "+%Y%m%d").qcow2
  • Log in to cloudcontrol1003.wikimedia.org
  • Convert the image from qcow2 to raw/sparse. This is a new step necessary now that images are stored using ceph/rbd:
qemu-img convert -f qcow2 -O raw  ./<filename>.qcow2 ./<filename>.raw.notsparse
cp --sparse=always ./<filename>.raw.notsparse  ./<filename>.raw
  • Create new image in Glance:
Only rename the existing images AFTER you upload the new images. We have monitoring in place that depends on the exact image names. Remember to update it as well (and wait for Puppet to run and update fullstackd).
sudo su -
source ~/novaenv.sh 
cd /tmp
openstack image create --file debian-stretch-amd64-$(date "+%Y%m%d").raw --disk-format "raw" --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi --container-format "ovf" --public "debian-9.6-stretch"
  • Test new image by booting a new VM with it (if the image is faulty, remember to delete the test VM and the faulty image)
  • Update fullstackd to use this new image (see T218314 for an example).
  • Get a list of existing images
openstack image list
  • Append "deprecated" to old images and remove properties (only if new image is working as expected)
openstack image set --name "debian-9.5-stretch (deprecated <date>)" <old-image-id>
nova image-meta <old-image-id> delete default
nova image-meta <old-image-id> delete show

Passing --purge-props to openstack image set should be enough to clear all properties but it's currently not available in our OpenStack version. The nova image-meta commands serve the same purpose but you have to delete each property individually. This should be reviewed when OpenStack is upgraded.

  • Make the new image the default for new instances
openstack image set --name "debian-9.6-stretch" <new-image-id> --property show=true --property default=true

Note the use of properties in the openstack image set commands above. If default=true, the image will be the default selection in the instance creation interface. Purging properties removes the 'default' state.
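
For the qcow2-to-raw conversion step above, one way to confirm that cp --sparse=always actually produced a sparse file is to compare the file's apparent size with the space it occupies on disk. The following is a minimal sketch assuming GNU coreutils; size_report is a hypothetical helper name.

```shell
# Hypothetical helper: print "<apparent bytes> <allocated bytes>" for a file.
# Requires GNU du. A sparse file allocates fewer bytes on disk than its
# apparent size; a fully allocated file reports roughly equal numbers.
size_report() {
    apparent=$(du --apparent-size --block-size=1 "$1" | cut -f1)
    allocated=$(du --block-size=1 "$1" | cut -f1)
    echo "$apparent $allocated"
}

# Possible use after the conversion step (file names are placeholders):
# size_report ./<filename>.raw.notsparse
# size_report ./<filename>.raw
```

Both files should report the same apparent size. How much smaller the allocated size of the sparse copy is depends on the image contents and on the filesystem.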

Deleting Unused Images

WARNING! You should never delete old images from Glance as long as there are VMs based on those base images.

Don't delete Glance images as long as there are VMs running them, for two reasons:

  • Resizing a VM requires access to the base image it was launched with. See how resizing instances works to understand it.
  • We should always be able to identify the OS an instance is running based on the image name.

You can locate unused images with the wmcs-imageusage script:

root@cloudcontrol1003:~# wmcs-imageusage 
 -- unknown image 249c601a-42c8-4ba7-ba7a-878fba4ef799
 -- unknown image 68e61634-d599-4c6b-94da-10e6c7d36573
ea564b1c-9e5d-45a9-9d53-2fe023344e3f: debian-9.13-stretch (deprecated 2021-01-22), 0
a64f590c-3ed7-43e1-a592-b44d86f10641: debian-10.0-buster (deprecated 2019-07-30), 0
86105151-0d2e-498d-95b3-da712f19c7e2: debian-9.0-stretch (deprecated 2017-08-03), 0
559facb6-532f-47b0-9fa2-f0c0207c00c9: debian-9.13-stretch, 1
baa92f56-8aca-4855-a0f6-47d94e8a2167: debian-9.11-stretch (deprecated 2019-12-18), 1
6255ff81-db78-4521-b770-076fe16d365f: debian-9.11-stretch (deprecated 2019-12-15), 1
d5d88ba0-0def-4cfb-8c7c-cae8946722d8: debian-8.0-jessie (deprecated 2015-06-13), 1
160313be-b1b1-4b62-8906-bee26f25f6dc: debian-9.13-stretch (deprecated 2021-03-11), 2
f345f03d-a601-48cf-951b-ad1b4ad8f551: debian-9.13-stretch (deprecated 2021-01-20), 2
5cd88504-5e2b-4f63-b18e-2ec935c914dd: debian-10.0-buster-prerelease, 2
b7c4fc02-433f-4b49-97c2-b2492012b742: debian-8.2-jessie (deprecated 2016-02-16), 2
d633c111-00d1-4e72-bb3b-c2bd60518f4a: debian-10.0-buster, 3
fc6fb78b-4515-4dcc-8254-591b9fe01762: debian-10.0-buster (deprecated 2019-12-18), 3
e27bab24-4a17-47e5-bb89-b0c7526dea20: debian-9.2-stretch (deprecated 2017-12-13), 3
b7616101-3b9e-4948-98c7-25582234e788: debian-9.0-stretch (deprecated 2017-09-27), 3
0e33dcf3-37fe-416c-bc00-3abd84dda054: debian-9.0-stretch (deprecated 2017-07-19), 3
ad7bee1a-a890-4fed-b851-6c02138683b0: debian-9.13-stretch (deprecated 2021-03-01), 4
374ca3c6-6c4b-4e03-9f31-52e0d44aae0c: debian-10.0-buster (deprecated 2019-07-29), 4
d620d77c-c023-41ae-944c-2f10063bfc77: debian-9.6-stretch (deprecated 2019-02-14), 4
bb37bd5c-3cc5-4ee6-82f8-473e9568ff44: debian-9.3-stretch (deprecated 2018-01-10), 4
13b0883f-2ce6-467a-ba18-17484413faaa: debian-9.8-stretch (deprecated 2019-04-02), 7
e02770ae-b45f-4776-a852-d9a13217611e: debian-9.1-stretch (deprecated  2017-11-16), 7
325ce1c8-2f95-4e98-8088-f7b46e7a6bb5: debian-9.9-stretch (deprecated 2019-09-21), 8
e3716d55-5278-4f9b-a30f-41db9cb23ef8: debian-9.7-stretch (deprecated 2019-03-14), 10
b7274e93-30f4-4567-88aa-46223c59107e: debian-9.4-stretch (deprecated 2018-08-01), 12
73ccc348-f69d-4cd5-9f09-d83ea222fa02: debian-9.11-stretch (deprecated 2019-11-05), 13
64351116-a53e-4a62-8866-5f0058d89c2b: debian-10.0-buster (deprecated 2021-03-01), 18
6e6d743a-64f0-4cd3-b1b9-327bbf57e03b: debian-9.3-stretch (deprecated 2018-06-05), 19
25eeb420-1977-45f2-bbd9-fb65f48c0947: debian-9.11-stretch (deprecated 2020-10-17), 21
10783e59-b30e-4426-b509-2fbef7d3103c: debian-9.8-stretch (deprecated 2019-07-25), 30
e971dc0f-3b5c-4cd2-ab8b-02faf403c136: debian-10.0-buster (deprecated 2021-03-24), 46
c6273cce-9b8b-4364-9f1f-7bf58436994f: debian-9.5-stretch (deprecated 2018-11-22), 49
b6b58ba2-8656-49b4-af13-d0530ac05365: debian-10.0-buster (deprecated 2019-12-15), 71
7c6371d1-8411-48c7-bf73-2ef6d6ff2a15: debian-9.6-stretch (deprecated 2019-01-22), 83
6b67c8a1-6356-464d-a885-0576d7263e51: debian-10.0-buster (deprecated 2021-02-22), 92
031d2d76-8368-4066-a502-d28107d0195e: debian-10.0-buster (deprecated 2020-10-16), 227

Images are cheap, though, so image cleanup is generally unwarranted. Typically we only clean up images that were explicitly single-use (e.g. created to troubleshoot the image creation logic) or images for distributions that we no longer support and which no longer exist within our cloud (e.g. Debian Jessie).
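
When a cleanup is warranted, the zero-usage candidates can be filtered out of the wmcs-imageusage output shown above. This is a minimal sketch that assumes the output keeps the "<id>: <name>, <count>" shape of the sample and that the trailing number is the usage count; unused_images is a hypothetical helper, not an existing script.

```shell
# Hypothetical filter: read wmcs-imageusage output on stdin and print
# "<id>: <name>" for every image whose trailing count is 0.
# Lines such as " -- unknown image ..." are skipped because they do not
# match the "<id>: <name>, <count>" shape.
unused_images() {
    awk -F', ' '/^[0-9a-f-]+: / && $NF == 0 { sub(/, [0-9]+$/, ""); print }'
}

# Possible use:
# wmcs-imageusage | unused_images
```

The output is only a list of candidates. Each image should still be checked against the rules above before it is deleted.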

Deleting an image from glance is a single command:

root@cloudcontrol1003:~# openstack image delete <imageid>

If this fails with a warning that the image is used by VMs, stop now! Often, though, this command will fail due to out-of-band snapshots (e.g. those created by the image backup service.)

root@cloudcontrol1003:~# openstack image delete 11d52ebe-e4a0-45af-99ea-8f286ed55696
Failed to delete image with name or ID '11d52ebe-e4a0-45af-99ea-8f286ed55696': HTTP 409 Conflict: The image cannot be deleted because it has snapshot(s).
Failed to delete 1 of 1 images.

To delete all snapshots for a given image, use 'rbd snap purge':

root@cloudcontrol1003:~# rbd snap purge eqiad1-glance-images/11d52ebe-e4a0-45af-99ea-8f286ed55696
Removing all snapshots: 100% complete...done.

Once snapshots are removed the image can be removed with 'openstack image delete'.
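
Because purging snapshots is only appropriate for the snapshot conflict, and never when the image is still in use by VMs, a script that automates deletion may want to check which error it received before running 'rbd snap purge'. A minimal sketch, assuming the error text matches the sample output above; is_snapshot_conflict is a hypothetical helper.

```shell
# Hypothetical guard: succeed only when the error text from
# 'openstack image delete' is the snapshot conflict shown above.
# Any other failure (for example, the image being in use) returns non-zero.
is_snapshot_conflict() {
    printf '%s\n' "$1" | grep -qF 'because it has snapshot(s)'
}

# Possible use (the image ID is a placeholder):
# err=$(openstack image delete <imageid> 2>&1) || {
#     is_snapshot_conflict "$err" || { echo "refusing to purge: $err" >&2; exit 1; }
# }
```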

Restricted images

It's possible to restrict an image to a single project or set of projects. This is useful when e.g. gradually deprecating use of an OS.

First, mark the image as "shared" and restrict it to the desired project:

$ ssh cloudcontrol1004.wikimedia.org
$ sudo wmcs-openstack image set --property visibility=shared --project $PROJECT_ID $IMAGE_UUID
$ sudo wmcs-openstack image set --activate $IMAGE_UUID

Then, allow the 'observer' user to access it so that OpenStack browser can display something more useful than "UNKNOWN" for instances created from it:

$ sudo su -
$ source ~/novaenv.sh
$ glance member-create $IMAGE_UUID observer
$ glance member-update $IMAGE_UUID observer accepted
FIXME: can this be done with openstack instead of glance?

How To Deactivate Obsolete Images

We used to use the 'show=true' property to ensure that an image appeared on Wikitech. We now use the image state instead: only images with state=active appear in the GUI (both on Wikitech and in Horizon). To deactivate an obsolete image:

$ sudo wmcs-openstack image set --deactivate <image-id>

If you need to reactivate it for some reason:

$ sudo wmcs-openstack image set --activate <image-id>

Please note that we usually just "deprecate" images by changing their names. Deactivating an image is a more extreme step to be used when you do not want any users to have access to it.

Internals

bootstrap-vz configuration files

Bootstrap-vz uses source files from /etc/bootstrap-vz. These files are puppetized, so you'll want to disable Puppet if you change them.

OS               Configuration File
Debian Stretch   /etc/bootstrap-vz/manifests/labs-stretch.manifest.yaml
Debian Jessie    /etc/bootstrap-vz/manifests/labs-jessie.manifest.yaml


Bootstrap-vz also uses several source files that are standard local config files on the build host. For a complete list of these files, look at the 'file_copy:' section in /etc/bootstrap-vz/manifests/labs-{stretch,jessie}.manifest.yaml

First Boot

The first boot of the VM image is a key moment in the setup of an instance in Cloud VPS.

The setup is usually performed by the /root/firstboot.sh script, which is called from /etc/rc.local.

The script performs the following steps:

  • some LVM configuration
  • run DHCP request for configuration
  • name/domain resolution to autoconfigure the VM
  • initial puppet autoconfiguration (cert request, etc)
  • initial configuration of nscd/nslcd
  • initial apt updates
  • NFS mounts if required
  • final puppet run to fetch all remaining configuration (ssh keys, packages, etc)

Until the last step completes, the instance may have limited connectivity or usability.

Troubleshooting

For general Cloud VPS troubleshooting, please read the operational troubleshooting documents.

This troubleshooting section is specific to VM images (in Glance) and is generally only useful when dealing with new VM images.

Common Issues

The following issues commonly arise with new VM images. The details may vary from deployment to deployment.

  • Image does not have the puppet master CA, so it fails to fetch catalog (see phab:T181523)
  • Image does not have the puppet master CRL, so it fails to fetch catalog (see phab:T181523)
  • Image doesn't correctly resolve the hostname/domain name (so it fails to fetch its own puppet catalog)

How To Inspect Disk Contents

If you want to explore and edit the disk image of a live instance, read the docs at Cloud VPS troubleshooting, mounting an instance disk.

How to fix VM disk corruption

Please read Cloud VPS troubleshooting, fixing VM disk corruption (fsck).

Building an image from an existing VM

Nova can create a new base image out of an existing VM. This isn't something that we've done much but it's potentially useful to e.g. rapidly create a fleet of worker nodes from an original template.


  • Stop the source VM
openstack server stop <instance-id>
  • Create the new image
openstack server image create <instance-id> --name <new image name>
  • Wait for the image state to change from 'queued' to 'saving' to 'active'
openstack image show <new image id>
  • Set project ownership
openstack image set --shared <new image id>
openstack image add project <new image id> <project>
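
The wait for the image to become 'active' in the steps above can be scripted as a polling loop. The following is a minimal sketch rather than an existing tool: wait_for_active and image_status are hypothetical names, the retry count and delay are arbitrary defaults, and any status other than 'queued', 'saving' or 'active' is treated as an error.

```shell
# Hypothetical wrapper around the status query; kept separate so it can be
# replaced with a stub when testing the loop without OpenStack.
image_status() {
    openstack image show -f value -c status "$1"
}

# Poll until the image is 'active'.
# $1 = image ID, $2 = maximum number of polls, $3 = seconds between polls
wait_for_active() {
    image_id=$1
    tries=${2:-60}
    delay=${3:-10}
    while [ "$tries" -gt 0 ]; do
        status=$(image_status "$image_id") || return 1
        case "$status" in
            active) return 0 ;;
            queued|saving) ;;   # still in progress, keep waiting
            *) echo "unexpected status: $status" >&2; return 1 ;;
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done
    echo "timed out waiting for $image_id" >&2
    return 1
}

# Possible use (the ID is a placeholder):
# wait_for_active <new image id> && openstack image set --shared <new image id>
```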

See Also