Portal:Cloud VPS/Admin/Trove

Trove is OpenStack's database-as-a-service project. It creates and manages database instances, each of which implements one database install (e.g. mysql v5.7.29) hosting one or more databases. Each instance consists of a docker container stored on a cinder volume and running on a VM. The instance implementation details are hidden from the tenant that owns the database instance, but the volume and VM can be accessed from the 'trove' service project.

setup

Puppet will do most of the installation and setup, but a few manual steps are needed. This is not (currently) meant to be a complete script, but a list of pieces not managed by puppet:

  1. Database creation and syncing
  2. Add keystone endpoints (example endpoint: http://openstack.codfw1dev.wikimediacloud.org:8779/v1.0/%(tenant_id)s )
  3. Add 'trove' project and 'trove' service user. The trove user's password is stored in puppet. The user needs the 'projectadmin' and 'member' roles in the 'trove' project but NOT the 'admin' role. (Example commands for steps 2 and 3 are sketched below.)
  4. Guest images need to be installed (details below)
  5. Database docker images may need to be uploaded to our quay.io repo (details below)
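
A sketch of what steps 2 and 3 might look like with the openstack CLI. The region name is a placeholder, only the public endpoint interface is shown, and in practice project and user creation may go through the usual WMCS tooling, so treat this as an outline rather than a script:

$ openstack service create --name trove --description "Database service" database
$ openstack endpoint create --region <region> database public 'http://openstack.codfw1dev.wikimediacloud.org:8779/v1.0/%(tenant_id)s'
$ openstack project create trove
$ openstack user create --domain default --password <trove_password_from_puppet> trove
$ openstack role add --project trove --user trove member
$ openstack role add --project trove --user trove projectadmin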

guest images

The guest image contains a build of the Trove guest agent. In theory old guest images are forward-compatible with new Trove installs but it's probably safest to build a new guest image for each upgrade.

A new base image can be built on a local Debian laptop or VM using the 'trovestack' build script included in the Trove source. The guest image can be created by running the following commands:

$ git clone https://opendev.org/openstack/trove
$ cd trove
$ git checkout -b buildbranch {upstream_branch_for_current_trove_version}
$ cd integration/scripts
$ ./trovestack build-image \
{guest_os} \
{guest_os_release} \
{dev_mode} \
{guest_username} \
{output_image_path}
  • guest_os should always be 'ubuntu'
  • guest_os_release should be a recent, stable Ubuntu code name (e.g. 'focal')
  • dev_mode should always be 'false'
  • guest_username should always be 'ubuntu'
  • Be careful about output_image_path! As of the Antelope release (and possibly in future releases) the trovestack script starts by doing a recursive chmod in output_image_path; if you point it to your home directory you will be very sad with the results. Use a new, empty directory.

Once the image is built, upload it to glance in the 'trove' project and then upgrade the trove datastore versions to use the new image.

So, for the 2023.1/antelope release:

$ git clone https://opendev.org/openstack/trove
$ cd trove
$ git checkout -b antelope remotes/origin/stable/2023.1
$ cd integration/scripts
$ ./trovestack build-image ubuntu focal false ubuntu /tmp/troveguestbuild/trove-guest-antelope-ubuntu-jammy.qcow2
$ qemu-img convert -f qcow2 -O raw /tmp/troveguestbuild/trove-guest-antelope-ubuntu-jammy.qcow2 /tmp/troveguestbuild/trove-guest-antelope-ubuntu-jammy.raw
$ openstack image create --private --project trove --tag trove --disk-format "raw" --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi --container-format "ovf" --file /tmp/troveguestbuild/trove-guest-antelope-ubuntu-jammy.raw trove-master-guest-ubuntu-bionic
$ trove-manage datastore_version_update mysql 5.7.29 mysql <new_image_id> "" 1

Detailed (but possibly incorrect?) instructions can be found on OpenStack's site.

guest config

The central guest config is stored on the cloudcontrols at /etc/trove/trove-guestagent.conf. That file is injected into guests at boot time.
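
For reference, the injection is wired up in trove.conf on the cloudcontrols with options along these lines. These are the standard upstream option names and defaults; check the puppet-managed trove.conf for the values actually in use:

[DEFAULT]
# config file pushed to every guest at boot time
guest_config = /etc/trove/trove-guestagent.conf
# per-instance metadata (guest id, datastore) written alongside it
guest_info = guest_info.conf
# directory inside the guest where the injected files land
injected_config_location = /etc/trove/conf.d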

guest containers

When Trove requests a database of <type>:<version>, the guest image will attempt to fetch a Docker image with a corresponding <image:tag>, for example mysql:5.7.29. Images will only be fetched from the WMCS quay.io repo which can be found here: https://quay.io/organization/wikimedia-cloud-services

To provide a new container to Trove you will first need write access to that repository which can be granted by another existing member.

To copy a docker image from Docker Hub and upload it to the quay.io repo, the steps look like this:

# docker pull mysql:5.7.29
# docker login quay.io
   <provide username/password>
# docker tag mysql:5.7.29 quay.io/wikimedia-cloud-services/mysql:5.7.29
# docker push quay.io/wikimedia-cloud-services/mysql:5.7.29

When adding a new image you also need to grant read permission to the 'wikimedia-cloud-services+troveguest' robot account so that Trove can pull the image.

backup containers

For each datastore version, we need to provide a backup container to manage backups of that version. These should not need rebuilds for Trove upgrades, but DO need rebuilds if we support new versions of our databases, e.g. mysql 8.1.

That build process looks like this:

$ git clone https://opendev.org/openstack/trove
$ cd trove/
$ git checkout -b buildbranch {upstream_branch_for_current_trove_version}
$ cd backup
$ sudo docker build --build-arg="DATASTORE=mysql8.1"  .
$ # -- vast amount of scroll omitted --
$ sudo docker image list
REPOSITORY                                            TAG       IMAGE ID       CREATED          SIZE
<none>                                                <none>    73e15cbf25e9   23 seconds ago   620MB
$ sudo docker tag 73e15cbf25e9 quay.io/wikimedia-cloud-services/db-backup-mysql:8.1
$ sudo docker push quay.io/wikimedia-cloud-services/db-backup-mysql:8.1

If supporting an entirely new datastore type, some work may be needed in the quay.io web interface to adjust permissions. Everything in our repo is meant to be fully public.

install issues

A bug in the Victoria Debian packages causes trove-common configuration to fail, complaining that it cannot stat /usr/share/trove-common/api-paste.ini. Touching that file gets things configuring properly.
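
Something like this should unwedge it (assuming the failure left the package half-configured):

$ sudo touch /usr/share/trove-common/api-paste.ini
$ sudo dpkg --configure -a    # re-run configuration of any unconfigured packages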

Accessing Trove guest VMs

Each Trove database instance is backed by a VM of the same name in the 'trove' project. These VMs are not puppetized like normal cloud-vps VMs; they are built from the Trove guest image, described above. Consequently, they do not have standard ssh access enabled.

These instances have a public key installed for emergency troubleshooting. The key is stored by nova, owned by the 'trove' service user, and named trove-debug. The private keys can be found in the private puppet repo: modules/secret/secrets/ssh/wmcs/trove/openstack-trove-debug-key-codfw1dev and modules/secret/secrets/ssh/wmcs/trove/openstack-trove-debug-key-eqiad1.

To access a VM:

  • First get a copy of that private key on your laptop
  • Then add a security group rule permitting access on port 22 to the VM you want to investigate (see the example command after this list). Each VM has its own, private security group; VMs do NOT apply the default security group.
  • After all that, you should be able to ssh as the 'ubuntu' user:
ssh -i ~/.ssh/trove_codfw1dev ubuntu@consoletest10.trove.codfw1dev.wikimedia.cloud
  Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-141-generic x86_64)
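
The port-22 rule can be added in Horizon or with the openstack client; a sketch of the CLI version (the security group lives in the 'trove' project and normally matches the instance name, so referencing it by ID is the least ambiguous):

$ sudo wmcs-openstack security group rule create \
      --ingress --protocol tcp --dst-port 22 \
      --remote-ip <your_ip>/32 \
      <security_group_id_of_the_instance>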

Cloud-wide Trove quotas

Because the resources created by Trove are stored in the 'trove' project, the 'trove' project must have quotas big enough to support every database instance and volume cloud-wide. If instance creation is failing for no obvious reason, it's possible that the trove project has hit its quota limits.

The quotas in effect within the trove project are the standard nova and cinder project quotas: instances, cores, ram, volume storage, and ceph storage.
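
To see where the trove project stands against those limits, something along these lines from a cloudcontrol should do (using the same wmcs-openstack wrapper as the examples below):

$ sudo wmcs-openstack quota show trove               # project-wide limits
$ sudo wmcs-openstack server list --project trove    # instances counted against them
$ sudo wmcs-openstack volume list --project trove    # volumes counted against them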

Adjusting per-project Trove quotas

Each Cloud VPS project has quotas limiting trove usage: backups, instances, and volumes. These quotas aren't available via standard quota-management commands; rather, they're hidden under the 'openstack database' commandline.

user@cloudcontrol1006:~$ sudo wmcs-openstack database quota show deployment-prep
+-----------+--------+----------+-------+
| Resource  | In Use | Reserved | Limit |
+-----------+--------+----------+-------+
| backups   |      0 |        0 |     2 |
| instances |      1 |        0 |     2 |
| ram       |   1024 |        0 |  4096 |
| volumes   |      1 |        0 |     2 |
+-----------+--------+----------+-------+
user@cloudcontrol1005:~$ sudo openstack database quota update deployment-prep ram 8192
+-------+-------+
| Field | Value |
+-------+-------+
| ram   | 8192  |
+-------+-------+


  • The instances quota caps the total number of database instances.
  • The volumes quota caps the total number of GB permitted for database storage.
  • The ram quota caps the total MB of RAM permitted for database instances.
  • The backups quota limits the number of database volume backups. We currently do not use this feature so the quota is always set to 0.

Troubleshooting

Instance is down

Out of disk space

  • SSH to the instance, and check if the data volume has filled up.
  • If it has, try re-sizing it via Horizon.
  • If you can't, one option is to manually re-size the volume and partition and update the Trove metadata database (both instance details and quotas) to match. In phab:T355138 this was done like so:
mysql:root@localhost [trove_eqiad1]> update instances set volume_size =260 where id = 'ee0c90b0-5d21-4d41-9abf-cdabca2787c3';
Query OK, 1 row affected (0.003 sec)
Rows matched: 1  Changed: 1  Warnings: 0
mysql:root@localhost [trove_eqiad1]> update quota_usages set in_use = 260 where tenant_id = 'glamwikidashboard' and resource = 'volumes';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Note that with PostgreSQL you might see an error like WARNING: archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later in the database container logs. If you see this, check that the referenced file exists in /var/lib/postgresql/data/pgdata/pg_wal/ and then delete it from /var/lib/postgresql/data/wal_archive; otherwise the PostgreSQL WAL files will quickly eat all of the available disk space.
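
A minimal cleanup sketch, run on the guest VM (the WAL file name is just the one from the example warning above; the data volume is assumed to be mounted at /var/lib/postgresql as usual):

$ sudo ls /var/lib/postgresql/data/pgdata/pg_wal/ | grep 000000010000065D00000033   # confirm pg_wal still has the file
$ sudo rm /var/lib/postgresql/data/wal_archive/000000010000065D00000033             # then remove the stuck archive copy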

Quotas

Reserved quota does not go down

Sometimes for unknown reasons the reservation system for quotas does not clean up the reserved counter, leaving the quotas in a state similar to:

ssh cloudcontrol1005.eqiad.wmnet
cloudcontrol1005:~$ sudo -i mariadb -u root
mysql:root@localhost [(none)]> use trove_eqiad1

## The "reserved" counter in the "quota_usages" table is not zero
mysql:root@localhost [trove_eqiad1]> select * from quota_usages where tenant_id='tf-infra-test';
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
| id                                   | created             | updated             | tenant_id     | in_use | reserved | resource  |
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
| b3f7b371-ac14-4d02-afbd-f94f1b992c2a | 2023-06-26 19:39:38 | 2023-06-26 19:39:38 | tf-infra-test |      0 |        0 | backups   |
| ad1bcb7e-8b6b-4d36-8382-b24812d1a0f5 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |      1 |       -4 | instances |
| cc8e7f0b-b102-4c64-aee5-c19924742e11 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |   2048 |    -4096 | ram       |
| 22714fc9-75a5-4c11-82bf-a3a0bc900424 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |      2 |        8 | volumes   |
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
4 rows in set (0.001 sec)

# But there are no rows in the reservations table:
mysql:root@localhost [trove_eqiad1]> select * from reservations where status not in ('Committed', 'Rolled Back');
Empty set (0.005 sec)

When this happens, you can manually reset the reserved counter to 0:

mysql:root@localhost [trove_eqiad1]> update quota_usages set reserved=0 where tenant_id='tf-infra-test' and resource='volumes';

Even if there are items in the reservations table, it's possible they don't add up to the values in the quota_usages table. If you are sure there are no active database instances in the project, you can reset the in_use and reserved counters to zero:

mysql:root@localhost [trove_eqiad1]> update quota_usages set reserved=0,in_use=0 where tenant_id='tf-infra-test';

Resize failed

Resizing can fail for a couple of reasons. To diagnose any of these you will need to first SSH to the instance.

If resize fails immediately in the UI, the instance or db container is probably in a state that Trove doesn't like. Stop the db container (with 'docker ps' and 'docker stop') and try again.
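
For example:

$ sudo docker ps                       # find the running database container
$ sudo docker stop <container_name>    # stop it, then retry the resize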

If the volume resize succeeds but the total operation fails, it's likely that there was a problem with remounting the volume and resizing the filesystem. This failure appears regularly on instances built against an older pre-focal Ubuntu version (or the upstream trove-master-guest-ubuntu-focal). This is because Trove expects the database volume to be mounted as /dev/vdb but these older distros mount it as /dev/sdb. To fix this you will need to complete the resize manually.

$ # make sure that the volume is the size you expect
$ lsblk
$ # make sure the filesystem is /not/ the size you expect
$ df -h
$ # probably df didn't show the db volume under /var/lib. Mount it with
$ mount /var/lib/postgresql # or similar, depending on database
$ # then resize the filesystem
$ resize2fs /dev/sdb

After any of the above operations you will need to restart the database container. That's easily done with the 'Restart Instance' option in Horizon under Database->Instances.