Portal:Cloud VPS/Admin/Ceph

Ceph is a distributed storage platform used to provide shared block storage for Cloud VPS instance disks and operating system images. This is a low-level support system of interest to the Wikimedia Cloud Services team and others who work on the OpenStack hosting platform behind the Cloud VPS product. It is not currently a Cloud VPS project admin or user facing feature.

Architecture

Hardware sizing

Hardware sizing recommendations, based on our internal testing and community guidelines:

Type | Bus | Bandwidth (avg) | CPU recommendation | RAM recommendation
HDD | SATA 6 Gb/s | 150 MB/s | Xeon Silver 2 GHz x 1 CPU core = 2 core-GHz per HDD | 1 GB per 1 TB of storage
SSD* | SATA 6 Gb/s | 450 MB/s | Xeon Silver 2 GHz x 2 CPU cores = 4 core-GHz per SSD | 1 GB per 1 TB of storage
NVMe | M.2 PCIe 32 Gb/s | 3000 MB/s | Xeon Gold 2 GHz x 5 CPU cores x 2 sockets = 20 core-GHz per NVMe | 2 GB per OSD

* The POC cluster is equipped with SATA SSD drives

Note: When reviewing community sizing guides, keep in mind the types of drives being discussed and their capabilities.

In addition to the baseline CPU requirements, extra CPU and RAM must be budgeted for the operating system and for Ceph rebuilding, rebalancing, and data scrubbing.

Next Phase Ceph OSD Server recommendation:

PowerEdge R440 Rack Server

  • 2 x Xeon Silver 4214 CPU 12 cores / 24 threads
  • 2 x 32GB RDIMM
  • 2 x 240GB SSD SATA (OS Drive)
  • 8 x 1.92TB SSD SATA (Data Drive)
  • 2 x 10Gb NIC
  • No RAID (JBOD Only)

It's estimated that 15 OSD servers will provide enough initial storage capacity for existing virtual machine disk images and block devices.
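
As a rough sanity check against the sizing table above (illustrative arithmetic only, not a formal capacity plan), this configuration exceeds the baseline CPU and RAM guidance:

 8 data SSDs x 4 core-GHz       = 32 core-GHz baseline requirement
 2 sockets x 12 cores x 2.2 GHz = 52.8 core-GHz available (Xeon Silver 4214 base clock)
 8 x 1.92 TB                    = ~15.4 TB raw, suggesting a ~16 GB RAM baseline; 64 GB is installed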

BIOS

Base Settings

  • Ensure "Boot mode" is set to "BIOS" and not "UEFI". (This is required for the netboot process)

PXE boot settings

All Ceph hosts are equipped with a 10Gb Broadcom BCM57412 NIC and are not using the embedded onboard NIC.

  1. During system boot, when prompted to configure the second Broadcom BCM57412 NIC device, press "Ctrl + S"
  2. On the main menu select "MBA Configuration" and toggle the "Boot Protocol" setting to "Preboot Execution Environment (PXE)"
  3. Press escape, then select "Exit and Save Configurations"
  4. After the system reboots, press "F2" to enter "System Setup"
  5. Navigate to "System BIOS > Boot Settings > BIOS Boot Settings"
  6. Select "Boot Sequence" and change the boot order to: "Hard dive C:", "NIC in Slot 2 Port 1..", "Embedded NIC 1 Port 1..."
  7. Exit System Setup, saving your changes and rebooting the system

Alternatively, steps 4 through 7 can be replaced with racadm commands, but you will still need to enable the PXE boot protocol in the NIC option ROM (steps 1 through 3).

/admin1-> racadm set BIOS.BiosBootSettings.bootseq HardDisk.List.1-1,NIC.Slot.2-1-1,NIC.Embedded.1-1-1
/admin1-> racadm jobqueue create BIOS.Setup.1-1
/admin1-> racadm serveraction hardreset
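
To confirm the change after the reboot, the boot sequence can be read back with racadm get (assuming the same attribute path used in the set command above):

/admin1-> racadm get BIOS.BiosBootSettings.bootseq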

Storage

Operating System Disks

The Ceph OSD and monitor servers both use software mdadm RAID1 for the operating system. The RAID set is built from the first two SSD drives (sda and sdb) using the partman/raid1-2dev.cfg profile.

When mixing RAID and NON-RAID devices on the LSI Logic / Symbios Logic MegaRAID SAS-3 3108 storage adapter, the device IDs are mapped incorrectly for our environment: the NON-RAID devices show up first, followed by the RAID device. Because of this we chose software mdadm RAID instead of hardware RAID.

Ceph OSD Disks

Ceph manages data redundancy at the application layer, so each OSD drive is configured as a JBOD in NON-RAID mode.

SSD Drive Details

Server | Total | Vendor | Capacity | Model | Function
cloudcephmon100[1-3].wikimedia.org | 2 | Dell | 480GB | SSDSC2KB480G8R | RAID1 operating system
cloudcephosd100[1-3].wikimedia.org | 2 | Dell | 240GB | SSDSC2KG240G8R | RAID1 operating system
cloudcephosd100[1-3].wikimedia.org | 8 | Dell | 1.92TB | SSDSC2KG019T8R | JBOD Ceph Bluestore

CloudVPS-ceph-disk-layout-1.png

Network

Cloudvps-ceph-phase1-2.png

Operating System

All Ceph hosts will be using Debian 10 (Buster).

Ceph Packaging

Buster does not include a recent version of Ceph and Ceph does not provide packages for Buster (tracked at https://tracker.ceph.com/issues/42870).

Croit provides Debian Buster packages that are built using the upstream Ceph build process and made available to the community (https://croit.io/2019/07/07/2019-07-07-debian-mirror, https://github.com/croit/ceph). These packages have been mirrored to our local APT repository.

Puppet

Roles

  • wmcs::ceph::mon Deploys the Ceph monitor and manager daemon to support CloudVPS hypervisors
  • wmcs::ceph::osd Deploys the Ceph object storage daemon to support CloudVPS hypervisors
  • role::wmcs::openstack::eqiad1::virt_ceph Deploys nova-compute configured with RBD based virtual machines (Note that switching between this role and wmcs::openstack::eqiad1::virt should work gracefully without a rebuild)

Profiles

  • profile::ceph::client::rbd Install and configure a Ceph RBD client
  • profile::ceph::osd Install and configure Ceph object storage daemon
  • profile::ceph::mon Install and configure Ceph monitor and manager daemon
  • profile::ceph::alerts Configure Ceph cluster alerts in Icinga

Modules

  • ceph Install and configure the base Ceph installation used by all services and clients
  • ceph::admin Configures the Ceph administrator keyring
  • ceph::mgr Install and configure the Ceph manager daemon
  • ceph::mon Install and configure the Ceph monitor daemon
  • ceph::keyring Defined resource that manages access control and keyrings

Post Installation Procedures

Adding OSDs

Locate available disks with lsblk

 cloudcephosd1001:~# lsblk
 NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
 sda                                                                                                     8:0    0 223.6G  0 disk
 ├─sda1                                                                                                  8:1    0  46.6G  0 part
 │ └─md0                                                                                                 9:0    0  46.5G  0 raid1 /
 ├─sda2                                                                                                  8:2    0   954M  0 part
 │ └─md1                                                                                                 9:1    0   953M  0 raid1 [SWAP]
 └─sda3                                                                                                  8:3    0 176.1G  0 part
   └─md2                                                                                                 9:2    0   176G  0 raid1
     └─cloudcephosd1001--vg-data                                                                       253:2    0 140.8G  0 lvm   /srv
 sdb                                                                                                     8:16   0 223.6G  0 disk
 ├─sdb1                                                                                                  8:17   0  46.6G  0 part
 │ └─md0                                                                                                 9:0    0  46.5G  0 raid1 /
 ├─sdb2                                                                                                  8:18   0   954M  0 part
 │ └─md1                                                                                                 9:1    0   953M  0 raid1 [SWAP]
 └─sdb3                                                                                                  8:19   0 176.1G  0 part
   └─md2                                                                                                 9:2    0   176G  0 raid1
     └─cloudcephosd1001--vg-data                                                                       253:2    0 140.8G  0 lvm   /srv
 sdc                                                                                                     8:80   0   1.8T  0 disk
 sdd                                                                                                     8:96   0   1.8T  0 disk
 sde                                                                                                     8:80   0   1.8T  0 disk
 sdf                                                                                                     8:80   0   1.8T  0 disk
 sdg                                                                                                     8:96   0   1.8T  0 disk
 sdh                                                                                                     8:112  0   1.8T  0 disk
 sdi                                                                                                     8:128  0   1.8T  0 disk
 sdj                                                                                                     8:144  0   1.8T  0 disk

To prepare a disk for Ceph, first zap it:

 cloudcephosd1001:~# ceph-volume lvm zap /dev/sdc
 --> Zapping: /dev/sdc
 --> --destroy was not specified, but zapping a whole device will remove the partition table
 Running command: /bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10
  stderr: 10+0 records in
 10+0 records out
 10485760 bytes (10 MB, 10 MiB) copied, 0.00357845 s, 2.9 GB/s
 --> Zapping successful for: <Raw Device: /dev/sdc>

Then prepare, activate and start the new OSD (repeat the zap and create steps for each data disk):

 cloudcephosd1001:~# ceph-volume lvm create --bluestore --data /dev/sde
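
To confirm the new OSD has joined the cluster and is up and in, the standard status commands can be run from a monitor host:

 cloudcephmon1001:~$ sudo ceph osd stat
 cloudcephmon1001:~$ sudo ceph osd tree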

Creating Pools

To create a new storage pool you first need to determine the number of placement groups to assign to it. The calculator at https://ceph.io/pgcalc/ can help identify a starting point (note that this value can easily be increased later, but not decreased):

sudo ceph osd pool create eqiad1-compute 512

Enable the RBD application for the new pool

sudo ceph osd pool application enable eqiad1-compute rbd
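
If the placement group count needs to grow later, it can be raised in place with the standard pool commands, for example (the target value here is hypothetical; remember it cannot be lowered again):

sudo ceph osd pool get eqiad1-compute pg_num
sudo ceph osd pool set eqiad1-compute pg_num 1024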

Configuring Client Keyrings

The Cloud VPS hypervisors connect to the compute pool using RBD. After running the following command, the keyring data should be stored in the private Hiera file eqiad/profile/ceph/client/rbd.yaml under the profile::ceph::client::rbd::keydata key. (Note that all nova-compute hypervisors are configured to use the same Ceph keyring.)

 cloudcephmon1001:~$ sudo ceph auth get-or-create client.eqiad1-compute mon 'profile rbd' osd 'profile rbd pool=compute'
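
To print just the key material for pasting into the private Hiera file, the standard auth get-key command can be used:

 cloudcephmon1001:~$ sudo ceph auth get-key client.eqiad1-compute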

Monitoring

The Ceph managers are configured to run a Prometheus exporter that exports global cluster metrics. This exporter is only active on one manager at a time, failing over between active managers is handled directly by Ceph. The cloudmetrics Prometheus server is configured to scrape this exporter.
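
The currently active exporter endpoint can be checked from a monitor host with the standard mgr commands (the exact port depends on our configuration):

 cloudcephmon1001:~$ sudo ceph mgr module ls | grep prometheus
 cloudcephmon1001:~$ sudo ceph mgr services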

Dashboards

In addition to custom Grafana dashboards, the dashboards provided by the Ceph community have also been installed and adapted for our environment.

Icinga alerts

These alerts are global to the Ceph cluster and configured on the icinga1001 host to avoid getting multiple notifications for a single event.

Ceph Cluster Health

  • Description
    Ceph storage cluster health check
  • Status Codes
    • 0 - healthy, all services are healthy
    • 1 - warn, cluster is running in a degraded state, data is still accessible
    • 2 - critical, cluster is failed, some or all data is inaccessible
  • Next steps
    • On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph --status. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph --status
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

 services:
   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 3w)
   mgr: cloudcephmon1002(active, since 10d), standbys: cloudcephmon1003, cloudcephmon1001
   osd: 24 osds: 24 up (since 3w), 24 in (since 3w)

 data:
   pools:   1 pools, 256 pgs
   objects: 3 objects, 19 B
   usage:   25 GiB used, 42 TiB / 42 TiB avail
   pgs:     256 active+clean

Ceph Monitor Quorum

  • Description
    Verify there are enough Ceph monitor daemons running for proper quorum
  • Status Codes
    • 0 - healthy, 3 or more Ceph Monitors are running
    • 2 - critical, Less than 3 Ceph monitors are running
  • Next steps
    • On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph mon stat. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph mon stat
e1: 3 mons at {cloudcephmon1001=[v2:208.80.154.148:3300/0,v1:208.80.154.148:6789/0],cloudcephmon1002=[v2:208.80.154.149:3300/0,v1:208.80.154.149:6789/0],cloudcephmon1003=[v2:208.80.154.150:3300/0,v1:208.80.154.150:6789/0]}, election epoch 24, leader 0 cloudcephmon1001, quorum 0,1,2 cloudcephmon1001,cloudcephmon1002,cloudcephmon1003

Maintenance

Restarting Ceph services

Always ensure the cluster is healthy before restarting any services

cloudcephosd1001:~$ sudo ceph -s
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK

 services:
   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 14m)
   mgr: cloudcephmon1002(active, since 17m), standbys: cloudcephmon1001, cloudcephmon1003
   osd: 24 osds: 24 up (since 10w), 24 in (since 10w)

 data:
   pools:   1 pools, 256 pgs
   objects: 41.22k objects, 160 GiB
   usage:   502 GiB used, 41 TiB / 42 TiB avail
   pgs:     256 active+clean

 io:
   client:   25 KiB/s wr, 0 op/s rd, 1 op/s wr

ceph-crash

The ceph-crash service can be restarted on all hosts without any service interruption

$ sudo cumin 'P{R:Class = role::wmcs::ceph::mon or R:Class = role::wmcs::ceph::osd}' 'systemctl restart ceph-crash'

ceph-mgr

The ceph-mgr service can be restarted one by one across the ceph-mon hosts

$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1001'
$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1002'
$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1003'

ceph-mon

IMPORTANT: This process should be run by the WMCS team.
IMPORTANT: Only one ceph-mon may be offline at any time.

The ceph-mon service can be restarted one host at a time across the ceph-mon hosts; between restarts, verify that the restarted monitor has rejoined the cluster.

$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1001'
$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'ceph -s'
$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1002'
$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'ceph -s'
$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1003'
$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'ceph -s'

Example healthy output from ceph -s

...
  services:                                                                                                                                                                                                                                                        
   mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 7s)                                                                                                                                                                             
   mgr: cloudcephmon1002(active, since 2m), standbys: cloudcephmon1001, cloudcephmon1003                                                                                                                                                                          
   osd: 24 osds: 24 up (since 10w), 24 in (since 10w)     
...

ceph-osd

IMPORTANT: This process should be run by the WMCS team.
IMPORTANT: Restarting multiple ceph-osd processes at once can cause a service outage or heavy network utilization during data rebalancing.

Until we have more Ceph OSD servers, we'll want to restart each OSD one at a time to ensure the cluster remains healthy.
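
As an optional precaution (a common upstream practice rather than a step in this runbook), the noout flag can be set before the restarts so the cluster does not start rebalancing data while an OSD is briefly down, and cleared once all restarts are complete:

$ sudo ceph osd set noout
$ sudo ceph osd unset noout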

Identify the systemctl unit files for each OSD

$ systemctl | grep ceph-osd@[0-9]
ceph-osd@0.service                                                                               loaded active running   Ceph object storage daemon osd.0                                             
ceph-osd@1.service                                                                               loaded active running   Ceph object storage daemon osd.1                                             
ceph-osd@12.service                                                                              loaded active running   Ceph object storage daemon osd.12                                            
ceph-osd@15.service                                                                              loaded active running   Ceph object storage daemon osd.15                                            
ceph-osd@18.service                                                                              loaded active running   Ceph object storage daemon osd.18                                            
ceph-osd@21.service                                                                              loaded active running   Ceph object storage daemon osd.21                                            
ceph-osd@3.service                                                                               loaded active running   Ceph object storage daemon osd.3                                             
ceph-osd@9.service                                                                               loaded active running   Ceph object storage daemon osd.9  

Restart the ceph-osd services one by one

$ sudo systemctl restart ceph-osd@0

Immediately after restarting a service, Ceph health will briefly report the OSD as down

$ sudo ceph -s
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_WARN
           1 osds down
...

Once the OSD is back online, data verification (recovery of the degraded placement groups) begins

$ sudo ceph -s
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_WARN
           Degraded data redundancy: 1842/123657 objects degraded (1.490%), 11 pgs degraded
...

Within a few seconds the OSD should be fully back online and healthy

$ sudo ceph -s
 cluster:
   id:     5917e6d9-06a0-4928-827a-f489384975b1
   health: HEALTH_OK
...



Removing or Replacing a failed OSD drive

The process for removing a failed OSD and for replacing one is the same: in both cases the OSD's configuration and service are removed from the storage cluster.

  1. Locate the failed OSD (osd.0 in this example)
    cloudcephmon1001:~$ sudo ceph osd tree
    ID CLASS WEIGHT   TYPE NAME                 STATUS REWEIGHT PRI-AFF 
    -1       41.90625 root default                                      
    -3       13.96875     host cloudcephosd1001                         
     0   ssd  1.74609         osd.0               down  1.00000 1.00000 
     1   ssd  1.74609         osd.1                 up  1.00000 1.00000 
     3   ssd  1.74609         osd.3                 up  1.00000 1.00000 
     9   ssd  1.74609         osd.9                 up  1.00000 1.00000 
    12   ssd  1.74609         osd.12                up  1.00000 1.00000 
    15   ssd  1.74609         osd.15                up  1.00000 1.00000 
    18   ssd  1.74609         osd.18                up  1.00000 1.00000 
    21   ssd  1.74609         osd.21                up  1.00000 1.00000 
    -5       13.96875     host cloudcephosd1002                         
     2   ssd  1.74609         osd.2                 up  1.00000 1.00000 
     4   ssd  1.74609         osd.4                 up  1.00000 1.00000 
     5   ssd  1.74609         osd.5                 up  1.00000 1.00000 
    10   ssd  1.74609         osd.10                up  1.00000 1.00000 
    13   ssd  1.74609         osd.13                up  1.00000 1.00000 
    16   ssd  1.74609         osd.16                up  1.00000 1.00000 
    19   ssd  1.74609         osd.19                up  1.00000 1.00000 
    22   ssd  1.74609         osd.22                up  1.00000 1.00000 
    -7       13.96875     host cloudcephosd1003                         
     6   ssd  1.74609         osd.6                 up  1.00000 1.00000 
     7   ssd  1.74609         osd.7                 up  1.00000 1.00000 
     8   ssd  1.74609         osd.8                 up  1.00000 1.00000 
    11   ssd  1.74609         osd.11                up  1.00000 1.00000 
    14   ssd  1.74609         osd.14                up  1.00000 1.00000 
    17   ssd  1.74609         osd.17                up  1.00000 1.00000 
    20   ssd  1.74609         osd.20                up  1.00000 1.00000 
    23   ssd  1.74609         osd.23                up  1.00000 1.00000
    
  2. Stop the ceph-osd service on the server with the failed OSD
    cloudcephosd1001:~$ sudo systemctl stop ceph-osd@0
    
  3. Remove the OSD from Ceph
    cloudcephosd1001:~$ sudo ceph osd out 0
    
  4. Verify that Ceph is rebuilding the placement groups
    cloudcephosd1001:~$ sudo ceph health -w
    
  5. Remove the OSD from the CRUSH map
    cloudcephosd1001:~$ sudo ceph osd crush remove osd.0
    
  6. Remove the OSD authentication keys
    cloudcephosd1001:~$ sudo ceph auth del osd.0
    
  7. Remove the OSD from Ceph
    cloudcephosd1001:~$ sudo ceph osd rm osd.0
    
  8. Replace the physical drive
  9. Follow the process to add a new OSD once the drive is replaced

Clients

Rate limiting

Native

Native RBD rate limiting is supported starting with the Ceph Nautilus release. Due to upstream package availability and the multiple Debian releases in use, we will likely have a mixture of older Ceph client versions during phase 1.

Available rate limiting options and their defaults in the Nautilus release:

$ rbd config pool ls <pool> | grep qos
rbd_qos_bps_burst                                   0         config
rbd_qos_bps_limit                                   0         config
rbd_qos_iops_burst                                  0         config
rbd_qos_iops_limit                                  0         config
rbd_qos_read_bps_burst                              0         config
rbd_qos_read_bps_limit                              0         config
rbd_qos_read_iops_burst                             0         config
rbd_qos_read_iops_limit                             0         config
rbd_qos_schedule_tick_min                           50        config
rbd_qos_write_bps_burst                             0         config
rbd_qos_write_bps_limit                             0         config
rbd_qos_write_iops_burst                            0         config
rbd_qos_write_iops_limit                            0         config
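
As an illustration (hypothetical values, not our current settings), a pool-wide write IOPS limit could be applied and verified with:

$ rbd config pool set <pool> rbd_qos_write_iops_limit 500
$ rbd config pool get <pool> rbd_qos_write_iops_limit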

OpenStack

IO rate limiting can also be managed using a flavor's metadata. This triggers libvirt to apply `iotune` limits on the ephemeral disk.

Available disk tuning options
  • disk_read_bytes_sec
  • disk_read_iops_sec
  • disk_write_bytes_sec
  • disk_write_iops_sec
  • disk_total_bytes_sec
  • disk_total_iops_sec

Example commands to create a flavor, or modify an existing flavor's metadata, with rate-limiting options roughly equivalent to a 7200 RPM SATA disk:

openstack flavor create \
  --ram 2048 \
  --disk 20 \
  --vcpus 1 \
  --private \
  --project testlabs \
  --id 857921a5-f0af-4069-8ad1-8f5ea86c8ba2 \
  --property quota:disk_total_iops_sec=200 m1.small-ceph
openstack flavor set --property quota:disk_total_bytes_sec=250000000 857921a5-f0af-4069-8ad1-8f5ea86c8ba2

The Ceph host aggregate will pin flavors to the hypervisors configured to use Ceph

openstack flavor set --property aggregate_instance_extra_specs:ceph=true 857921a5-f0af-4069-8ad1-8f5ea86c8ba2

Example rate limit configuration as seen by libvirt. (virsh dumpxml <instance name>)

<target dev='sda' bus='virtio'/>
 <iotune>
  <total_bytes_sec>250000000</total_bytes_sec>
  <total_iops_sec>200</total_iops_sec>
 </iotune>

NOTE: Updating a flavor's metadata does not have any effect on existing virtual machines.

When a virtual machine is created, the requested flavor data is copied to the instance, and any later updates to the flavor are ignored. The wmcs-vm-extra-specs script connects to the Nova database and modifies an instance's extra specs directly.

$ wmcs-vm-extra-specs --help
usage: wmcs-vm-extra-specs [-h] [--nova-db-server NOVA_DB_SERVER] [--nova-db NOVA_DB] [--mysql-password MYSQL_PASSWORD]
                          uuid
                          {quota:disk_read_bytes_sec,
                           quota:disk_read_iops_sec,
                           quota:disk_total_bytes_sec,
                           quota:disk_total_iops_sec,
                           quota:disk_write_bytes_sec,
                           quota:disk_write_iops_sec}
                          spec_value
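
A hypothetical invocation based on the usage text above (the UUID and value are placeholders):

$ wmcs-vm-extra-specs <instance uuid> quota:disk_total_iops_sec 200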

Glance

Glance images used by Ceph-based virtual machines should have the hw_scsi_model=virtio-scsi and hw_disk_bus=scsi properties defined. These properties enable SCSI discard operations, which instruct Ceph to remove blocks that have been deleted.

Image create example

openstack image create --file debian-buster-20191218.qcow2 \
  --disk-format "qcow2" \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  --public \
  debian-10.0-buster

(Once Glance is configured to store images in Ceph, the disk-format will change from qcow2 to raw)
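
When that change happens, an image can be converted ahead of upload with qemu-img, for example (filenames are illustrative):

$ qemu-img convert -f qcow2 -O raw debian-buster-20191218.qcow2 debian-buster-20191218.raw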

Nova compute

Migrating local VMs to Ceph

Switch the hypervisor's puppet role to the Ceph-enabled wmcs::openstack::eqiad1::virt_ceph role in operations/puppet/manifests/site.pp:

node 'cloudvirt1022.eqiad.wmnet' {
   role(wmcs::openstack::eqiad1::virt_ceph)
}

Run the puppet agent on the hypervisor

hypervisor $ sudo puppet agent -tv

Shutdown the VM

cloudcontrol $ openstack server stop <UUID>

Convert the local QCOW2 image to raw and upload to Ceph

hypervisor $ qemu-img convert -f qcow2 -O raw /var/lib/nova/instances/<UUID>/disk rbd:compute/<UUID>_disk:id=eqiad1-compute

Undefine the virtual machine. This removes the existing libvirt definition from the hypervisor; when nova next attempts to start the VM it will be redefined with the RBD configuration. (This step can be skipped, but you may see errors in nova-compute.log until the VM has been restarted.)

hypervisor $ virsh undefine <OS-EXT-SRV-ATTR:instance_name>
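
The <OS-EXT-SRV-ATTR:instance_name> value used above can be looked up from a cloudcontrol host with a standard openstack client query:

cloudcontrol $ openstack server show <UUID> -c OS-EXT-SRV-ATTR:instance_name -f value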

Cleanup local storage files

hypervisor $ rm /var/lib/nova/instances/<UUID>/disk
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk.info

Power on the VM

cloudcontrol $ openstack server start <UUID>

Reverting back to local storage

Shutdown the VM

cloudcontrol $ openstack server stop <UUID>

Convert the Ceph raw image back to QCOW2 on the local hypervisor

hypervisor $ qemu-img convert -f raw -O qcow2 rbd:compute/<UUID>_disk:id=eqiad1-compute /var/lib/nova/instances/<UUID>/disk

Power on the VM

cloudcontrol $ openstack server start <UUID>

CPU Model Type

A virtual machine can only be live-migrated to a hypervisor with a matching CPU model. CloudVPS currently has multiple CPU generations and uses the default "host-model" nova configuration.

To enable live migration between any production hypervisor, the cpu_mode parameter should match the lowest common hypervisor CPU model (see the sketch after the table below).

Hypervisor range | CPU model | Launch date
cloudvirt[1023-1030].eqiad.wmnet | Gold 6140 (Skylake) | 2017
cloudvirt[1016-1022].eqiad.wmnet | E5-2697 v4 (Broadwell) | 2016
cloudvirt[1012-1014].eqiad.wmnet | E5-2697 v3 (Haswell) | 2014
cloudvirt[1001-1009].eqiad.wmnet | E5-2697 v2 (Ivy Bridge) | 2013
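
A minimal sketch of what that setting could look like in nova-compute's configuration, assuming the lowest common model is Ivy Bridge (option names vary slightly between OpenStack releases; this is not our current configuration):

 [libvirt]
 cpu_mode = custom
 cpu_model = IvyBridge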

Virtual Machine Images

Important: Using QCOW2 for hosting a virtual machine disk is NOT recommended. If you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), please use the raw image format within Glance.

Once all CloudVPS virtual machines have been migrated to Ceph we can convert the existing virtual machine images in Glance from QCOW2 to raw. This will avoid having nova-compute convert the image each time a new virtual machine is created.

VirtIO SCSI devices

Currently CloudVPS virtual machines are configured with the virtio-blk driver. This driver does not support discard/trim operations to free up deleted blocks.

Discard support can be enabled by switching to the virtio-scsi driver, but note that the device names will change from /dev/vda to /dev/sda.
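
For an existing Glance image, switching to virtio-scsi uses the same properties described in the Glance section, set with the standard openstack client command:

$ openstack image set --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi <image>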

Performance Testing

Network

Jumbo frames (9k MTU) have been configured to improve the network throughput and overall network performance, as well as reduce the CPU utilization on the Ceph OSD servers.

Baseline (default tuning options)

Iperf options used to simulate Ceph storage IO.

-N disable Nagle's Algorithm
-l 4M set read/write buffer size to 4 megabyte
-P number of parallel client threads to run (one per OSD)

Server:

iperf -s -N -l 4M

Client:

iperf -c <server> -N -l 4M -P 8
cloudcephosd <-> cloudcephosd
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[ 10]  0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[  9]  0.0-10.0 sec   664 MBytes   557 Mbits/sec
[  6]  0.0-10.0 sec   720 MBytes   603 Mbits/sec
[  5]  0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[ 13]  0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[  7]  0.0-10.0 sec   720 MBytes   602 Mbits/sec
[  8]  0.0-10.0 sec   720 MBytes   603 Mbits/sec
[SUM]  0.0-10.0 sec  11.0 GBytes  9.42 Gbits/sec
cloudvirt1022 -> cloudcephosd
cloudvirt1022 <-> cloudcephosd: 8.55 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.11 GBytes   949 Mbits/sec
[  6]  0.0-10.0 sec  1.25 GBytes  1.07 Gbits/sec
[  4]  0.0-10.0 sec  1.39 GBytes  1.19 Gbits/sec
[  9]  0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[ 10]  0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
[  5]  0.0-10.0 sec  1.36 GBytes  1.16 Gbits/sec
[  3]  0.0-10.0 sec  1.41 GBytes  1.21 Gbits/sec
[  8]  0.0-10.0 sec  1.17 GBytes  1.00 Gbits/sec
[SUM]  0.0-10.0 sec  10.0 GBytes  8.55 Gbits/sec

Ceph RBD

Test cases

FIO random read/write

$ fio --name fio-randrw \
      --bs=64k \
      --direct=1 \
      --filename=/srv/fio.randrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --randrepeat=1 \
      --readwrite=randrw \
      --rwmixread=50 \
      --size=5G \
      --group_reporting

FIO sequential read/write

$ fio --name=fio-seqrw \
      --bs=64k \
      --direct=1 \
      --filename=/srv/fio.seqrw \
      --fsync=256 \
      --gtod_reduce=1 \
      --iodepth=64 \
      --ioengine=libaio \
      --rw=rw \
      --size=5G \
      --group_reporting

Baseline (default tuning options)

single virtual machine
$ dd if=/dev/zero of=/srv/test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 0.875202 s, 585 MB/s

FIO sequential read/write

fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
fio-seqrw: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=12.1MiB/s,w=11.9MiB/s][r=3092,w=3045 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=31970: Fri Jan 10 15:24:29 2020
 read: IOPS=3849, BW=15.0MiB/s (15.8MB/s)(2561MiB/170310msec)
  bw (  KiB/s): min= 7048, max=41668, per=100.00%, avg=15403.11, stdev=7014.38, samples=340
  iops        : min= 1762, max=10417, avg=3850.78, stdev=1753.59, samples=340
 write: IOPS=3846, BW=15.0MiB/s (15.8MB/s)(2559MiB/170310msec); 0 zone resets
  bw (  KiB/s): min= 6464, max=41365, per=100.00%, avg=15389.03, stdev=7006.27, samples=340
  iops        : min= 1616, max=10341, avg=3847.25, stdev=1751.56, samples=340
 cpu          : usr=3.43%, sys=11.35%, ctx=623109, majf=0, minf=9
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  READ: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2561MiB (2686MB), run=170310-170310msec
 WRITE: bw=15.0MiB/s (15.8MB/s), 15.0MiB/s-15.0MiB/s (15.8MB/s-15.8MB/s), io=2559MiB (2683MB), run=170310-170310msec

Disk stats (read/write):
 vda: ios=656106/663558, merge=28/3888, ticks=3895800/3550224, in_queue=7399608, util=74.52%

Rate limiting enabled

single virtual machine
$ dd if=/dev/zero of=/srv/1test.dd bs=4k count=125000 conv=sync
512000000 bytes (512 MB, 488 MiB) copied, 4.57852 s, 112 MB/s

FIO sequential read/write

fio-seqrw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=228KiB/s,w=172KiB/s][r=57,w=43 IOPS][eta 00m:00s]
fio-seqrw: (groupid=0, jobs=1): err= 0: pid=30958: Fri Jan 10 19:10:12 2020
 read: IOPS=49, BW=198KiB/s (203kB/s)(2561MiB/13237587msec)
  bw (  KiB/s): min=    7, max=  584, per=100.00%, avg=201.54, stdev=48.09, samples=26014
  iops        : min=    1, max=  146, avg=50.33, stdev=12.02, samples=26014
 write: IOPS=49, BW=198KiB/s (203kB/s)(2559MiB/13237587msec); 0 zone resets
  bw (  KiB/s): min=    7, max=  696, per=100.00%, avg=201.30, stdev=56.83, samples=26023
  iops        : min=    1, max=  174, avg=50.27, stdev=14.21, samples=26023
 cpu          : usr=0.16%, sys=0.62%, ctx=1208453, majf=0, minf=10
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.4%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=655676,655044,0,5021 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  READ: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2561MiB (2686MB), run=13237587-13237587msec
 WRITE: bw=198KiB/s (203kB/s), 198KiB/s-198KiB/s (203kB/s-203kB/s), io=2559MiB (2683MB), run=13237587-13237587msec

Disk stats (read/write):
 vda: ios=659758/669762, merge=0/9391, ticks=517938687/296952049, in_queue=800011092, util=98.11%

CLI examples

Create, format and mount an RBD image (useful for testing / debugging)

$ rbd create datatest --size 250 --pool compute --image-feature layering
$ rbd map datatest --pool compute --name client.admin
$ mkfs.ext4 -m0 /dev/rbd0
$ mount /dev/rbd0 /mnt/
$ umount /mnt
$ rbd unmap /dev/rbd0
$ rbd rm compute/datatest

List RBD nova images

$ rbd ls -p compute
9e2522ca-fd5e-4d42-b403-57afda7584c0_disk

Show RBD image information

$ rbd info -p compute 9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk
rbd image '9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk':
       size 20 GiB in 5120 objects
       order 22 (4 MiB objects)
       snapshot_count: 0
       id: aec56b8b4567
       block_name_prefix: rbd_data.aec56b8b4567
       format: 2
       features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
       op_features:
       flags:
       create_timestamp: Mon Jan  6 21:36:11 2020
       access_timestamp: Mon Jan  6 21:36:11 2020
       modify_timestamp: Mon Jan  6 21:36:11 2020

View RBD image with qemu tools on a hypervisor

$ qemu-img info rbd:<pool>/<vm uuid>_disk:id=<ceph user>