Portal:Cloud VPS/Admin/Ceph
Ceph is a distributed storage platform that provides shared block storage for Cloud VPS instance disks and operating system images. It is a low-level support system of interest to the Wikimedia Cloud Services team and others who work on the OpenStack hosting platform behind the Cloud VPS product; it is not currently a feature facing Cloud VPS project admins or users.
Architecture
Hardware sizing
Hardware sizing recommendations based on the results of our internal testing and on community guidelines:
Type | Bus | Bandwidth (AVG) | CPU Recommendations | RAM Recommendations |
---|---|---|---|---|
HDD | SATA 6Gb/s | 150MB/s | Xeon Silver 2GHz x 1 cpu core = 2 core-GHz per HDD | 1GB per 1TB of storage |
SSD* | SATA 6Gb/s | 450MB/s | Xeon Silver 2GHz x 2 cpu cores = 4 core-GHz per SSD | 1GB per 1TB of storage |
NVMe | M.2 PCIe 32Gb/s | 3000MB/s | Xeon Gold 2GHz x 5 cpu cores x 2 sockets = 20 core-GHz per NVMe | 2GB per OSD |
* The POC cluster is equipped with SATA SSD drives
Note: when reviewing community sizing guides, keep in mind the types of drives being used and their capabilities.
In addition to the baseline CPU requirements, it's necessary to include additional CPU and RAM for the operating system and Ceph rebuilding, rebalancing and data scrubbing.
Next Phase Ceph OSD Server recommendation:
PowerEdge R440 Rack Server
- 2 x Xeon Silver 4214 CPU 12 cores / 24 threads
- 2 x 32GB RDIMM
- 2 x 240GB SSD SATA (OS Drive)
- 8 x 1.92TB SSD SATA (Data Drive)
- 2 x 10Gb NIC
- No RAID (JBOD Only)
It's estimated that 15 OSD servers will provide enough initial storage capacity for existing virtual machine disk images and block devices.
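As a rough sanity check of that estimate against the sizing table above (a sketch only; the 2.2GHz figure is the Xeon Silver 4214 base clock and the per-drive numbers are taken from the SSD row):

# CPU: 8 data SSDs x 4 core-GHz per SSD, vs 2 sockets x 12 cores x 2.2GHz
echo "core-GHz needed:    $((8 * 4))"          # 32
echo "core-GHz available: 2 x 12 x 2.2 = 52.8"
# RAM: 1GB per 1TB of storage, 8 x 1.92TB is roughly 15.4TB, vs 2 x 32GB installed
echo "RAM needed (GB):    ~16"
echo "RAM available (GB): 64"

The remaining headroom covers the operating system plus the rebuild, rebalance and scrub overhead mentioned above.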
BIOS
Base Settings
- Ensure "Boot mode" is set to "BIOS" and not "UEFI". (This is required for the netboot process)
PXE boot settings
All Ceph hosts are equipped with a 10Gb Broadcom BCM57412 NIC and are not using the embedded onboard NIC.
- During the system boot, when prompted to configure the second Broadcom BCM57412 NIC device press "ctrl + s"
- On the main menu select "MBA Configuration" and toggle the "Boot Protocol" setting to "Preboot Execution Environment (PXE)"
- Press escape, then select "Exit and Save Configurations"
- After the system reboots, press "F2" to enter "System Setup"
- Navigate to "System BIOS > Boot Settings > BIOS Boot Settings"
- Select "Boot Sequence" and change the boot order to: "Hard dive C:", "NIC in Slot 2 Port 1..", "Embedded NIC 1 Port 1..."
- Exit System Setup, saving your changes and rebooting the system
Alternatively, steps 4 through 7 can be replaced with the racadm commands below, but you will still need to enable the PXE boot protocol in the option ROM.
/admin1-> racadm set BIOS.BiosBootSettings.bootseq HardDisk.List.1-1,NIC.Slot.2-1-1,NIC.Embedded.1-1-1
/admin1-> racadm jobqueue create BIOS.Setup.1-1
/admin1-> racadm serveraction hardreset
Storage
Operating System Disks
The Ceph OSD and Monitor servers are both using software mdadm RAID1 for the operating system. This RAID set is built using the first 2 (sda and sdb) SSD drives with the partman/raid1-2dev.cfg profile.
When mixing RAID and NON-RAID devices on the LSI Logic / Symbios Logic MegaRAID SAS-3 3108 storage adapter, the device IDs are mapped in the wrong order for our environment: the NON-RAID devices show up first, followed by the RAID device. Because of this we've chosen software mdadm RAID instead of hardware RAID.
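To confirm the software RAID1 arrays on a host are healthy, the standard mdadm tooling can be used (a generic sketch, not specific to these hosts):

# Both members of each array should show as active ([UU])
cat /proc/mdstat
# Detailed state of the root array
sudo mdadm --detail /dev/md0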
Ceph OSD Disks
Ceph manages data redundancy at the application layer; because of this, each OSD drive is configured as a JBOD in NON-RAID mode.
SSD Drive Details
Server | Total | Vendor | Capacity | Model | Function |
---|---|---|---|---|---|
cloudcephmon100[1-3].wikimedia.org | 2 | Dell | 480GB | SSDSC2KB480G8R | RAID1 operating system |
cloudcephosd100[1-3].wikimedia.org | 2 | Dell | 240GB | SSDSC2KG240G8R | RAID1 operating system |
cloudcephosd100[1-3].wikimedia.org | 8 | Dell | 1.92TB | SSDSC2KG019T8R | JBOD Ceph Bluestore |
Network
Description and configuration hints about the network as related to the Ceph project. Note that our Ceph cluster is IPv4-only: Ceph can only do one or the other, and IPv4 is more useful for our current cloud setup.
switches
Given the lack of 10G ports and physical racking space in eqiad row B, we added new dedicated 10G switches, called cloudsw, in rows C and D.
There are physical cables that connect cloudsw1-d5-eqiad and cloudsw1-c8-eqiad, and then cloudsw1-c8-eqiad to asw2-b2 and asw2-b7.
Both cloudsw* and asw2-b* devices have at least the following 2 VLANs defined and trunked:
- the special storage VLAN/subnet used for inter-cluster sync cloud-storage1-eqiad - vlan 1106 with addressing 192.168.x.y/z
- the SSH management and client traffic VLAN/subnet cloud-hosts1-b-eqiad - vlan 1118 with addressing 10.64.20.0/24
The special storage VLAN/subnet doesn't have a L3 gateway configured anywhere.
That is, we are stretching VLANs across rows.
servers
Ceph storage nodes (cloudcephosdXXXX) have 2 NICs connected to 2 different networks:
- eth0: cloud-hosts1-b-eqiad - vlan 1118 with addressing 10.64.20.0/24 -- used for SSH management and client traffic.
- eth1: cloud-storage1-eqiad - vlan 1106 with addressing 192.168.x.y/z -- used for inter-cluster synchronization.
Ceph monitor nodes (cloudcephmonXXXX) have 1 NIC connected to:
- eth0: cloud-hosts1-b-eqiad - vlan 1118 with addressing 10.64.20.0/24 -- used for SSH management and client traffic.
Hypervisor servers (cloudvirtXXXX) are expected to connect to cloudcephmonXXXX using the cloud-hosts1-b-eqiad subnet, thus a direct connection (no L3 gateway involved).
Operating System
All Ceph hosts will be using Debian 10 (Buster).
Ceph Packaging
Buster does not include a recent version of Ceph and Ceph does not provide packages for Buster (tracked at https://tracker.ceph.com/issues/42870).
Croit provides Debian Buster packages that are built using the Ceph build process and made available to the community (https://croit.io/2019/07/07/2019-07-07-debian-mirror, https://github.com/croit/ceph). These packages have been mirrored to our local APT repository.
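To confirm that a host is installing Ceph from the local mirror rather than from Debian proper, something like the following can be used (ceph-common is just an example package name):

# Shows the candidate version and which repository it would come from
apt-cache policy ceph-common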
Puppet
Roles
- wmcs::ceph::mon Deploys the Ceph monitor and manager daemon to support CloudVPS hypervisors
- wmcs::ceph::osd Deploys the Ceph object storage daemon to support CloudVPS hypervisors
- role::wmcs::openstack::eqiad1::virt_ceph Deploys nova-compute configured with RBD based virtual machines (Note that switching between this role and wmcs::openstack::eqiad1::virt should work gracefully without a rebuild)
Profiles
- profile::ceph::client::rbd Install and configure a Ceph RBD client
- profile::ceph::osd Install and configure Ceph object storage daemon
- profile::ceph::mon Install and configure Ceph monitor and manager daemon
- profile::ceph::alerts Configure Ceph cluster alerts in Icinga
Modules
- ceph Install and configure the base Ceph installation used by all services and clients
- ceph::admin Configures the Ceph administrator keyring
- ceph::mgr Install and configure the Ceph manager daemon
- ceph::mon Install and configure the Ceph monitor daemon
- ceph::keyring Defined resource that manages access control and keyrings
Post Installation Procedures
Adding OSDs
Using cookbooks
Pre-flight checks
You have to make sure that the following is set up correctly:
- Add the host to the site.pp puppet file (so it gets the wmcs::ceph::osd role).
- Add the host info to the hiera variable profile::ceph::osd::hosts (currently in the operations/puppet repo, for example hieradata/eqiad/profile/ceph/osd.yaml). Note that you'll have to add it to the section matching the rack the host is in, and manually assign the next free IP for the internal interface.
- On Netbox, manually add an entry to the IP block for the cloud-private network according to its rack.
Run the cookbook
Once that is merged, you can run the cookbook from your laptop or from the cloudcumin node:
dcaro@vulcanus$ cookbook wmcs.ceph.osd.bootstrap_and_add -h
usage: cookbook [GLOBAL_ARGS] wmcs.ceph.osd.bootstrap_and_add [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg] --cluster-name {eqiad1,codfw1} --osd-hostname OSD_HOSTNAME [--skip-reboot] [--only-check] [--yes-i-know-what-im-doing] [--batch-size BATCH_SIZE] [--wait-for-rebalance] [--force]
WMCS Ceph - Bootstrap a new osd
Usage example:
cookbook wmcs.ceph.osd.bootstrap_and_add \
--new-osd-fqdn cloudcephosd1016.eqiad.wmnet \
--task-id T12345
options:
-h, --help show this help message and exit
--project PROJECT Relevant Cloud VPS openstack project (for operations, dologmsg, etc). If this cookbook is for hardware, this only affects dologmsg calls. Default is 'admin'.
--task-id TASK_ID Id of the task related to this operation (ex. T123456). (default: None)
--no-dologmsg To disable dologmsg calls (no SAL messages on IRC). (default: False)
--cluster-name {eqiad1,codfw1}
Ceph cluster to roll restart. (default: None)
--osd-hostname OSD_HOSTNAME
Hostname of the new OSDs to add. Repeat for each new OSD. If specifying more than one, consider passing --yes-i-know-what-im-doing (default: None)
--skip-reboot If passed, will not do the first reboot before adding the new osds. Useful when the machine has already some running OSDs and you are sure the reboot is not needed. (default: False)
--only-check If passed, will only run the pre-setup checks on the host and report back, nothing more. (default: False)
--yes-i-know-what-im-doing
If passed, will not ask for confirmation. WARNING: this might cause data loss, use only when you are sure what you are doing. (default: False)
--batch-size BATCH_SIZE
Number of osds to bring up at a time to avoid congesting the network, use 0 for all at once. (default: 4)
--wait-for-rebalance If passed, will wait for the cluster to do the rebalancing after adding the new OSDs. Note that this might take several hours. (default: False)
--force If passed, will continue even if the cluster is not in a healthy state. (default: False)
An example run would be:
cookbook wmcs.ceph.osd.bootstrap_and_add --cluster-name eqiad1 --task-id T371878 --osd-hostname cloudcephosd1035 --wait-for-rebalance
Manually
Ensure OSD nodes can authenticate to the Ceph cluster by importing the bootstrap-osd keyring on a monitor node:
root@cloudcephmon2001-dev: ceph auth import -i /var/lib/ceph/bootstrap-osd/ceph.keyring
Locate available disks with lsblk
cloudcephosd1001:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 223.6G 0 disk
├─sda1 8:1 0 46.6G 0 part
│ └─md0 9:0 0 46.5G 0 raid1 /
├─sda2 8:2 0 954M 0 part
│ └─md1 9:1 0 953M 0 raid1 [SWAP]
└─sda3 8:3 0 176.1G 0 part
└─md2 9:2 0 176G 0 raid1
└─cloudcephosd1001--vg-data 253:2 0 140.8G 0 lvm /srv
sdb 8:16 0 223.6G 0 disk
├─sdb1 8:17 0 46.6G 0 part
│ └─md0 9:0 0 46.5G 0 raid1 /
├─sdb2 8:18 0 954M 0 part
│ └─md1 9:1 0 953M 0 raid1 [SWAP]
└─sdb3 8:19 0 176.1G 0 part
└─md2 9:2 0 176G 0 raid1
└─cloudcephosd1001--vg-data 253:2 0 140.8G 0 lvm /srv
sdc 8:32 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sde 8:64 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
To prepare a disk for Ceph, first zap it:
cloudcephosd1001:~# ceph-volume lvm zap /dev/sdc
--> Zapping: /dev/sdc
--> --destroy was not specified, but zapping a whole device will remove the partition table
Running command: /bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10
stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00357845 s, 2.9 GB/s
--> Zapping successful for: <Raw Device: /dev/sdc>
Then prepare, activate and start the new OSD
cloudcephosd1001:~# ceph-volume lvm create --bluestore --data /dev/sde
You may need to explicitly set the osd class to SSD. First check how it displays with
ceph osd tree
If incorrect, set to ssd with:
ceph osd crush rm-device-class osd.<number>
ceph osd crush set-device-class ssd osd.<number>
Common issues
OSDs not joining due to missed heartbeats
This has only happened once, but it was tricky to figure out. The behavior is that newly added OSDs start joining the cluster, but fail to reply to or receive heartbeats from other OSDs and end up getting marked out by the mon nodes.
In the OSD logs you would see messages like:
>$ journalctl -f -u ceph-osd@191.service
...
Aug 17 11:03:37 cloudcephosd1025 ceph-osd[15007]: 2022-08-17 11:03:37.177 7f3559652700 -1 osd.191 17192523 heartbeat_check: no reply from 10.64.20.58:6826 osd.0 ever on either front or back, first ping sent 2022-08-17 11:03:15.776557 (oldest deadline 2022-08-17 11:03:35.776557)
Aug 17 11:03:38 cloudcephosd1025 ceph-osd[15007]: 2022-08-17 11:03:38.133 7f3559652700 -1 osd.191 17192523 heartbeat_check: no reply from 10.64.20.58:6826 osd.0 ever on either front or back, first ping sent 2022-08-17 11:03:15.776557 (oldest deadline 2022-08-17 11:03:35.776557)
Aug 17 11:03:38 cloudcephosd1025 ceph-osd[15007]: 2022-08-17 11:03:38.133 7f3559652700 -1 osd.191 17192523 heartbeat_check: no reply from 10.64.20.58:6826 osd.0 ever on either front or back, first ping sent 2022-08-17 11:03:15.776557 (oldest deadline 2022-08-17 11:03:35.776557)
...
The issue this time was that jumbo frames were not allowed by the network gear on some routes (task T315446). A quick way of checking is:
>$ for ip in $(grep addr /etc/ceph/ceph.conf | awk '{print $4}'); do ping -M do -4 -c 1 -W 1 -s 8972 $ip 1>/dev/null 2>&1 && echo "$ip: ok" || echo "$ip: nook"; done
Creating Pools
To create a new storage pool you will first need to determine the number of placement groups that will be assigned to the new pool. You can use the calculator at https://ceph.io/pgcalc/ to help identify a starting point (note that you can easily increase, but not decrease, this value):
sudo ceph osd pool create eqiad1-compute 512
Enable the RBD application for the new pool
sudo ceph osd pool application enable eqiad1-compute rbd
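To verify the result, the pool's placement group count and application tag can be checked afterwards (a sketch using standard Ceph commands and the pool name from the example above):

sudo ceph osd pool ls detail                        # all pools with pg_num and application metadata
sudo ceph osd pool get eqiad1-compute pg_num        # placement group count for the new pool
sudo ceph osd pool application get eqiad1-compute   # should list "rbd"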
Configuring Client Keyrings
The Cloud VPS hypervisors connect to the compute pool using RBD. After running the following command, the keyring data should be stored in the private hiera file eqiad/profile/ceph/client/rbd.yaml under the profile::ceph::client::rbd::keydata key. (Note that all nova-compute hypervisors are configured to use the same Ceph keyring.)
cloudcephmon1001:~$ sudo ceph auth get-or-create client.eqiad1-compute mon 'profile rbd' osd 'profile rbd pool=compute'
As of 2021-02-23 many of the caps are managed by hand rather than puppetized. To see the current state use
cloudcephmon1001:~$ ceph auth ls
To adjust caps, use 'ceph auth caps'. Here's an example of adding r/w rbd access to one pool and r/o access to another pool to the client.eqiad1-compute keyring:
root@cloudcephmon1001:~# ceph auth caps client.eqiad1-compute mon 'profile rbd' osd 'profile rbd-read-only pool=eqiad1-glance-images, profile rbd pool=eqiad1-compute'
updated caps for client.eqiad1-compute
Monitoring
The Ceph managers are configured to run a Prometheus exporter that exports global cluster metrics. This exporter is only active on one manager at a time; failover between managers is handled directly by Ceph. The cloudmetrics Prometheus server is configured to scrape this exporter.
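To check which manager is currently serving metrics, something along these lines can be run on a monitor host (a sketch; 9283 is the ceph-mgr Prometheus module's default port and the hostname is just an example, adjust both to whichever manager is active):

sudo ceph mgr module ls | grep -A2 prometheus    # confirm the prometheus module is enabled
sudo ceph -s | grep mgr                          # shows the active manager and standbys
curl -s http://cloudcephmon1002.wikimedia.org:9283/metrics | head   # scrape the active manager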
Logs
You can find all related logs here:
Dashboards
You can find all the ceph related dashboards.
Icinga alerts
These alerts are global to the Ceph cluster and configured on the icinga1001 host to avoid getting multiple notifications for a single event.
Ceph Cluster Health
- Description
  - Ceph storage cluster health check
- Status Codes
  - 0 - healthy, all services are healthy
  - 1 - warn, cluster is running in a degraded state, data is still accessible
  - 2 - critical, cluster is failed, some or all data is inaccessible
- Next steps
  - On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph --status. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph --status
  cluster:
    id:     5917e6d9-06a0-4928-827a-f489384975b1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 3w)
    mgr: cloudcephmon1002(active, since 10d), standbys: cloudcephmon1003, cloudcephmon1001
    osd: 24 osds: 24 up (since 3w), 24 in (since 3w)

  data:
    pools:   1 pools, 256 pgs
    objects: 3 objects, 19 B
    usage:   25 GiB used, 42 TiB / 42 TiB avail
    pgs:     256 active+clean
Note: As of 2020-09-10, ceph health checks will sometimes incorrectly report a small number of slow ops. Restarting the active monitor service will clean up these warnings -- that's a good thing to try before obsessing over a slow ops warning.
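A minimal sketch of that procedure (run on the monitor actually reporting the slow ops; the unit name follows the ceph-mon@<hostname> pattern used elsewhere on this page):

sudo ceph health detail                          # identify which mon is reporting slow ops
sudo systemctl restart ceph-mon@$(hostname -s)   # on that monitor host
sudo ceph -s                                     # confirm the warning clears and quorum is intact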
Ceph Monitor Quorum
- Description
  - Verify there are enough Ceph monitor daemons running for proper quorum
- Status Codes
  - 0 - healthy, 3 or more Ceph monitors are running
  - 2 - critical, less than 3 Ceph monitors are running
- Next steps
  - On one of the ceph monitor hosts (e.g. cloudcephmon1001.wikimedia.org) check the output of the command sudo ceph mon stat. Example output from a healthy cluster:
cloudcephmon1001:~$ sudo ceph mon stat
e1: 3 mons at {cloudcephmon1001=[v2:208.80.154.148:3300/0,v1:208.80.154.148:6789/0],cloudcephmon1002=[v2:208.80.154.149:3300/0,v1:208.80.154.149:6789/0],cloudcephmon1003=[v2:208.80.154.150:3300/0,v1:208.80.154.150:6789/0]}, election epoch 24, leader 0 cloudcephmon1001, quorum 0,1,2 cloudcephmon1001,cloudcephmon1002,cloudcephmon1003
Maintenance
Restarting Ceph nodes
You can use the cookbooks for rolling reboots of the Ceph node types; they take care of making sure the cluster is healthy at every step, silencing alerts during the reboot, etc.:
dcaro@vulcanus$ cookbook -l wmcs.ceph.roll
cookbooks
`-- wmcs
`-- wmcs.ceph
|-- wmcs.ceph.roll_reboot_mons
`-- wmcs.ceph.roll_reboot_osds
Restarting Ceph services
Always ensure the cluster is healthy before restarting any services
user@cloudcephosd1001:~$ sudo ceph -s
cluster:
id: 5917e6d9-06a0-4928-827a-f489384975b1
health: HEALTH_OK
services:
mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 14m)
mgr: cloudcephmon1002(active, since 17m), standbys: cloudcephmon1001, cloudcephmon1003
osd: 24 osds: 24 up (since 10w), 24 in (since 10w)
data:
pools: 1 pools, 256 pgs
objects: 41.22k objects, 160 GiB
usage: 502 GiB used, 41 TiB / 42 TiB avail
pgs: 256 active+clean
io:
client: 25 KiB/s wr, 0 op/s rd, 1 op/s wr
ceph-crash
The ceph-crash service can be restarted on all hosts without any service interruption
user@cumin1001:~$ sudo cumin 'P{R:Class = role::wmcs::ceph::mon or R:Class = role::wmcs::ceph::osd}' 'systemctl restart ceph-crash'
[..]
ceph-mgr
The ceph-mgr service can be restarted one by one across the ceph-mon hosts
user@cumin1001:~$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1001'
user@cumin1001:~$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1002'
user@cumin1001:~$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'systemctl restart ceph-mgr@cloudcephmon1003'
ceph-mon
IMPORTANT: This process should be run by the WMCS team
IMPORTANT: Only one ceph-mon can be offline at any time.
The ceph-mon service can be restarted one host at a time across the ceph-mon hosts; in between restarts, ensure the service has rejoined the cluster
user@cumin1001:~$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1001'
user@cumin1001:~$ sudo cumin 'cloudcephmon1001.wikimedia.org' 'ceph -s'
user@cumin1001:~$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1002'
user@cumin1001:~$ sudo cumin 'cloudcephmon1002.wikimedia.org' 'ceph -s'
user@cumin1001:~$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'systemctl restart ceph-mon@cloudcephmon1003'
user@cumin1001:~$ sudo cumin 'cloudcephmon1003.wikimedia.org' 'ceph -s'
Example healthy output from ceph -s
...
  services:
    mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 7s)
    mgr: cloudcephmon1002(active, since 2m), standbys: cloudcephmon1001, cloudcephmon1003
    osd: 24 osds: 24 up (since 10w), 24 in (since 10w)
...
ceph-osd
IMPORTANT: This process should be run by the WMCS team
IMPORTANT: Restarting multiple ceph-osd processes together can cause a service outage or heavy network utilization during data rebalancing.
Until we have more Ceph OSD servers, we'll want to restart each OSD one at a time to ensure the cluster remains healthy.
Identify the systemctl unit files for each OSD
user@cloudcephosd1001:~$ systemctl | grep ceph-osd@[0-9]
ceph-osd@0.service loaded active running Ceph object storage daemon osd.0
ceph-osd@1.service loaded active running Ceph object storage daemon osd.1
ceph-osd@12.service loaded active running Ceph object storage daemon osd.12
ceph-osd@15.service loaded active running Ceph object storage daemon osd.15
ceph-osd@18.service loaded active running Ceph object storage daemon osd.18
ceph-osd@21.service loaded active running Ceph object storage daemon osd.21
ceph-osd@3.service loaded active running Ceph object storage daemon osd.3
ceph-osd@9.service loaded active running Ceph object storage daemon osd.9
Restart the ceph-osd services one by one
user@cloudcephosd1001:~$ sudo systemctl restart ceph-osd@0
Ceph health after restarting a service
user@cloudcephosd1001:~$ sudo ceph -s
cluster:
id: 5917e6d9-06a0-4928-827a-f489384975b1
health: HEALTH_WARN
1 osds down
...
Once the OSD is back online, data verification begins
user@cloudcephosd1001:~$ sudo ceph -s
cluster:
id: 5917e6d9-06a0-4928-827a-f489384975b1
health: HEALTH_WARN
Degraded data redundancy: 1842/123657 objects degraded (1.490%), 11 pgs degraded
...
Within a few seconds the OSD should be fully back online and healthy
user@cloudcephosd1001:~$ sudo ceph -s
cluster:
id: 5917e6d9-06a0-4928-827a-f489384975b1
health: HEALTH_OK
...
Removing or Replacing a failed OSD drive
The process to remove or replace a failed OSD is the same: in both cases the OSD configuration and service are removed from the storage cluster.
- Locate the failed OSD (osd.0 in this example)
cloudcephmon1001:~$ sudo ceph osd tree
ID CLASS WEIGHT   TYPE NAME                 STATUS REWEIGHT PRI-AFF
-1       41.90625 root default
-3       13.96875     host cloudcephosd1001
 0   ssd  1.74609         osd.0               down  1.00000 1.00000
 1   ssd  1.74609         osd.1                 up  1.00000 1.00000
 3   ssd  1.74609         osd.3                 up  1.00000 1.00000
 9   ssd  1.74609         osd.9                 up  1.00000 1.00000
12   ssd  1.74609         osd.12                up  1.00000 1.00000
15   ssd  1.74609         osd.15                up  1.00000 1.00000
18   ssd  1.74609         osd.18                up  1.00000 1.00000
21   ssd  1.74609         osd.21                up  1.00000 1.00000
-5       13.96875     host cloudcephosd1002
 2   ssd  1.74609         osd.2                 up  1.00000 1.00000
 4   ssd  1.74609         osd.4                 up  1.00000 1.00000
 5   ssd  1.74609         osd.5                 up  1.00000 1.00000
10   ssd  1.74609         osd.10                up  1.00000 1.00000
13   ssd  1.74609         osd.13                up  1.00000 1.00000
16   ssd  1.74609         osd.16                up  1.00000 1.00000
19   ssd  1.74609         osd.19                up  1.00000 1.00000
22   ssd  1.74609         osd.22                up  1.00000 1.00000
-7       13.96875     host cloudcephosd1003
 6   ssd  1.74609         osd.6                 up  1.00000 1.00000
 7   ssd  1.74609         osd.7                 up  1.00000 1.00000
 8   ssd  1.74609         osd.8                 up  1.00000 1.00000
11   ssd  1.74609         osd.11                up  1.00000 1.00000
14   ssd  1.74609         osd.14                up  1.00000 1.00000
17   ssd  1.74609         osd.17                up  1.00000 1.00000
20   ssd  1.74609         osd.20                up  1.00000 1.00000
23   ssd  1.74609         osd.23                up  1.00000 1.00000
- Stop the ceph-osd service on the server with the failed OSD
cloudcephosd1001:~$ sudo systemctl stop ceph-osd@0
- Mark the OSD out in Ceph
cloudcephosd1001:~$ sudo ceph osd out 0
- Verify that Ceph is rebuilding the placement groups
cloudcephosd1001:~$ sudo ceph health -w
- Remove the OSD from the CRUSH map
cloudcephosd1001:~$ sudo ceph osd crush remove osd.0
- Remove the OSD authentication keys
cloudcephosd1001:~$ sudo ceph auth del osd.0
- Remove the OSD from Ceph
cloudcephosd1001:~$ sudo ceph osd rm osd.0
- Replace the physical drive
- Follow the process to add a new OSD once the drive is replaced
Clients
Rate limiting
Native
Native RBD rate limiting is supported starting with the Ceph Nautilus release. Due to upstream availability and multiple Debian releases, we will likely have a mixture of older Ceph client versions during phase 1.
Available rate limiting options and their defaults in the Nautilus release:
$ rbd config pool ls <pool> | grep qos
rbd_qos_bps_burst            0    config
rbd_qos_bps_limit            0    config
rbd_qos_iops_burst           0    config
rbd_qos_iops_limit           0    config
rbd_qos_read_bps_burst       0    config
rbd_qos_read_bps_limit       0    config
rbd_qos_read_iops_burst      0    config
rbd_qos_read_iops_limit      0    config
rbd_qos_schedule_tick_min    50   config
rbd_qos_write_bps_burst      0    config
rbd_qos_write_bps_limit      0    config
rbd_qos_write_iops_burst     0    config
rbd_qos_write_iops_limit     0    config
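These options can be set per pool or per image with rbd config, for example (a sketch; the values are illustrative rather than our production settings):

rbd config pool set <pool> rbd_qos_write_iops_limit 500            # cap write IOPS for every image in the pool
rbd config image set <pool>/<image> rbd_qos_bps_limit 250000000    # cap total bandwidth for one image
rbd config pool ls <pool> | grep qos                               # review the effective settings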
OpenStack
IO rate limiting can also be managed using a flavor's metadata. This will trigger libvirt to apply `iotune` limits on the ephemeral disk.
- Available disk tuning options
- disk_read_bytes_sec
- disk_read_iops_sec
- disk_write_bytes_sec
- disk_write_iops_sec
- disk_total_bytes_sec
- disk_total_iops_sec
Example commands to create or modify flavor metadata with rate limiting options roughly equal to a 7200RPM SATA disk:
openstack flavor create \
  --ram 2048 \
  --disk 20 \
  --vcpus 1 \
  --private \
  --project testlabs \
  --id 857921a5-f0af-4069-8ad1-8f5ea86c8ba2 \
  --property quota:disk_total_iops_sec=200 \
  m1.small-ceph
openstack flavor set --property quota:disk_total_bytes_sec=250000000 857921a5-f0af-4069-8ad1-8f5ea86c8ba2
The Ceph host aggregate will pin flavors to the Hypervisors configured to use Ceph
openstack flavor set --property aggregate_instance_extra_specs:ceph=true 857921a5-f0af-4069-8ad1-8f5ea86c8ba2
Example rate limit configuration as seen by libvirt. (virsh dumpxml <instance name>)
<target dev='sda' bus='virtio'/>
<iotune>
  <total_bytes_sec>250000000</total_bytes_sec>
  <total_iops_sec>200</total_iops_sec>
</iotune>
NOTE: Updating a flavor's metadata does not have any effect on existing virtual machines.
When a virtual machine is created, the requested flavor data is copied to the instance, and any future updates to the flavor are ignored. The wmcs-vm-extra-specs script below connects to the Nova database and directly modifies an instance's extra specs.
$ wmcs-vm-extra-specs --help
usage: wmcs-vm-extra-specs [-h] [--nova-db-server NOVA_DB_SERVER] [--nova-db NOVA_DB]
                           [--mysql-password MYSQL_PASSWORD]
                           uuid
                           {quota:disk_read_bytes_sec, quota:disk_read_iops_sec, quota:disk_total_bytes_sec, quota:disk_total_iops_sec, quota:disk_write_bytes_sec, quota:disk_write_iops_sec}
                           spec_value
Glance
Glance images used by Ceph based virtual machines should have the hw_scsi_model=virtio-scsi hw_disk_bus=scsi properties defined. These properties will enable SCSI discard operations and instruct Ceph to remove blocks that have been deleted.
Image create example
openstack image create --file debian-buster-20191218.qcow2 \
  --disk-format "qcow2" \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  --public \
  debian-10.0-buster
(Once Glance is configured to store images in Ceph, the disk-format will change from qcow2 to raw)
Nova compute
Migrating local VMs to Ceph
Switch puppet roles to the Ceph-enabled wmcs::openstack::eqiad1::virt_ceph role. In operations/puppet/manifests/site.pp:
node 'cloudvirt1022.eqiad.wmnet' {
    role(wmcs::openstack::eqiad1::virt_ceph)
}
Run the puppet agent on the hypervisor
hypervisor $ sudo puppet agent -tv
Shutdown the VM
cloudcontrol $ openstack server stop <UUID>
Convert the local QCOW2 image to raw and upload to Ceph
hypervisor $ qemu-img convert -f qcow2 -O raw /var/lib/nova/instances/<UUID>/disk rbd:compute/<UUID>_disk:id=eqiad1-compute
Undefine the virtual machine. This command removes the existing libvirt definition from the hypervisor; when nova next starts the VM it will be redefined with the RBD configuration. (This step can be skipped, but you may notice some errors in nova-compute.log until the VM has been restarted.)
hypervisor $ virsh undefine <OS-EXT-SRV-ATTR:instance_name>
Cleanup local storage files
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk
hypervisor $ rm /var/lib/nova/instances/<UUID>/disk.info
Power on the VM
cloudcontrol $ openstack server start <UUID>
Reverting back to local storage
Shutdown the VM
cloudcontrol $ openstack server stop <UUID>
Convert the Ceph raw image back to QCOW2 on the local hypervisor
hypervisor $ qemu-img convert -f raw -O qcow2 rbd:compute/<UUID>_disk:id=eqiad1-compute /var/lib/nova/instances/<UUID>/disk
Power on the VM
cloudcontrol $ openstack server start <UUID>
CPU Model Type
A virtual machine can only be live-migrated to a hypervisor with a compatible CPU model. CloudVPS currently has multiple CPU models in service and is using the default "host-model" nova configuration.
To enable live migration between any production hypervisor, the cpu_mode parameter should match the lowest hypervisor CPU model.
Hypervisor range | CPU model | Launch date |
---|---|---|
cloudvirt[1023-1030].eqiad.wmnet | Gold 6140 Skylake | 2017 |
cloudvirt[1016-1022].eqiad.wmnet | E5-2697 v4 Broadwell | 2016 |
cloudvirt[1012-1014].eqiad.wmnet | E5-2697 v3 Haswell | 2014 |
cloudvirt[1001-1009].eqiad.wmnet | E5-2697 v2 Ivy Bridge | 2013 |
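To check what a given hypervisor exposes and what nova-compute is currently configured with, something like the following can be used (a sketch; the nova config path is assumed to be /etc/nova/nova.conf):

lscpu | grep 'Model name'                          # physical CPU model as seen by the OS
virsh capabilities | grep -m1 '<model>'            # CPU model libvirt advertises for guests
grep -E 'cpu_mode|cpu_model' /etc/nova/nova.conf   # current nova libvirt CPU settings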
Virtual Machine Images
Important: Using QCOW2 for hosting a virtual machine disk is NOT recommended. If you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), please use the raw image format within Glance.
Once all CloudVPS virtual machines have been migrated to Ceph we can convert the existing virtual machine images in Glance from QCOW2 to raw. This will avoid having nova-compute convert the image each time a new virtual machine is created.
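A sketch of that conversion for a single image (the image names and UUID placeholders are illustrative):

openstack image save --file debian-10.qcow2 <image-uuid>         # download the existing QCOW2 image
qemu-img convert -f qcow2 -O raw debian-10.qcow2 debian-10.raw   # convert to raw
openstack image create --file debian-10.raw \
  --disk-format raw \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  --public \
  debian-10.0-buster-raw                                         # upload the raw copy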
Object Storage via Radosgw
Cloud VPS provides S3- and Swift-compatible object storage, with auth and discovery managed by OpenStack Keystone. These APIs are served by the Ceph rados gateway (radosgw).
Radosgw runs on cloudcontrols and is configured using standard ceph config files. Its keyring is /etc/ceph/ceph.client.radosgw.keyring and its major config is in the [client.radosgw] section of /etc/ceph/ceph.conf. Both of the above are managed by puppet.
Even though the Swift object storage endpoint emulates an OpenStack service (and can be discovered via Keystone), it is managed by different commands. For example, radosgw quotas are managed with the radosgw-admin tool.
VirtIO SCSI devices
Currently CloudVPS virtual machines are configured with the virtio-blk driver. This driver does not support discard/trim operations to free up deleted blocks.
Discard support can be enabled by using the virtio-scsi driver, but it's important to note that the device labels will change from /dev/vda to /dev/sda.
IO Throttles in Nova Flavors
To avoid bandwidth and iops contention between VMs, most VMs are throttled with the following limits:
disk_total_bytes_sec: 10000000
disk_read_iops_sec: 5000
disk_write_iops_sec: 500
Any newly created flavor should be altered with the following commands:
openstack flavor set --property aggregate_instance_extra_specs:ceph=true <flavorid>
openstack flavor set --property quota:disk_total_bytes_sec='200000000' <flavorid>
openstack flavor set --property quota:disk_write_iops_sec='500' <flavorid>
openstack flavor set --property quota:disk_read_iops_sec='5000' <flavorid>
Those numbers are based on the following assumptions:
- Support up to 1000 VMs, of which up to a third may be reading or writing at any one instant
- OSD drives support 42000 write iops and 74000 read iops
- Total ceph cluster supports up to 560000 write iops and 8880000 read iops
- 42000 * 120 (OSDs) /3 /3 = 560000
- 74000 * 120 = 8880000
- source for these formulas: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015479.html
- 10Gb/s duplex networking divided among all VMs
- The 'total_bytes_sec' metric assumes many fewer VMs are reading or writing flat out -- 1/20th the number used for iops. This is based partly on optimism and partly on new instance creation being unacceptably slow with a smaller number.
Some of the math for all this can be found at https://docs.google.com/spreadsheets/d/1_fRqCLLA8zBJP9pFnPEj4x-6aG5yM_ZMFbTHPZTPSeI/edit#gid=1136514968
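As a quick restatement of that arithmetic (numbers taken from the list above; see the linked thread and spreadsheet for the full derivation):

echo $((42000 * 120 / 3 / 3))   # 560000  cluster write iops per the formula above
echo $((74000 * 120))           # 8880000 cluster read iops
echo $((1000 / 3 * 500))        # ~166500 write iops if a third of 1000 VMs hit their 500-iops cap, well under 560000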
Performance Testing
Network
Jumbo frames (9k MTU) have been configured to improve the network throughput and overall network performance, as well as reduce the CPU utilization on the Ceph OSD servers.
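To verify jumbo frames end to end between two hosts, the same approach as the ping one-liner in the OSD heartbeat troubleshooting section above can be used (interface name assumed to be the storage NIC, eth1):

ip link show eth1 | grep mtu             # confirm the interface MTU is 9000
ping -M do -4 -c 3 -s 8972 <remote-ip>   # 8972-byte payload + 28 bytes of headers = 9000, must not fragment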
Baseline (default tuning options)
Iperf options used to simulate Ceph storage IO.
-N      disable Nagle's Algorithm
-l 4M   set read/write buffer size to 4 megabytes
-P      number of parallel client threads to run (one per OSD)
Server:
iperf -s -N -l 4M
Client:
iperf -c <server> -N -l 4M -P 8
cloudcephosd <-> cloudcephosd
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[ 10] 0.0-10.0 sec  2.74 GBytes  2.35 Gbits/sec
[  9] 0.0-10.0 sec   664 MBytes   557 Mbits/sec
[  6] 0.0-10.0 sec   720 MBytes   603 Mbits/sec
[  5] 0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[ 13] 0.0-10.0 sec  1.38 GBytes  1.18 Gbits/sec
[  7] 0.0-10.0 sec   720 MBytes   602 Mbits/sec
[  8] 0.0-10.0 sec   720 MBytes   603 Mbits/sec
[SUM] 0.0-10.0 sec  11.0 GBytes  9.42 Gbits/sec
cloudvirt1022 -> cloudcephosd
cloudvirt1022 <-> cloudcephosd: 8.55 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0-10.0 sec  1.11 GBytes   949 Mbits/sec
[  6] 0.0-10.0 sec  1.25 GBytes  1.07 Gbits/sec
[  4] 0.0-10.0 sec  1.39 GBytes  1.19 Gbits/sec
[  9] 0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[ 10] 0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
[  5] 0.0-10.0 sec  1.36 GBytes  1.16 Gbits/sec
[  3] 0.0-10.0 sec  1.41 GBytes  1.21 Gbits/sec
[  8] 0.0-10.0 sec  1.17 GBytes  1.00 Gbits/sec
[SUM] 0.0-10.0 sec  10.0 GBytes  8.55 Gbits/sec
Ceph RBD
To browse an exhaustive set of performance tests you can go to the cloud ceph test result explorer.
To run the tests you can download the scripts and code.
CLI examples
Create, format, and mount an RBD image (useful for testing/debugging):
$ rbd create datatest --size 250 --pool compute --image-feature layering
$ rbd map datatest --pool compute --name client.admin
$ mkfs.ext4 -m0 /dev/rbd0
$ mount /dev/rbd0 /mnt/
$ umount /mnt
$ rbd unmap /dev/rbd0
$ rbd rm compute/datatest
List RBD nova images
$ rbd ls -p compute
9e2522ca-fd5e-4d42-b403-57afda7584c0_disk
Show RBD image information
$ rbd info -p compute 9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk
rbd image '9051203e-b858-4ec9-acfd-44b9e5c0ecb1_disk':
        size 20 GiB in 5120 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: aec56b8b4567
        block_name_prefix: rbd_data.aec56b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Mon Jan 6 21:36:11 2020
        access_timestamp: Mon Jan 6 21:36:11 2020
        modify_timestamp: Mon Jan 6 21:36:11 2020
View RBD image with qemu tools on a hypervisor
$ qemu-img info rbd:<pool>/<vm uuid>_disk:id=<ceph user>