Data Engineering/Systems/Ceph
The Data Engineering team is currently evaluating Ceph for two purposes:
- To consider the use of Ceph and its S3-compatible interface as a replacement for (or addition to) Hadoop's HDFS file system
- To provide block storage capability to workloads running on the dse-k8s Kubernetes cluster.
Project Status
The project is in a pre-production state. We have five servers in eqiad.
Here is the original design document
The Phabricator epic ticket is: T324660
Cluster Architecture
At present we run a co-located configuration with five hosts. Each host runs:
- 1 monitor daemon (mon)
- 20 object storage daemons (osd)
- 1 (or more) radosgw daemons
Each host has a 10 Gbps network connection to its switch. If we start to hit this throughput ceiling, we can request an upgrade to 25 Gbps, which only requires changing the optics.
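A quick way to check that these daemons are present and healthy is the cluster status summary, which reports the mon quorum together with osd and radosgw counts. The commands below are illustrative and can be run from any cluster host:

# Overall health, mon quorum, and daemon/service counts
btullis@cephosd1005:~$ sudo ceph -s
# Mon quorum membership in one line
btullis@cephosd1005:~$ sudo ceph mon stat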
Storage Configuration
Each of the five hosts has the following primary storage devices:
| Count | Capacity (each) | Technology | Make/Model | Total Capacity | Use Case |
|---|---|---|---|---|---|
| 12 | 18 TB | HDD | Seagate Exos X18 nearline SAS | 216 TB | Cold tier |
| 8 | 3.8 TB | SSD | Kioxia RM6 mixed-use | 30.4 TB | Hot tier |
That makes the raw capacity of the five-node cluster:
- Cold tier: 1.08 PB
- Hot tier: 152 TB
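These figures can be cross-checked against what the cluster itself reports. The following commands are shown for illustration:

# Raw capacity and usage, broken down by device class and pool
btullis@cephosd1005:~$ sudo ceph df
# Per-osd capacity, utilisation, and CRUSH weight
btullis@cephosd1005:~$ sudo ceph osd df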
CRUSH Rule Configuration
CRUSH is an acronym for Controlled Replication Under Scalable Hashing. It is the algorithm that determines where in the storage cluster each item of data resides, including how many replicas of that item are kept and/or what parity information is stored so that the item can be reconstructed if a storage device is lost.
For more detailed information on CRUSH, please refer to https://docs.ceph.com/en/reef/rados/operations/crush-map/
CRUSH rules form part of the CRUSH map. The Ceph monitor (mon) servers load the CRUSH map into memory and use it to enable clients to locate data within the cluster: when a client requests data, a mon responds with the location of that data, including the osd(s) on which it resides. This design avoids a network bottleneck, since the mon does not proxy the data itself; clients communicate directly with the osd processes when reading and writing data. This is analogous to the way in which HDFS namenodes provide a metadata service while clients transfer data directly to and from HDFS datanodes.
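This placement calculation can be observed directly with ceph osd map, which reports the placement group and the acting set of osds for a given object name without reading or writing any data. The pool and object names below are hypothetical:

# Run the CRUSH calculation for a hypothetical object in a hypothetical pool
btullis@cephosd1005:~$ sudo ceph osd map my-pool my-object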
We currently have the following four CRUSH rules in place on the DPE Ceph cluster.
btullis@cephosd1005:~$ sudo ceph osd crush rule ls
hdd
ssd
rbd-data-ssd
rbd-data-hdd
The first two rules, hdd and ssd, are for replicated pools and select devices of the corresponding device class. The last two rules, rbd-data-ssd and rbd-data-hdd, are for erasure coded pools, again selecting devices of the corresponding device class.
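As a rough sketch of how rules like these are defined (the k/m values and failure domains below are illustrative assumptions, not necessarily what is configured on this cluster), a replicated rule is created directly against a device class, whereas an erasure coded rule is generated from an erasure code profile:

# Replicated rule restricted to the hdd device class, host failure domain
btullis@cephosd1005:~$ sudo ceph osd crush rule create-replicated hdd default host hdd
# Erasure code profile pinned to the ssd device class (k=4, m=2 illustrative),
# then an erasure coded rule generated from that profile
btullis@cephosd1005:~$ sudo ceph osd erasure-code-profile set rbd-data-ssd k=4 m=2 crush-device-class=ssd crush-failure-domain=host
btullis@cephosd1005:~$ sudo ceph osd crush rule create-erasure rbd-data-ssd rbd-data-ssd

The details of any existing rule can be inspected with ceph osd crush rule dump <rule-name>.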
We have configured our cluster with row and rack buckets, so the CRUSH algorithm is aware of each host's placement within the rows and racks.
btullis@cephosd1005:~$ sudo ceph osd crush tree|grep -v 'osd\.'
ID CLASS WEIGHT TYPE NAME
-1 1149.93103 root default
-19 689.95862 row eqiad-e
-21 229.98621 rack e1
-3 229.98621 host cephosd1001
-22 229.98621 rack e2
-7 229.98621 host cephosd1002
-23 229.98621 rack e3
-10 229.98621 host cephosd1003
-20 459.97241 row eqiad-f
-24 229.98621 rack f1
-13 229.98621 host cephosd1004
-25 229.98621 rack f2
-16 229.98621 host cephosd1005
The osd objects (currently numbered 0-99, 20 per host) are assigned to the host buckets, so our data is always distributed across hosts, racks, and rows.
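The row and rack buckets are created, and hosts placed beneath them, explicitly. A sketch of the kind of commands involved, using one row, rack, and host as an example (in practice this may be handled by provisioning automation):

# Create row and rack buckets and attach them to the hierarchy
btullis@cephosd1005:~$ sudo ceph osd crush add-bucket eqiad-e row
btullis@cephosd1005:~$ sudo ceph osd crush move eqiad-e root=default
btullis@cephosd1005:~$ sudo ceph osd crush add-bucket e1 rack
btullis@cephosd1005:~$ sudo ceph osd crush move e1 row=eqiad-e
# Move the host (and therefore its osds) under the rack
btullis@cephosd1005:~$ sudo ceph osd crush move cephosd1001 rack=e1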
Pool Configuration
TODO