Data Engineering/Systems/Ceph

The Data Engineering team is currently evaluating Ceph for two purposes:

  1. To consider the use of Ceph and its S3-compatible interface as a replacement for (or addition to) Hadoop's HDFS file system (see the access sketch just after this list)
  2. To provide block storage capability to workloads running on the dse-k8s Kubernetes cluster.
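
Purpose 1 relies on the S3 API that radosgw exposes (see the cluster architecture below), so standard S3 client tooling can be pointed at the cluster. The sketch below is illustrative only; the endpoint and bucket names are hypothetical placeholders rather than this cluster's real configuration, and credentials are omitted:

s3cmd --host=rgw.example.wmnet --host-bucket=rgw.example.wmnet ls s3://example-bucket
aws s3 ls s3://example-bucket --endpoint-url https://rgw.example.wmnet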

Project Status

The project is in a pre-production state. We have five servers in eqiad.

Here is the original design document

The Phabricator epic ticket is: T324660

Cluster Architecture

At present we run a co-located configuration with five hosts. Each host runs:

  • 1 monitor daemon (mon)
  • 20 object storage daemons (osd)
  • 1 (or more) radosgw daemons

Each host has a 10 Gbps network connection to its switch. If we start to hit this throughput ceiling, we can request an increase to 25 Gbps by changing the optics.
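
A few standard Ceph status commands (shown here without their output) can be used to confirm this daemon layout and the overall cluster health:

sudo ceph -s         # overall health summary, including mon quorum and osd counts
sudo ceph mon stat   # lists the monitor daemons and the current quorum
sudo ceph osd stat   # how many osds exist and how many are up and in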

Storage Configuration

Each of the five hosts has the following primary storage devices:

Count  Capacity  Technology  Make/Model                     Total Capacity (per host)  Use Case
12     18 TB     HDD         Seagate Exos X18 nearline SAS  216 TB                     Cold tier
8      3.8 TB    SSD         Kioxia RM6 mixed-use           30.4 TB                    Hot tier

That makes the raw capacity of the five-node cluster:

  • Cold tier: 1.08 PB
  • Hot tier: 152 TB
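
For reference, the arithmetic is 12 × 18 TB = 216 TB and 8 × 3.8 TB = 30.4 TB per host, multiplied by five hosts. One way to cross-check these figures against the live cluster is the raw storage summary, which breaks capacity down by device class (Ceph reports binary units, TiB, so the numbers will differ slightly):

sudo ceph df   # RAW STORAGE section: total, used and available capacity per device class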

CRUSH Rule Configuration

CRUSH is an acronym for Controlled Replication Under Scalable Hashing. It is the algorithm that determines where in the storage cluster each item of data should reside, including how many replicas of that item are kept and/or any parity information that would allow the item to be reconstructed if a storage device is lost.

For more detailed information on CRUSH, please refer to https://docs.ceph.com/en/reef/rados/operations/crush-map/

CRUSH rules form part of the CRUSH map. The Ceph monitor (aka mon) servers hold the current CRUSH map and distribute it to clients, which enables clients to locate data within the cluster: a client uses the map to compute which osd(s) hold the data it needs, then communicates directly with those osd processes when reading and writing. The mon does not proxy the data itself, which avoids a network bottleneck. This is analogous to the way in which HDFS namenodes provide a metadata service for clients, which then communicate directly with HDFS datanodes.
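
This placement computation can be inspected directly. For a given pool and object name (both hypothetical below), ceph osd map reports the placement group and the set of osds that CRUSH selects:

sudo ceph osd map example-pool example-object   # prints the pg id and the up/acting osd set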

We currently have the following four CRUSH rules in place on the DPE Ceph cluster.

btullis@cephosd1005:~$ sudo ceph osd crush rule ls
hdd
ssd
rbd-data-ssd
rbd-data-hdd

The first two rules listed, hdd and ssd, are for replicated pools, selecting for the corresponding device classes.

The remaining two rules, rbd-data-ssd and rbd-data-hdd, are for erasure-coded pools, again selecting for the corresponding device classes.
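
For reference, rules of this kind are typically created along the following lines. The rule names, k/m values and failure domain shown here are illustrative, not necessarily what is configured on this cluster:

# Replicated rule restricted to a device class (root=default, failure domain=host)
sudo ceph osd crush rule create-replicated ssd default host ssd

# Erasure-coded pools take their CRUSH rule from an erasure-code profile
sudo ceph osd erasure-code-profile set rbd-data-ssd k=4 m=2 crush-failure-domain=host crush-device-class=ssd
sudo ceph osd pool create rbd-data-ssd erasure rbd-data-ssd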

We have configured our cluster with buckets for row and rack awareness, so the CRUSH algorithm knows each host's placement within the rows and racks.

btullis@cephosd1005:~$ sudo ceph osd crush tree|grep -v 'osd\.'
ID   CLASS  WEIGHT      TYPE NAME                   
 -1         1149.93103  root default                
-19          689.95862      row eqiad-e             
-21          229.98621          rack e1             
 -3          229.98621              host cephosd1001
-22          229.98621          rack e2             
 -7          229.98621              host cephosd1002
-23          229.98621          rack e3             
-10          229.98621              host cephosd1003
-20          459.97241      row eqiad-f             
-24          229.98621          rack f1             
-13          229.98621              host cephosd1004
-25          229.98621          rack f2             
-16          229.98621              host cephosd1005

The osd objects (currently numbered 0-99) are then assigned to the host objects, so our data is always distributed across hosts, racks, and rows.
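
For illustration, a hierarchy like the one above is normally built with commands of the following form (the bucket names are taken from the tree output above; the actual configuration here may instead be managed by Puppet or by crush location settings in ceph.conf):

sudo ceph osd crush add-bucket eqiad-e row                 # create a row bucket
sudo ceph osd crush move eqiad-e root=default              # attach the row to the default root
sudo ceph osd crush add-bucket e1 rack                     # create a rack bucket
sudo ceph osd crush move e1 row=eqiad-e                    # nest the rack under the row
sudo ceph osd crush move cephosd1001 rack=e1 row=eqiad-e   # place the host in its rack and row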

Pool Configuration

TODO