Data Platform/Systems/Ceph
The Data Platform Engineering team is currently building a pair of Ceph clusters for two primary purposes:
- To consider the use of Ceph and its S3 compatible interface as a replacement for (or addition to) Hadoop's HDFS file system
- To provide block storage capability to workloads running on the dse-k8s Kubernetes cluster.
We are also considering alternative use cases for the S3 based storage, such as Postgres backups and Flink checkpoint storage.
Project Status
The project is nearing a production state. We manage five servers in eqiad. There is a smaller cluster of three servers ready to be configured in codfw.
Here is the original design document for the project. The Phabricator epic ticket for the eqiad cluster deployment is: T324660
Software Components
The following sections give a brief explanation of each of the four principal software components involved in our Ceph Storage Cluster, with a reference to how we run them.
Please refer to Ceph Architecture for a more in-depth explanation of each component. Note that we do not use CephFS on our cluster, so the Metadata Server (MDS) component has been omitted.
Monitor Daemons
Ceph monitor daemons are responsible for maintaining a cluster map, which keeps an up-to-date record of the cluster topology and the location of data objects.
The Paxos algorithm is used to ensure that a quorum of servers are in agreement about the contents of the cluster map.
We run a monitor daemon on all five of our Ceph servers in eqiad, so a quorum requires at least three of these mon servers to be active.
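As a quick health check, the monitor quorum can be inspected from any cluster host. This is a minimal sketch, assuming shell access to one of the cephosd hosts and a working admin keyring; both commands are standard Ceph CLI calls.
# Show the monitors currently in quorum and the elected leader
$ sudo ceph mon stat
# A more detailed, JSON-formatted view of the quorum and monmap
$ sudo ceph quorum_status --format json-pretty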
Manager Daemons
Ceph manager daemons run alongside monitor daemons in order to provide additional monitoring and management interfaces. Only one manager daemon is ever active at a time, but there may be several standbys; Ceph itself manages the election of a standby mgr daemon to the active role. In our eqiad cluster we have five manager daemons, since we run one alongside each monitor daemon.
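The active/standby state of the manager daemons can be checked with the standard Ceph CLI; a minimal sketch, assuming shell access to a cluster host.
# Report which mgr daemon is currently active and how many standbys exist
$ sudo ceph mgr stat
# The cluster status summary also includes an mgr line listing the active daemon and standbys
$ sudo ceph -s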
Object Storage Daemons
A Ceph cluster will contain many Object Storage Daemons (referred to as OSDs). There is usually a 1:1 relationship between an OSD and a physical storage device, such as a hard drive (HDD) or a solid-state drive (SSD). These OSDs may be started and stopped individually, in order to bring storage devices into and out of service.
Our clusters use the Bluestore OSD back end, rather than the deprecated Filestore back end. This means that each OSD maintains its own RocksDB database of object metadata, complete with a write-ahead log (WAL). The physical location of the WAL and the block.db database file can be specified independently of the OSD backing store, in order to optimise performance and availability. Please see https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/ for a more comprehensive explanation of Bluestore.
In our eqiad cluster we have 20 OSDs per host, for a total of 100 OSDs in the cluster. 60 of these are backed by hard drives and 40 by SSDs. We use a high-performance NVMe drive to host the WAL and RocksDB databases of the HDD backed OSDs.
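The split between HDD-backed and SSD-backed OSDs, and the placement of each OSD's RocksDB/WAL, can be inspected with the standard tooling. A minimal sketch, assuming shell access to a cephosd host:
# List the CRUSH device classes in use and the OSDs assigned to each
$ sudo ceph osd crush class ls
$ sudo ceph osd crush class ls-osd hdd
$ sudo ceph osd crush class ls-osd ssd
# On an individual host, show each local OSD's data device and its block.db device
$ sudo ceph-volume lvm list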
Rados Gateway Daemons
The rados gateway daemons (referred to as radosgw) are different from the mon, mgr, and osd daemons in that they are not an internal component of the Ceph cluster; they are clients of the cluster.
They serve an HTTP interface that provides the Ceph Object Gateway, which is the S3 and Swift compatible interface to the storage services.
In our eqiad cluster we run a rados gateway on each of our five hosts and the hostname of our S3/Swift interface is: https://rgw.eqiad.dpe.anycast.wmnet
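A simple way to confirm that the gateway is answering is to make an unauthenticated request to the endpoint; radosgw should respond with an S3-style XML document for the anonymous user. A minimal sketch:
# An anonymous request to the service endpoint; expect an XML ListAllMyBucketsResult response
$ curl -s https://rgw.eqiad.dpe.anycast.wmnet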
Cluster Architecture
At present we run a co-located configuration on our cluster in eqiad comprising five hosts, each of which is in a 2U enclosure that is optimised for storage.
The server names are cephosd100[1-5].eqiad.wmnet and they are identical to each other in terms of hardware.
Each server contains two shelves, each containing twelve hot-swappable 3.5" drive bays. At the rear of the chassis are two hot-swappable bays, which contain drives that are used for the operating system.
Each host currently has a 10 Gbps network connection to its switch. If we start to hit this throughput ceiling, we can request to increase this to 25 Gbps by changing the optics.
Storage Device Configuration
Each of the five hosts contains the following primary storage devices:
Count | Capacity | Technology | Make/Model | Total capacity (per host) | Use case |
---|---|---|---|---|---|
12 | 18 TB | HDD | Seagate Exos X18 nearline SAS | 216 TB | Cold tier |
8 | 3.8 TB | SSD | Kioxia RM6 mixed-use | 30.4 TB | Hot tier |
That makes the raw capacity of the five-node cluster:
- Cold tier: 1.08 PB
- Hot tier: 152 TB
In order to increase the performance of the cold tier, which is backed by hard drives, we employ an NVMe device as a cache device. This is partitioned such that each of the HDD based OSD daemons can store its Bluestore database and journal on it, increasing performance of these systems considerably.
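On our hosts this layout is prepared by puppet, but conceptually it corresponds to creating each HDD-backed OSD with ceph-volume and pointing its block.db at a partition or logical volume on the NVMe device. A minimal sketch with hypothetical device names:
# Create a BlueStore OSD on an HDD, with its RocksDB/WAL on an NVMe partition
# (/dev/sdc and /dev/nvme0n1p3 are placeholder device names, not values from our hosts)
$ sudo ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3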
Ceph Software Configuration
Each of our hosts runs the following Ceph services:
- 1 monitor daemon (ceph-mon)
- 1 manager daemon (ceph-mgr)
- 20 object storage daemons (ceph-osd)
- 1 crash monitoring daemon (ceph-crash)
- 1 rados gateway daemon (radosgw)
btullis@cephosd1004:~$ pstree -T|egrep 'ceph|radosgw'
|-ceph-crash
|-ceph-mgr
|-ceph-mon
|-20*[ceph-osd]
|-radosgw
This cluster uses Ceph packages that are distributed by the upstream project at: https://docs.ceph.com/en/latest/install/get-packages/#apt and are integrated with reprepro.
Puppet is used to configure all of the daemons on this cluster.
Additional Services
Alongside the Ceph daemons, each of the hosts in the cluster runs the following components:
- An instance of envoy, which acts as a TLS termination endpoint for the local radosgw service.
- An instance of bird, which provides support for the anycast load-balancing that we use for the radosgw service.
Cluster Configuration
Configuration Methods
The cluster configuration is partly managed with puppet and partly by hand.
Puppet Configuration
We intend to migrate more of the configuration into puppet over time. One thing to bear in mind is that we share the ceph puppet module with the WMCS team, as they also use it for their clusters. However, we apply different puppet profiles (ceph vs cloudceph) to our clusters, in order to allow for some variance in the configuration style. Be aware of this potential impact on the WMCS team when modifying puppet. In addition to this, be aware that there is a cephadm profile and a cephadm module in puppet. These are not relevant to the configuration of the DPE Ceph cluster, as they are only in use for the new apus cluster that is managed by the Data Persistence team.
Elements of the cluster configuration that are managed by puppet:
- Ceph package installation.
- Ceph package installation.
- The /etc/ceph/ceph.conf configuration file.
- All daemons (mon, mgr, osd, crash, radosgw).
- Preparation and activation of the hardware storage devices beneath each of the osd daemons.
- All Ceph client users (note that this is a different concept from a Ceph object storage user; object storage users are not yet managed with puppet).
Manual Configuration
The following are currently managed by hand on the cluster; illustrative commands for these tasks are sketched after the list.
- Pool creation
- Associating pools with applications
- CRUSH maps and rules, which affect data placement
- Maintenance flags, such as noout
- radosgw user management, for S3 and Swift access (task T374531 was created to address this)
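The following is a rough sketch of the kinds of commands involved in these manual tasks; the pool and user names are hypothetical placeholders rather than values from our cluster.
# Create a replicated pool on the ssd CRUSH rule and associate it with an application
$ sudo ceph osd pool create example-pool 32 32 replicated ssd
$ sudo ceph osd pool application enable example-pool rbd
# Set (and later clear) the noout flag around planned maintenance, so that
# OSDs on a rebooting host are not marked out and rebalanced away
$ sudo ceph osd set noout
$ sudo ceph osd unset noout
# Create a radosgw user for S3/Swift access
$ sudo radosgw-admin user create --uid=example-user --display-name="Example User"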
CRUSH Rules
CRUSH is an acronym for Controlled Replication Under Scalable Hashing. It is the algorithm that determines where in the storage cluster any item of data should reside, including attributes such as the number of replicas of that item and/or any parity information that would allow the item to be reconstructed in the event of any loss of a storage device.
For more detailed information on CRUSH, please refer to https://docs.ceph.com/en/reef/rados/operations/crush-map/
CRUSH rules form part of the CRUSH map. The Ceph monitor (aka mon) servers maintain the CRUSH map and distribute it to clients, which enables those clients to locate data within the cluster. When a client reads or writes data, it uses the CRUSH map to calculate which osd(s) hold the data and then communicates directly with those osd processes. This avoids a network bottleneck, since the mon does not proxy the data itself. This is analogous to the way in which HDFS namenodes provide a metadata service for clients that communicate directly with HDFS datanodes.
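This placement calculation can be observed directly with the ceph osd map command, which reports the placement group and the set of OSDs that a given object maps to. A minimal sketch, using one of our pools and a hypothetical object name:
# Show which placement group and OSDs an object in a given pool maps to
$ sudo ceph osd map dse-k8s-csi-ssd example-object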
We currently have the following two CRUSH rules in place on the DPE Ceph cluster.
btullis@cephosd1005:~$ sudo ceph osd crush rule ls
hdd
ssd
btullis@cephosd1004:~$ sudo ceph osd crush rule dump
[
{
"rule_id": 1,
"rule_name": "hdd",
"type": 1,
"steps": [
{
"op": "take",
"item": -4,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "ssd",
"type": 1,
"steps": [
{
"op": "take",
"item": -6,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
The hdd and ssd rules are used for replicated pools, selecting for the corresponding device classes.
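Replicated rules that select a device class can be created with ceph osd crush rule create-replicated. For example, a rule roughly equivalent to our ssd rule (root default, host failure domain, ssd device class) could be created as in this sketch:
# Create a replicated CRUSH rule named ssd, rooted at default,
# with host as the failure domain and restricted to the ssd device class
$ sudo ceph osd crush rule create-replicated ssd default host ssd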
We have configured our cluster with buckets for row and rack awareness, so the CRUSH algorithm is aware of the host placement within the rows.
btullis@cephosd1005:~$ sudo ceph osd crush tree|grep -v 'osd\.'
ID CLASS WEIGHT TYPE NAME
-1 1149.93103 root default
-19 689.95862 row eqiad-e
-21 229.98621 rack e1
-3 229.98621 host cephosd1001
-22 229.98621 rack e2
-7 229.98621 host cephosd1002
-23 229.98621 rack e3
-10 229.98621 host cephosd1003
-20 459.97241 row eqiad-f
-24 229.98621 rack f1
-13 229.98621 host cephosd1004
-25 229.98621 rack f2
-16 229.98621 host cephosd1005
The osd objects (currently numbered 0-99) are then assigned to the host objects, so our data is always distributed between hosts, rows, and racks.
Pools
In a Ceph cluster, a pool is a logical partitioning of objects. Pools are associated with an application, specifically the mgr, rbd, radosgw, or cephfs applications.
We can list the pools with the ceph osd pool ls or ceph osd lspools commands.
btullis@cephosd1004:~$ sudo ceph osd lspools
2 .mgr
7 dse-k8s-csi-ssd
8 .rgw.root
9 eqiad.rgw.log
10 eqiad.rgw.control
11 eqiad.rgw.meta
12 eqiad.rgw.buckets.index
13 eqiad.rgw.buckets.data
14 eqiad.rgw.buckets.non-ec
All pool names beginning with a . are reserved for use internally by the cluster, so do not attempt to modify these pools. In our case we have the .mgr pool, which is used internally by the manager (ceph-mgr) daemons.
In our case, we have created the dse-k8s-csi-ssd pool for use with the rbd application and the Kubernetes integration described below. It is backed by SSDs and is a replicated pool with 3 replicas.
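The replication factor and CRUSH rule of a pool can be verified with the ceph osd pool get command; a minimal sketch:
# Confirm the replica count and CRUSH rule of the dse-k8s-csi-ssd pool
$ sudo ceph osd pool get dse-k8s-csi-ssd size
$ sudo ceph osd pool get dse-k8s-csi-ssd crush_rule
# A fuller view of all pools, including replica counts, rules, and applications
$ sudo ceph osd pool ls detail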
All of the pools with rgw in their names are related to the radosgw application and underpin the S3/Swift object storage capability.
Kubernetes Integration
We have enabled block storage volumes for workloads on the dse-k8s Kubernetes cluster by means of the Ceph-CSI (Container Storage Interface) project.
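From the point of view of a workload on dse-k8s, the integration is consumed through a StorageClass backed by the RBD CSI driver. The following is a minimal sketch only; the StorageClass name ceph-rbd-ssd is a hypothetical placeholder, not necessarily what is configured on the cluster.
# Request a Ceph-backed block volume from a hypothetical StorageClass
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd-ssd
EOF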
Object Storage
We have enabled the Ceph Object Gateway (radosgw) in order to provide S3 and Swift compatible APIs and object storage.
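Clients can talk to the gateway with any S3-compatible tool. A minimal sketch using s3cmd, with placeholder credentials that would need to be replaced by keys issued via radosgw-admin:
# List buckets and create a new one via the S3 API (credentials and bucket name are placeholders)
$ s3cmd --host=rgw.eqiad.dpe.anycast.wmnet \
        --host-bucket=rgw.eqiad.dpe.anycast.wmnet \
        --access_key=EXAMPLEKEY --secret_key=EXAMPLESECRET \
        ls
$ s3cmd --host=rgw.eqiad.dpe.anycast.wmnet \
        --host-bucket=rgw.eqiad.dpe.anycast.wmnet \
        --access_key=EXAMPLEKEY --secret_key=EXAMPLESECRET \
        mb s3://example-bucket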