Ceph

Clusters

We currently run five Ceph clusters in the production realm at WMF (the apus entry below covers two clusters, one per site).


Hostnames	Docs	Data Centre	Primary Owners	Use Cases	Hosts	Status
cloudceph	Cloud VPS/Admin/Ceph	eqiad	WMCS	Block storage capability for Cloud VPS	3 MON, 34 OSD	Production
cloudceph (-dev)		codfw	WMCS	Dev cluster for testing functionality and upgrades	3 MON, 3 OSD	Production
ceph	Data Platform/Systems/Ceph	eqiad	Data Platform Engineering	Block storage for workloads on the dse-k8s Kubernetes cluster; object storage using the S3 API in support of Analytics and similar workloads	5 OSD	Pre-production
apus (moss-*)	Ceph/Cephadm	eqiad & codfw	Data Persistence	Multisite RGW cluster serving S3; one Ceph cluster per site, deployed with cephadm	3 MON/MGR/OSD per site, 2 RGW per site	Production

We previously (around 2013) evaluated Ceph as a potential replacement for Swift, in support of distributed storage of media objects for MediaWiki. The evaluation did not result in a migration, so as of January 2023 Swift is still used for all MediaWiki object storage requirements.

Ceph Architecture

The Ceph documentation has an Intro to Ceph page as well as a detailed architecture page, both of which are recommended reads. The text below is intended as a simpler crash course on Ceph, specifically tailored to Wikimedia use cases.

Ceph has a low-level RADOS layer which stores named objects & key/value metadata, on top of which sit a number of different interfaces:

librados: Low-level access using a native interface for client programs explicitly written for Ceph
radosgw: an Amazon S3/OpenStack Swift compatible HTTP gateway
RBD: a block device layer for e.g. hosting virtual machines
CephFS: a POSIX-compliant distributed file system

Note that the above interfaces are not interchangeable: you can't expect to store files using radosgw and then read them over CephFS, or the other way around.

A Ceph storage cluster has three essential daemon types:

Object Storage Device (OSD): stores data, handles data replication, and reports health status on itself and its peers to the monitor. One OSD maps to one filesystem and usually to one block device (a disk or disk array), so a server is expected to run multiple OSDs, one per disk. OSDs can also journal (spool) writes to a separate file, which is used to increase performance by putting journals on SSDs.
Monitor (mon): the initial contact point for clients & OSDs. It maintains the cluster state & topology (the "cluster map"), distributes it to clients, and updates it based on events coming from OSDs ("osd 5 is down") or admin action. A single monitor is a cluster SPOF, so running multiple mons in a high-availability setup (handled internally) is recommended to increase resiliency. A quorum (strict majority) of monitors must agree on the cluster map, e.g. 3 out of 5 or 4 out of 6. An odd number of monitors is recommended, since adding one more to make an even count raises the quorum size without letting the cluster tolerate any additional failures.
Metadata Server (MDS): stores metadata for CephFS. Optional; only required if CephFS is used.
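
The quorum rule for monitors can be illustrated with a little arithmetic. The following is a minimal sketch, not anything taken from Ceph itself; the function names quorum_size and tolerated_failures are made up for this example. It shows why 5 and 6 monitors tolerate the same number of failures.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors."""
    if num_mons < 1:
        raise ValueError("a cluster needs at least one monitor")
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many monitors can be down while a quorum can still form."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    # Print quorum size and failure tolerance for small monitor counts.
    for n in range(1, 8):
        print(f"{n} mons: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} down")
```

Going from 5 to 6 monitors raises the quorum from 3 to 4, but both setups survive only 2 monitors being down, which is why even counts buy nothing.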

Data comes from Ceph clients using any of the above interfaces and is stored as objects in RADOS. RADOS has multiple pools to store data, each with different characteristics (e.g. replica count). Objects in a pool map to a relatively small number of partitions called placement groups (PGs), which are in turn mapped via an algorithm called CRUSH [1] to a set of OSDs, one per replica. Each of the (e.g. 3) OSDs then maps data to files in its (e.g. XFS) filesystem using an internal hierarchy that has nothing to do with the object's name or hierarchy.
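
The two-step mapping (object to PG, then PG to OSDs) can be sketched as a toy model. This is deliberately not the real CRUSH algorithm: it uses SHA-256 and rendezvous hashing as stand-ins, and it ignores weights, failure domains and the CRUSH hierarchy. All function names, the pool name and the OSD ids are hypothetical. What it does share with the real thing is that placement is computed deterministically from the name and the cluster layout, with no central lookup table.

```python
from __future__ import annotations

import hashlib


def _hash(*parts: str) -> int:
    """Stable 64-bit hash of the given strings (stand-in for Ceph's own hash)."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to one of pg_num placement groups in its pool."""
    return _hash(object_name) % pg_num


def pg_to_osds(pool: str, pg: int, osd_ids: list[int], replicas: int) -> list[int]:
    """Pick `replicas` distinct OSDs for a PG by ranking OSDs on a per-PG hash.

    This is rendezvous (highest-random-weight) hashing, NOT real CRUSH.
    """
    if replicas > len(osd_ids):
        raise ValueError("not enough OSDs for the requested replica count")
    ranked = sorted(
        osd_ids,
        key=lambda osd: _hash(pool, str(pg), str(osd)),
        reverse=True,
    )
    return ranked[:replicas]


def locate(pool: str, object_name: str, pg_num: int,
           osd_ids: list[int], replicas: int = 3) -> tuple[int, list[int]]:
    """Return (pg, osds) for an object, computed purely from its name."""
    pg = object_to_pg(object_name, pg_num)
    return pg, pg_to_osds(pool, pg, osd_ids, replicas)


if __name__ == "__main__":
    # Hypothetical pool with 128 PGs spread over 10 OSDs.
    print(locate("example-pool", "image.jpg", 128, list(range(10))))
```

Because every client can run the same computation, it can contact the right OSDs directly once it knows the cluster map.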

Each RADOS client is aware of the whole cluster hierarchy (given to it by the monitor) and connects directly to the OSDs (and hence, servers) for reads & writes, thus eliminating SPOFs & centralized bottlenecks. Note, however, that for all intents and purposes, radosgw itself is such a client and hence radosgw servers are choke points for HTTP clients.

Troubleshooting

Ceph's cluster operations and troubleshooting guides are more comprehensive external resources; see also Ceph Object Gateway troubleshooting.

Note that "Ceph Object Gateway" in the official Ceph documentation is alternative terminology for the RADOS Gateway (radosgw), which we use at WMF.

The following concepts may help with respect to any Ceph troubleshooting:

OSDMap: an OSD may have a status of up/down or in/out. Down means that Ceph is supposed to place data on that OSD but it can't reach it, so the PGs that are supposed to be stored there are in a "degraded" status. Ceph is self-healing and automatically "outs" a down OSD after a period of time (by default 15 minutes), i.e. it starts reallocating PGs to other OSDs in the cluster and recovering them from their replicas.
PGMap: a PG may have multiple states. The healthy ones are anything that contains active+clean (ignore scrubbing & scrubbing+deep, these indicate normal consistency checks run periodically). Others are also non-fatal, like recovering & remapped, but some (like incomplete) are especially bad; refer to the Ceph manual in case of doubt.
MONMap: a monitor may be down; in that case, ceph quorum_status will be useful in pinpointing the issue further.
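
As a rough illustration of the PGMap and MONMap points above, the sketch below classifies a PG state string and lists monitors missing from the quorum. The helper names and the three category labels are invented for this example, and the sample monitor names are hypothetical. The quorum helper assumes the JSON from ceph quorum_status has a quorum_names list and a monmap.mons list of entries with a name field; check this against the output of your Ceph release before relying on it.

```python
from __future__ import annotations

HEALTHY_FLAGS = {"active", "clean"}
# States the text above calls out as especially bad; not an exhaustive list.
FATAL_FLAGS = {"incomplete"}


def classify_pg_state(state: str) -> str:
    """Classify a '+'-joined PG state string such as 'active+clean+scrubbing'."""
    flags = set(state.split("+"))
    if flags & FATAL_FLAGS:
        return "bad"
    if HEALTHY_FLAGS <= flags:
        # Anything containing both active and clean is healthy,
        # including periodic scrubbing / deep scrubbing.
        return "healthy"
    return "transitional"  # e.g. recovering, remapped: non-fatal


def mons_out_of_quorum(quorum_status: dict) -> list[str]:
    """Names of monitors in the monmap that are not part of the quorum.

    Assumes the field names described in the lead-in paragraph.
    """
    in_quorum = set(quorum_status["quorum_names"])
    return [
        mon["name"]
        for mon in quorum_status["monmap"]["mons"]
        if mon["name"] not in in_quorum
    ]


if __name__ == "__main__":
    # Hypothetical, hand-written sample; not real cluster output.
    sample = {
        "quorum_names": ["mon-a", "mon-b"],
        "monmap": {"mons": [{"name": "mon-a"}, {"name": "mon-b"}, {"name": "mon-c"}]},
    }
    print(classify_pg_state("active+clean+scrubbing+deep"))
    print(mons_out_of_quorum(sample))
```

In practice the dictionary would come from parsing the command's JSON output rather than from a hand-written sample.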
