Ceph

From Wikitech

Ceph is a software defined storage system, offering high levels of scalability and flexibility.

Clusters

We're currently run three Ceph clusters in the production realm at WMF.

Hostnames Data Centre Primary Owners Use Cases Hosts Status
cloudceph eqiad WMCS Block storage capability for Cloud VPS 3 MON

34 OSD

Production
cloudceph (-dev) codfw WMCS Dev cluster for testing functionality and upgrades 3 MON

3 OSD

Production
ceph eqiad Data Engineering Block storage for workloads on the dse-k8s kubernetes cluster

Object storage using the S3 API in support of Analytics and similar workloads.

5 OSD Pre-production

We had previously (around 2013) evaluated Ceph as a potential replacement for Swift, in support of distributed storage of media objects for Mediawiki. The evaluation did not result in a migration, so as of Janualry 2023 Swift is still used for all Mediawiki object storage requirements.

For the WMCS Ceph cluster details, please see: Portal:Cloud_VPS/Admin/Ceph

For the Data Engineering ceph cluster details, please see: Data Engineering/Systems/Ceph

Ceph Architecture

The Ceph documentation has a Getting Started page as well as a detailed architecture page which are both recommended reads. The text below is intended as a simpler crash course of Ceph, specifically tailored to Wikimedia use cases.

Ceph has a low-level RADOS layer which stores named objects & key/value metadata, on top of which sit a number of different interfaces:

  • librados: Low-level access using a native interface for client programs explicitly written for Ceph
  • radosgw: an Amazon S3/Openstack Swift compatible HTTP gateway
  • RBD: a block device layer for e.g. hosting virtual machines
  • CephFS: a POSIX-compliant distributed file system

(note that the above interfaces are not interchangeable, i.e. you can't expect to store files using radosgw and then read them over CephFS or the other way around.)

A Ceph storage cluster has three essential daemon types:

  • Object Storage Device (OSD): stores data, handles data replication and reports health status on itself and its peers to the monitor. One OSD maps to one filesystem and usually to one block device (a disk or disk array). Therefore, it is expected for multiple OSDs to run on a server, one per each disk. OSDs also have the capability of journaling (spooling) writes to a separate file, a functionality that is being used to increase performance by putting journals on SSDs.
  • Monitor (mon): the initial contact point for clients & OSDs; maintains the cluster state & topology (the "cluster map"), which it distributes to clients and updates it based on events coming from OSDs ("osd 5 is down") or admin action. A cluster SPOF, however multiple mons can and is recommended to run on a high-availability setup (handled internally) to increase resiliency. A quorum (majority) of monitors must agree the cluster map, e.g. 3 out of 5, 4 out of 6, etc. It is recommended to run an odd number of monitors.
  • Metadata Server (MDS): stores metadata for CephFS. Optional, only needed if CephFS is needed.

Data come from Ceph clients using any of the above interfaces and are stored as objects in RADOS. RADOS has multiple pools to store data, each with different characteristics (e.g. replica count). Objects in a pool then map to a relatively small number of partitions called placement groups (PGs) which are then mapped via an algorithm called CRUSH [1] to a set of OSDs (replica count). Each of the (e.g. 3) OSDs then maps data to files in its (e.g. XFS) filesystem using an internal hierarchy that has nothing to do with the object's name or hierarchy.

Each RADOS client is aware of the whole cluster hierarchy (given to it by the monitor) and connects directly to the OSDs (and hence, servers) for reads & writes, thus eliminating SPOFs & centralized bottlenecks. Note, however, that for all intents and purposes, radosgw itself is such a client and hence radosgw servers are choke points for HTTP clients.

Troubleshooting

Ceph's cluster operations and troubleshooting guide are more comprehensive external resources; see also Ceph Object Storage troubleshooting.

Note that "Ceph Object Storage" in the official Ceph documentation is alternative terminology for the Rados Gateway (radosgw), which we use at WMF.

The following concepts may help with respect to any Ceph troubleshooting:

  • OSDMap: an OSD may have a status of up/down or in/out. Down means that Ceph is supposed to place data on that OSD but it can't reach it, so the PGs that are supposed to be stored there are in a "degraded" status. Ceph is self-healing and automatically "outs" a down OSD after a period of time (by default 15 minutes), i.e. it starts reallocating PGs to other OSDs in the cluster and recovering them from their replicas.
  • PGMap: a PG may have multiple states. The healthy ones are anything that contains active+clean (ignore scrubbing & scrubbing+deep, these indicate normal consistency checks run periodically). Others are also non-fatal, like recovering & remapped, but some (like incomplete) are especially bad; refer to the Ceph manual in case of doubt.
  • MONMap: a monitor may be down; in that case, ceph quorum_status will be useful in pinpointing the issue futher.

See also