Ceph/Cephadm

Cephadm is Ceph's new approach to cluster deployment and management, where all the Ceph daemons are deployed in containers. This page documents how Cephadm is used at WMF. For a full introduction to Cephadm, see the upstream documentation.

In outline, Puppet is used to template out service specification files, which are passed to Cephadm to bootstrap and later manage the cluster. Other necessary configuration (e.g. Envoy for TLS termination, Prometheus for metrics collection) is managed by Puppet in the usual way.

A particular feature of the "apus" service is that it is one S3 service provided by two Ceph clusters, one in eqiad and one in codfw. The two clusters participate in a multisite configuration based on one realm, one zonegroup, and two zones, and arrange for objects to be asynchronously replicated between the two data centers. This means that write latency (and Ceph-internal traffic, recovery, etc.) is not dependent on the inter-DC link.

Interacting with the cluster

Most commands you find in the documentation (that start ceph ...) need to be run from inside a suitable environment. You get one of these by running sudo cephadm shell on the controller node of the cluster (the cephadm::controller role in Puppet).

The first command to run in almost any situation is ceph -s, which gives you the cluster status. For example:

$ sudo cephadm shell
Inferring fsid 3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
Inferring config /var/lib/ceph/3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18/mon.moss-be1001/config
Using ceph image with id '8a8e0d5b8d82' and tag '18.2.2-wmf7' created on 2024-08-23 13:25:25 +0000 UTC
docker-registry.wikimedia.org/ceph@sha256:544e15aa1d48d0801e69731ca7c925ecccf54d6ff503528ff9f4034067e94ec3
root@moss-be1001:/# ceph -s
  cluster:
    id:     3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum moss-be1001,moss-be1003,moss-be1002 (age 4w)
    mgr: moss-be1001.eshmpf(active, since 3w), standbys: moss-be1002.jadwfz, moss-be1003.yxfdls
    osd: 48 osds: 48 up (since 4w), 48 in (since 3M)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    pools:   9 pools, 2273 pgs
    objects: 673 objects, 525 MiB
    usage:   4.4 TiB used, 262 TiB / 266 TiB avail
    pgs:     2273 active+clean
 
  io:
    client:   39 KiB/s rd, 0 B/s wr, 44 op/s rd, 21 op/s wr

Here you can see cephadm shell figuring out which image to use and how to connect to the cluster, and then the ceph -s command showing the cluster in a healthy state.
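You can also run a one-off command without entering an interactive shell by passing it after --, as several of the examples below do; for instance, to see the detail behind any health warnings:

sudo cephadm shell -- ceph health detail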

Upstream has a useful general troubleshooting guide as well as documentation on cephadm-specific troubleshooting.

S3 / Ceph Object Gateway

The point of the "apus" cluster is to be an S3-compatible object store. The S3 endpoint is provided in Ceph by the Ceph Object Gateway, often referred to as RGW; upstream have a troubleshooting guide. Remember that we are using Envoy for TLS termination, so its dashboards can be helpful.

Generally, you want to use the apus.discovery.wmnet endpoint (which will direct you to the RGW in the nearest DC), but the same credentials will work with the DC-specific apus.svc.eqiad.wmnet and apus.svc.codfw.wmnet endpoints, which can be useful if, for instance, you want to check that an object has finished replicating.
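For example, to check whether a particular object is visible in a specific DC, you can do a HEAD against that DC's endpoint. This is only a sketch: it assumes you have the relevant S3 credentials configured for the AWS CLI (any S3 client will do), and the bucket and key names here are made up:

aws --endpoint-url https://apus.svc.codfw.wmnet s3api head-object --bucket example-bucket --key example/object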

The radosgw-admin command is useful for many administrative actions, and should be run from within a cephadm shell environment. Upstream has an admin guide that describes common operations.
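For example, to list the S3 users known to the cluster and inspect a bucket (the bucket name here is hypothetical):

sudo cephadm shell -- radosgw-admin user list
sudo cephadm shell -- radosgw-admin bucket stats --bucket=example-bucket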

It's worth noting again that we are using a multisite setup, with one realm ("apus") containing one zonegroup ("apus_zg") containing two zones ("eqiad" and "codfw"). Objects are asynchronously replicated between these two zones. You can use the radosgw-admin sync status command to check replication is working; if everything is caught up, the output will look like this:

$ sudo cephadm shell -- radosgw-admin sync status
Inferring fsid 3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
Inferring config /var/lib/ceph/3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18/mon.moss-be1001/config
Using ceph image with id '8a8e0d5b8d82' and tag '18.2.2-wmf7' created on 2024-08-23 13:25:25 +0000 UTC
docker-registry.wikimedia.org/ceph@sha256:544e15aa1d48d0801e69731ca7c925ecccf54d6ff503528ff9f4034067e94ec3
          realm 5d3dbc7a-7bbf-4412-a33b-124dfd79c774 (apus)
      zonegroup 19f169b0-5e44-4086-9e1b-0df871fbea50 (apus_zg)
           zone 64f0dd71-48bf-45aa-9741-69a51c083556 (eqiad)
   current time 2024-09-26T13:50:27Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync no sync (zone is master)
      data sync source: acc58620-9fac-476b-aee3-0f640100a3bb (codfw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

If replication is lagging, it will be noted here, along with an estimate of how long the lag is.
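If you need to dig into the replication state of a particular bucket (for instance, to see which shards are behind), radosgw-admin can also report sync status per bucket; the bucket name here is hypothetical:

sudo cephadm shell -- radosgw-admin bucket sync status --bucket=example-bucket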

Disk Failure

Disks fail. The usual sign of this will be an alert that a systemd unit (corresponding to the OSD on the failed disk) has failed, and the cluster will be in HEALTH_WARN state, with a cephadm warning about the failed daemon. Replacing the disk is pretty easy, with the wrinkle that you have to tell cephadm to stop managing the disk service before you start (otherwise you'll wipe the old disk, cephadm will notice that there's an available blank disk in the host, and it will immediately try to build a new OSD on it!). So the process looks like this (a concrete command sketch follows the list):

  1. ceph orch set-unmanaged osd.rrd_NVMe - Stop managing the disk service
  2. ceph orch osd rm XX --zap - remove osd XX, wipe the associated storage
  3. Wait until ceph orch osd rm status says the removal has finished
  4. Ask the DC team to swap the drive
  5. Make sure the new disk is visible to the OS (which may need sudo megacli -pdmakejbod -physdrv [YY:ZZ] -a0 and/or sudo megacli -CfgForeign -Clear -a0)
  6. ceph orch set-managed osd.rrd_NVMe
  7. Make a cup of tea (it takes a little while for cephadm to notice the new disk and build the new OSD)
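Put together, and assuming purely for illustration that the failed daemon is osd.12 and the OSD service is osd.rrd_NVMe, the Ceph side of the process looks roughly like this, run from a cephadm shell on the controller:

ceph orch set-unmanaged osd.rrd_NVMe   # stop cephadm managing the OSD service
ceph orch osd rm 12 --zap              # remove osd.12 and wipe its storage
ceph orch osd rm status                # repeat until the removal has finished
# ...drive swapped by DC-ops, new disk made visible to the OS...
ceph orch set-managed osd.rrd_NVMe     # let cephadm build a new OSD on the new disk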

Cephadm Configuration

Puppet templates out a number of configuration files into /etc/cephadm:

  • bootstrap-ceph.conf used in bootstrapping a new cluster, sets a CRUSH rule and which mgr modules are used
  • hosts.yaml defines the hosts in a cluster, used when adding new hosts
  • osd_spec.yaml tells Ceph how to use the storage on nodes to build OSDs
  • rgw_spec.yaml tells Ceph where to run RGWs
  • zone_spec.yaml defines the realm, zonegroup, zone, and endpoint for the S3 service

Most of these are only used for initial cluster setup, but they might need adjusting if we need to change aspects of how the cluster is operated. rgw_spec.yaml and zone_spec.yaml have to be separate files because of a limitation in cephadm's rgw module: you cannot specify zone endpoints and RGW placements in the same spec file.
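To give a flavour of the spec format, a cephadm OSD service spec looks something like the sketch below. This is illustrative only: the real osd_spec.yaml is templated by Puppet, and the host pattern and device selection here are assumptions (the service_id matches the osd.rrd_NVMe service name seen in the disk-failure section above):

service_type: osd
service_id: rrd_NVMe
placement:
  host_pattern: 'moss-be*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0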

Cluster info in hiera

While there are three Cephadm roles in Puppet (cephadm::storage, cephadm::controller, and cephadm::rgw), how clusters are actually assembled depends on the cephadm_clusters defined in hiera. As an example:

cephadm_clusters:
  apus:
    cluster_name: 'apus-eqiad'
    mon_network: 2620:0:861:100::/56
    controller: 'moss-be1001.eqiad.wmnet'
    osds: &osds
      - moss-be1001.eqiad.wmnet
      - moss-be1002.eqiad.wmnet
      - moss-be1003.eqiad.wmnet
    monitors: *osds
    rgws:
      - moss-fe1001.eqiad.wmnet
      - moss-fe1002.eqiad.wmnet

The cluster label (here "apus") should match the profile::cephadm::cluster_label that is set in hiera for the nodes in that cluster. The mon_network is the network that all of the cluster's mons must be on; it is the IPv6 private network for the relevant DC.
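In other words, the hiera data for those nodes will contain something like the following (a sketch; exactly where the setting lives depends on how the role's hieradata is organised):

profile::cephadm::cluster_label: 'apus'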

Image Building

Our image policy is that all images must be built locally. Rather than replicating upstream's complex machinery for image building, we build simpler images ourselves out of the docker-images repository, based on the Debian packages pulled from upstream into our local apt repository.

Bootstrapping a Cluster

This documents how the apus clusters were bootstrapped; it ought to be repeatable for new clusters once suitable hosts and hiera are ready and Puppet has run on all the cluster nodes.

Start on the controller node, and bootstrap an initial cluster containing the hosts (but no OSDs or RGWs):

sudo cephadm --image docker-registry.wikimedia.org/ceph:18.2.2-wmf7 bootstrap --skip-dashboard --skip-firewalld --cleanup-on-failure --skip-monitoring-stack --ssh-private-key /root/.ssh/id_cephadm --ssh-public-key /root/.ssh/id_cephadm.pub --mon-ip IPV6_OF_INITIAL_MON --config /etc/cephadm/bootstrap-ceph.conf --apply-spec /etc/cephadm/hosts.yaml

Adjust the image version (and --mon-ip) as necessary. If this completes successfully, you should be able to run sudo cephadm shell and connect to the cluster, and ceph -s should then show you all the hosts you expect (as well as warnings about the lack of OSDs).
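You can also list the hosts cephadm knows about directly:

sudo cephadm shell -- ceph orch host ls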

Then set up OSDs by telling cephadm about your OSD specification:

sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/osd_spec.yaml

Then go and make a cup of tea - it takes time to bring up all the OSDs successfully, and the cluster will be in some odd states in the mean time, which it's not worth worrying about!
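If you'd rather keep an eye on progress than drink tea, you can list the OSD daemons cephadm has created so far with:

sudo cephadm shell -- ceph orch ps --daemon-type osd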

If you want cluster metrics to be collected (which you almost certainly do), then enable the prometheus module and set a 60s scraping interval (as that's what we have configured in puppet for these clusters):

ceph config set mgr mgr/prometheus/scrape_interval 60.0
ceph mgr module enable prometheus
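You can check that the module is enabled with ceph mgr module ls. The exporter listens on the active mgr, on port 9283 by default, so if you're on that host something like this should return metrics:

curl -s http://localhost:9283/metrics | head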

You now have a working Ceph cluster. If you're building a multisite configuration, next go and set up the other cluster, and only proceed once both are in HEALTH_OK.

Set Up Master Zone

Bootstrap the new zone (due to a limitation in cephadm, specifying endpoints means no RGWs will be started):

sudo cephadm shell -- ceph rgw realm bootstrap -i - </etc/cephadm/zone_spec.yaml

That should finish with output like "Realm(s) created correctly. Please, use 'ceph rgw realm tokens' to get the token." Then you need to set up the zonegroup hostnames (which will be the discovery and DC-specific hostnames to be used - Ceph needs to know which hostnames it's serving), which unfortunately involves editing JSON:

  1. radosgw-admin zonegroup get > jsonfile
  2. edit jsonfile
  3. radosgw-admin zonegroup set --infile /dev/stdin < jsonfile

You want to edit the jsonfile to update the "hostnames" section of the zonegroup, e.g.:

{
    "id": "19f169b0-5e44-4086-9e1b-0df871fbea50",
    "name": "apus_zg",
[...]
    "hostnames": [
        "apus.discovery.wmnet",
        "apus.svc.codfw.wmnet",
        "apus.svc.eqiad.wmnet"
    ],
[...]
}
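Upstream's multisite documentation generally follows zonegroup changes with a period commit; if the hostname change doesn't seem to take effect, it may need committing explicitly (run from a cephadm shell as usual):

radosgw-admin period update --commit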

Then you need to actually start up some RGWs by telling cephadm about your RGW spec file:

sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/rgw_spec.yaml

To go any further, you'll need to have got your service into LVS at least to the lvs_setup state, so that your endpoints in eqiad and codfw are reachable; that process is documented elsewhere.

Set Up Secondary Zone

Make sure you can reach your master zone from the secondary DC (curl is a sufficient tool for testing this). Then you need a realm token from the master zone to set up the secondary. Get this by running the following against the master Ceph cluster:

sudo cephadm shell -- ceph rgw realm tokens

Take care with the output of this, as the "token" is a credential (it's base64-encoded, so you can paste it into base64 -d if you want to check that e.g. the endpoint is correct). Then you want to take /etc/cephadm/zone_spec.yaml and edit it to add a line

rgw_realm_token: "PASTE TOKEN HERE"

after the rgw_realm line - either pause Puppet to stop it overwriting your change, or copy the file to a new location while you work (and delete it once finished). Once that's done, create the new zone (in the existing zonegroup):

sudo cephadm shell -- ceph rgw zone create -i - < /etc/cephadm/zone_spec.yaml
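For reference, after adding the token, the top of the spec will look something like this (an illustrative sketch only - the real file is templated by Puppet and contains placement and other fields as well):

rgw_realm: apus
rgw_realm_token: "PASTE TOKEN HERE"
rgw_zonegroup: apus_zg
rgw_zone: codfw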

If the zone create succeeds, you just need to start some RGWs as before:

sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/rgw_spec.yaml

If it fails, it's worth double-checking that you can reach the relevant endpoint in the other DC (the error messages if you can't are really unhelpful).