Ceph/Cephadm
Cephadm is Ceph's new approach to cluster deployment and management, where all the Ceph daemons are deployed in containers. This page documents how Cephadm is used at WMF. For a full introduction to Cephadm, see the upstream documentation.
In outline, Puppet is used to template out service specification files, which are passed to Cephadm to bootstrap and later manage the cluster. Other necessary configuration (e.g. Envoy for TLS termination, Prometheus for metrics collection) is managed by Puppet in the usual way.
A particular feature of the "apus" service is that it is one S3 service provided by two Ceph clusters, one in eqiad and one in codfw. The two clusters participate in a multisite configuration based on one realm, one zonegroup, and two zones, and arrange for objects to be asynchronously replicated between the two data centers. This means that write latency (and Ceph-internal traffic, recovery, etc.) is not dependent on the inter-DC link.
Interacting with the cluster
Most commands you find in the documentation (that start ceph ...) need to be run from inside a suitable environment. You get one of these by running sudo cephadm shell on the controller node of the cluster (the cephadm::controller role in Puppet).
The first command to run in almost any situation is ceph -s, which gives you the cluster status. For example:
$ sudo cephadm shell
Inferring fsid 3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
Inferring config /var/lib/ceph/3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18/mon.moss-be1001/config
Using ceph image with id '8a8e0d5b8d82' and tag '18.2.2-wmf7' created on 2024-08-23 13:25:25 +0000 UTC
docker-registry.wikimedia.org/ceph@sha256:544e15aa1d48d0801e69731ca7c925ecccf54d6ff503528ff9f4034067e94ec3
root@moss-be1001:/# ceph -s
  cluster:
    id:     3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum moss-be1001,moss-be1003,moss-be1002 (age 4w)
    mgr: moss-be1001.eshmpf(active, since 3w), standbys: moss-be1002.jadwfz, moss-be1003.yxfdls
    osd: 48 osds: 48 up (since 4w), 48 in (since 3M)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   9 pools, 2273 pgs
    objects: 673 objects, 525 MiB
    usage:   4.4 TiB used, 262 TiB / 266 TiB avail
    pgs:     2273 active+clean

  io:
    client:   39 KiB/s rd, 0 B/s wr, 44 op/s rd, 21 op/s wr
Here you can see cephadm shell figuring out which image to use and how to connect to the cluster, and then the ceph -s command showing the cluster in a healthy state.
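Beyond ceph -s, a few other read-only commands are often useful for a first look at the cluster; these are general Ceph/cephadm commands (all run inside the cephadm shell), not specific to our setup:
# expand on any warning or error conditions
ceph health detail
# list all cephadm-managed daemons and which host each runs on
ceph orch ps
# list the hosts cephadm knows about
ceph orch host ls
# show OSDs grouped by host, with their up/down and in/out state
ceph osd tree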
Upstream has a useful general troubleshooting guide as well as documentation on cephadm-specific troubleshooting.
S3 / Ceph Object Gateway
The point of the "apus" cluster is to be an S3-compatible object store. The S3 endpoint is provided in Ceph by the Ceph Object Gateway, often referred to as RGW; upstream have a troubleshooting guide. Remember that we are using Envoy for TLS termination, so its dashboards can be helpful.
Generally, you want to use the apus.discovery.wmnet endpoint (which will direct you to the RGW in the nearest DC), but the same credentials will work with the DC-specific apus.svc.eqiad.wmnet and apus.svc.codfw.wmnet endpoints, which might be useful if you wanted to check an object had completed replicating, for instance.
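For instance, to confirm an object exists in both zones you can query each DC-specific endpoint directly. A minimal sketch using s3cmd, where the bucket and object names are placeholders and credentials are assumed to already be in your s3cmd configuration:
# ask each zone's endpoint for metadata about the same object
s3cmd --host=apus.svc.eqiad.wmnet --host-bucket=apus.svc.eqiad.wmnet info s3://example-bucket/example-object
s3cmd --host=apus.svc.codfw.wmnet --host-bucket=apus.svc.codfw.wmnet info s3://example-bucket/example-object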
The radosgw-admin command is useful for many administrative actions, and should be run from within a cephadm shell environment. Upstream has an admin guide that describes common operations.
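As a rough illustration, some common read-only operations look like this (the bucket name is a placeholder):
# list S3 users known to the gateway
radosgw-admin user list
# list all buckets
radosgw-admin bucket list
# show usage statistics for a particular bucket
radosgw-admin bucket stats --bucket=example-bucket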
It's worth noting again that we are using a multisite setup, with one realm ("apus") containing one zonegroup ("apus_zg") containing two zones ("eqiad" and "codfw"). Objects are asynchronously replicated between these two zones. You can use the radosgw-admin sync status command to check replication is working; if everything is caught up, the output will look like this:
$ sudo cephadm shell -- radosgw-admin sync status
Inferring fsid 3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18
Inferring config /var/lib/ceph/3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18/mon.moss-be1001/config
Using ceph image with id '8a8e0d5b8d82' and tag '18.2.2-wmf7' created on 2024-08-23 13:25:25 +0000 UTC
docker-registry.wikimedia.org/ceph@sha256:544e15aa1d48d0801e69731ca7c925ecccf54d6ff503528ff9f4034067e94ec3
          realm 5d3dbc7a-7bbf-4412-a33b-124dfd79c774 (apus)
      zonegroup 19f169b0-5e44-4086-9e1b-0df871fbea50 (apus_zg)
           zone 64f0dd71-48bf-45aa-9741-69a51c083556 (eqiad)
   current time 2024-09-26T13:50:27Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync no sync (zone is master)
      data sync source: acc58620-9fac-476b-aee3-0f640100a3bb (codfw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
If replication is lagging, it will be noted here, along with an estimate of how long the lag is.
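If you need to dig into the replication state of a single bucket rather than the whole zone, radosgw-admin also has a per-bucket view; a sketch (bucket name is a placeholder):
# report sync state for one bucket against each source zone
radosgw-admin bucket sync status --bucket=example-bucket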
Disk Failure
Disks fail. The usual sign of this will be an alert that a systemd unit (corresponding to the OSD on the failed disk) has failed, and the cluster will be in HEALTH_WARN state, with a cephadm warning about the failed daemon. Replacing the disk is pretty easy, with the wrinkle that you have to tell cephadm to stop managing the disk service before you start (otherwise, you'll wipe the old disk, and cephadm will notice that there's an available blank disk in the host and immediately try and build a new OSD on it!). So the process looks like:
- ceph orch set-unmanaged osd.rrd_NVMe - Stop managing the disk service
- ceph orch osd rm XX --zap - remove osd XX, wipe the associated storage
- Wait until ceph orch osd rm status says the removal has finished
- Ask the DC team to swap the drive
- Make sure the new disk is visible to the OS (which may need sudo megacli -pdmakejbod -physdrv [YY:ZZ] -a0 and/or sudo megacli -CfgForeign -Clear -a0)
- ceph orch set-managed osd.rrd_NVMe - Make a cup of tea (it takes a little while for cephadm to notice the new disk and build the new OSD)
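Before starting on the steps above, it can help to confirm which OSD, host, and device are actually affected. Something like the following (general Ceph commands, run inside a cephadm shell) should narrow it down:
# expand on the health warning to see which daemon has failed
ceph health detail
# list only the OSDs that are currently down, grouped by host
ceph osd tree down
# show the devices cephadm can see on each host, and their status
ceph orch device ls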
Cephadm Configuration
Puppet templates out a number of configuration files into /etc/cephadm:
- bootstrap-ceph.conf - used in bootstrapping a new cluster, sets a CRUSH rule and which mgr modules are used
- hosts.yaml - defines the hosts in a cluster, used when adding new hosts
- osd_spec.yaml - tells Ceph how to use the storage on nodes to build OSDs
- rgw_spec.yaml - tells Ceph where to run RGWs
- zone_spec.yaml - defines the realm, zonegroup, zone, and endpoint for the S3 service
Most of these are only used for initial cluster setup, but might need adjusting if we need to change aspects of how the cluster is operated. The need for rgw_spec.yaml and zone_spec.yaml to be separate is because of a limitation in the rgw module of cephadm: you cannot specify zone endpoints and rgw placements in the same spec file.
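To see which of these specs cephadm is currently applying (and how it has interpreted them), the orchestrator can dump its live service specifications; for example, inside a cephadm shell:
# summary of the services cephadm is managing
ceph orch ls
# dump the full service specifications as YAML
ceph orch ls --export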
Cluster info in hiera
While there are three Cephadm roles in Puppet (cephadm::storage, cephadm::controller, and cephadm::rgw), how clusters are actually assembled depends on the cephadm_clusters defined in hiera. As an example:
cephadm_clusters:
  apus:
    cluster_name: 'apus-eqiad'
    mon_network: 2620:0:861:100::/56
    controller: 'moss-be1001.eqiad.wmnet'
    osds: &osds
      - moss-be1001.eqiad.wmnet
      - moss-be1002.eqiad.wmnet
      - moss-be1003.eqiad.wmnet
    monitors: *osds
    rgws:
      - moss-fe1001.eqiad.wmnet
      - moss-fe1002.eqiad.wmnet
The cluster label (here "apus") should match the profile::cephadm::cluster_label that is set in hiera for the nodes in that cluster. The mon_network is the network which all mons for the cluster must be on, and is the IPv6 private network for the relevant DC.
Image Building
Our image policy is that all images must be built locally. Rather than replicating upstream's complex machinery for image building, we build simpler images ourselves out of the docker-images repository, based on the Debian packages pulled from upstream into our local apt repository.
Bootstrapping a Cluster
This documents how the apus clusters were bootstrapped; it ought to be repeatable for new clusters, provided suitable hosts and hiera are ready and Puppet has run on all cluster nodes.
Start on the controller node, and bootstrap an initial cluster containing the hosts (but no OSDs or RGWs):
sudo cephadm --image docker-registry.wikimedia.org/ceph:18.2.2-wmf7 bootstrap --skip-dashboard --skip-firewalld --cleanup-on-failure --skip-monitoring-stack --ssh-private-key /root/.ssh/id_cephadm --ssh-public-key /root/.ssh/id_cephadm.pub --mon-ip IPV6_OF_INITIAL_MON --config /etc/cephadm/bootstrap-ceph.conf --apply-spec /etc/cephadm/hosts.yaml
Adjust the image version (and mon-ip) as necessary. If this completes successfully, you should be able to run sudo cephadm shell and connect to the cluster, and then ceph -s should show you all the hosts you expect (as well as warnings about the lack of OSDs).
Then set up OSDs by telling cephadm about your OSD specification:
sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/osd_spec.yaml
Then go and make a cup of tea - it takes time to bring up all the OSDs successfully, and the cluster will be in some odd states in the meantime, which it's not worth worrying about!
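If you want to watch progress while you wait, the OSDs appear incrementally; for example, inside a cephadm shell:
# OSDs appear here as cephadm builds them on each host
ceph osd tree
# overall state; expect it to settle to HEALTH_OK once all OSDs are up and in
ceph -s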
If you want cluster metrics to be collected (which you almost certainly do), then enable the prometheus module and set a 60s scraping interval (as that's what we have configured in puppet for these clusters):
ceph config set mgr mgr/prometheus/scrape_interval 60.0
ceph mgr module enable prometheus
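To check that the module is enabled and the interval took effect, something like this should work (inside a cephadm shell):
# the prometheus module should appear in the enabled modules list
ceph mgr module ls
# echoes back the configured scrape interval
ceph config get mgr mgr/prometheus/scrape_interval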
You now have a working Ceph cluster. If you're setting up multisite, next go and set up the other cluster, and only proceed once both are in HEALTH_OK.
Set Up Master Zone
Bootstrap the new zone (due to a limitation in cephadm, specifying endpoints means no RGWs will be started):
sudo cephadm shell -- ceph rgw realm bootstrap -i - </etc/cephadm/zone_spec.yaml
That should finish with output like "Realm(s) created correctly. Please, use 'ceph rgw realm tokens' to get the token." Then you need to set up the zonegroup hostnames (which will be the discovery and DC-specific hostnames to be used - Ceph needs to know what hostnames it's serving), which unfortunately involves editing JSON:
- radosgw-admin zonegroup get > jsonfile
- edit jsonfile
- radosgw-admin zonegroup set --infile /dev/stdin < jsonfile
You want to edit the jsonfile to update the "hostnames" section of the zonegroup, e.g.:
{
"id": "19f169b0-5e44-4086-9e1b-0df871fbea50",
"name": "apus_zg",
[...]
"hostnames": [
"apus.discovery.wmnet",
"apus.svc.codfw.wmnet",
"apus.svc.eqiad.wmnet"
],
[...]
}
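Note that in a multisite configuration, upstream's documentation follows zonegroup changes with a period commit so that they propagate; if the edited hostnames don't seem to take effect, it may be worth running (inside a cephadm shell):
radosgw-admin period update --commit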
Then you need to actually start up some RGWs by telling cephadm about your RGW spec file:
sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/rgw_spec.yaml
To go any further, you'll need to have got your service into LVS at least to the lvs_setup state, so that your endpoints in eqiad and codfw are reachable; that process is documented elsewhere.
Set Up Secondary Zone
Make sure you can reach your master zone from the secondary DC (curl is a sufficient tool for testing this). Then you need a realm token from the master zone to set up the secondary. Get this by running the following against the master Ceph cluster:
sudo cephadm shell -- ceph rgw realm tokens
Take care with the output of this, as the "token" is a credential (it's base64-encoded, so you can paste it into base64 -d if you want to check that e.g. the endpoint is correct). Then you want to take /etc/cephadm/zone_spec.yaml and edit it to add a line
rgw_realm_token: "PASTE TOKEN HERE"
after the rgw_realm line - either pause puppet to stop it overwriting your change, or copy the file to a new location while you work (and delete it once finished). Once that's done, create the new zone (in the existing zonegroup):
sudo cephadm shell -- ceph rgw zone create -i - < /etc/cephadm/zone_spec.yaml
If that works, then you just need to start some RGWs as before:
sudo cephadm shell -- ceph orch apply -i - < /etc/cephadm/rgw_spec.yaml
If it fails, it's worth double-checking that you can reach the relevant endpoint in the other DC (the error messages if you can't are really unhelpful).
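Once the secondary zone's RGWs are up, it's worth confirming that replication has started, using the same command described earlier (run against either cluster):
sudo cephadm shell -- radosgw-admin sync status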