Dumps/Dumpsdata hosts


XML Dumpsdata hosts

Hardware

We have various hosts:

  • Dumpsdata1001 in eqiad, production xml/sql dumps NFS spare, to be decommed:
    Hardware/OS: PowerEdge R730xd, Debian 10 (buster), 32GB RAM, 1 quad-core Xeon E5-2623 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1002 in eqiad, production xml/sql dumps NFS spare, to be decommed:
    Hardware/OS: PowerEdge R730xd, Debian 10 (buster), 32GB RAM, 1 quad-core Xeon E5-2623 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1003 in eqiad, production misc dumps NFS:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 quad-core Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1004 in eqiad, production xml/sql dumps NFS spare:
    Hardware/OS: PowerEdge R740xd, Debian 10 (buster), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1005 in eqiad, production xml/sql dumps NFS spare:
    Hardware/OS: PowerEdge R740xd, Debian 10 (buster), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1006 in eqiad, production xml/sql dumps NFS primary:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1007 in eqiad, production xml/sql and misc dumps NFS fallback:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS

Services

The production host is NFS-mounted on the snapshot hosts; generated dumps are written there and rsynced from there to the web and rsync servers.
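A quick way to confirm the mount on a snapshot host (a sketch only; the exact mount point and server hostname come from puppet):

    # list NFS mounts; expect an entry served by the current primary dumpsdata host
    findmnt -t nfs,nfs4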

Deploying a new host

Now done by dc ops!

You'll need to set up the raid arrays by hand. The single LVM volume mounted on /data is ext4.
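Some sanity checks once the arrays and volume exist (a sketch; exact device and volume names will vary):

    lsblk            # confirm the RAID 10 data volume and the RAID 1 OS disks
    sudo lvs         # confirm the single LVM volume backing /data
    df -T /data      # confirm /data is mounted and formatted ext4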

You'll want to make sure your host has the right partman recipe in netboot.cfg (configured in modules/profile/data/profile/installserver/preseed.yaml); at this writing that is dumpsdata100X.cfg. This sets up sda1 as /boot, one LVM volume as / and one as /data.
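To double-check which recipe a host is assigned, you can grep the preseed config in a puppet repo checkout; a minimal sketch:

    git grep -n 'dumpsdata' modules/profile/data/profile/installserver/preseed.yaml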

Install in the usual way (add to puppet, copying a pre-existing production dumpsdata host stanza, set up everything for PXE boot and go).

The initial role for the host can be dumps::generation::server::spare, applied by the dc ops folks, and we can then move it into production from there. This requires creating the file hieradata/hosts/dumpsdataXXXX.yaml (substitute in the right number), stealing from any existing spare or from a fallback host. They will also need to add the new host to the key profile::dumps::peer_hosts in hieradata/common/profile/dumps.yaml.
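A hedged sketch of those hiera changes (the hostnames here are invented for illustration):

    cd operations/puppet
    cp hieradata/hosts/dumpsdata1005.yaml hieradata/hosts/dumpsdata1008.yaml
    $EDITOR hieradata/hosts/dumpsdata1008.yaml    # adjust for the new host
    # then add dumpsdata1008 to the profile::dumps::peer_hosts key in
    # hieradata/common/profile/dumps.yaml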

Example patch (two patches because we make mistakes): https://gerrit.wikimedia.org/r/c/operations/puppet/+/893031 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/893519

If dc ops is uncomfortable with doing that, they may give it to us with the setup role applied, and we can set it up as dumps::generation::server::spare ourselves, doing the above.

The /data partition will likely be set up wrong, and you'll have to go fiddle with the LVM and expand it and the filesystem to a decent size; check the df when you get the host and see how it is first.
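A minimal sketch of the expansion, assuming the volume group and logical volume are both named "data" (check with lvs first; resize2fs can grow a mounted ext4 filesystem online):

    df -h /data                                  # see what you actually got
    sudo lvextend -l +100%FREE /dev/data/data    # grow the LV into the free space
    sudo resize2fs /dev/data/data                # grow the filesystem to match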

Doing the actual swap in of a spare dumpsdata host for a live one

This is extra fiddly because of NFS.

When you have the new dumpsdata host imaged and with the dumps::generation::server::spare role, you'll want to do roughly the following:

TL;DR version -- make a patch to move the live host to spare and the new one to the new role; turn off puppet on those and all snapshots that mount the live share; make sure any jobs running are gone (rsyncs on dumpsdata hosts, dumps on snapshots, etc); unmount the nfs share wherever it is mounted; run puppet on the live host to make it a spare, on the new host to make it the new role, enable puppet on a testbed snapshot and test; if all good, enable it on any other snapshots where the share was mounted and check them.

Long version:

  • Run rpcinfo -p on the new host and compare to the old host; make sure the ports are the same (see the command sketch after this list). If not, you might reboot the new host; there is a race condition that sometimes causes the various NFS daemons to start before the changes to defaults have been initially applied via puppet.
  • Make sure the current dumps run is complete, for whichever type of NFS share you are replacing (sql/xml dumps, or "misc" dumps). If you are replacing a fallback host that handles both, make sure the sql/xml dumps are complete.
  • Do a final rsync of all dumps data from the live production dumpsdata host to the new one; if you are swapping in a new fallback host, it may be fallback for both sql/xml dumps and 'other' dumps, so you may need rsyncs of both directory trees from the current fallback host. A fallback host for the sql/xml dumps should also have the dumps-stats-sender systemd timer and job removed.
  • Stop puppet on all snapshot hosts that have the share mounted
  • If this is a new SQL/XML dumps primary NFS share, stop and disable the dumps monitor (systemd job) on the snapshot with the dumps::generation::worker::dumper_monitor role
  • Unmount the live share from the snapshots where it is mounted
  • Disable puppet on the live and the new NFS share hosts
  • If this is a new SQL/XML dumps primary NFS share or a new "other dumps" primary NFS share, on the live host, stop and disable the systemd job that rsyncs the data elsewhere
  • Update puppet manifests to make the live share a dumps::generation::server::spare host and apply the appropriate role to the new dumpsdata host
  • On the new dumpsdata host, enable puppet and run it to make that the new live share
  • On the old host, enable puppet and run it to make that a dumpsdata spare
  • If this is a new SQL/XML dumps primary NFS share, on a snapshot testbed, enable and run puppet, check that the mount comes from the new host, and try a test run of dumps (using test config and test output dir)
  • Enable puppet on each snapshot host where it was disabled, run puppet, check that the mount comes from the new location
  • Update this page (the one you are reading right now) with the new info
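A hedged command sketch of the above (hostnames, the cumin host selector, the task ID, and the snapshot mount point are all placeholders; the disable-puppet/enable-puppet and run-puppet-agent wrappers are assumed to be available as on other production hosts):

    # compare NFS ports on the old and new hosts
    rpcinfo -p dumpsdata1006.eqiad.wmnet > /tmp/old.ports
    rpcinfo -p dumpsdata1008.eqiad.wmnet > /tmp/new.ports
    diff /tmp/old.ports /tmp/new.ports

    # disable puppet on the affected snapshots and on both dumpsdata hosts
    sudo cumin 'snapshot10*' 'disable-puppet "dumpsdata swap - T000000"'

    # on each affected snapshot, unmount the live share
    sudo umount /mnt/dumpsdata    # placeholder mount point

    # once the puppet patch is merged, re-enable and run puppet host by host
    sudo enable-puppet "dumpsdata swap - T000000" && sudo run-puppet-agent

    # verify the mount now comes from the new host
    findmnt -t nfs,nfs4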

The puppet patch to swap in a new host

You will need to update the entry for the host in hieradata/hosts depending on what it is going to be doing; check the stanza for the existing live host.

If the server will be primary for SQL/XML dumps, it needs to replace the existing hostname in hieradata/common.yaml for the dumps_nfs_server key. If the server will be primary for the "misc" dumps, it should replace the hostname in that file for the dumps_cron_nfs_server key instead.
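A hypothetical sketch of that edit for a new SQL/XML primary (hostname invented):

    sed -i 's/^dumps_nfs_server:.*/dumps_nfs_server: dumpsdata1008.eqiad.wmnet/' \
        hieradata/common.yaml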

You'll also need to update the host role in manifests/site.pp.

If it is a fallback host, data must be rsynced to it from the primary on a regular basis. To make this happen, add the hostname to the profile::dumps::internal key in hieradata/common/profile/dumps.yaml.

If it is to be a primary host and the old primary is to go away, when you are ready to make the switch you will need to change profile::dumps::generation::worker::common so that the dumpsdatamount resource mounts the new server's filesystem on the snapshot hosts instead of the old primary's.
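To locate the mount definition to update, a grep along these lines should work (the exact manifest path is an assumption):

    git grep -n 'dumpsdatamount' modules/profile/manifests/dumps/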

An example of such a patch is here, swapping in a new host for the SQL/XML primary NFS share: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924949/

Reimaging an old host

This assumes you are following the Server Lifecycle/Reimage procedure.
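The usual entry point is the sre.hosts.reimage cookbook, run from a cumin host; a sketch with placeholder host number and task ID:

    sudo cookbook sre.hosts.reimage --os bullseye -t T000000 dumpsdata100X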

You likely want to preserve all the data on the /data filesystem. To make this happen, set your host in netboot.cfg to use dumpsdata100X-no-data-format.cfg. This requires manual intervention during partitioning; specifically, you'll need to select:

  • /dev/sda1 to use as ext4, mount at /boot, format
  • data vg to use as ext4, mount at /data, preserve all contents
  • root vg to use as ext4, mount at /, format

Write the changes to disk and the install will carry on as desired.
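Once the host is back up, it is worth confirming that the data actually survived:

    df -hT /data    # should show the ext4 volume at its previous size
    ls /data        # the existing dumps trees should still be there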