Dumps/Dumpsdata hosts
XML Dumpsdata hosts
Hardware
We have various hosts:
- Dumpsdata1003 in eqiad, production misc dumps nfs
  - Hardware/OS: PowerEdge R730xd, Debian 11 (bullseye), 64GB RAM, 1 quad-core Intel Xeon Silver 4112 cpu, HT enabled
  - Disks: 12 4TB disks in one 12-disk raid10 volume; two 1TB disks in raid 1 for the OS
- Dumpsdata1004 in eqiad, production xml/sql dumps nfs spare
  - Hardware/OS: PowerEdge R740XD, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
  - Disks: 12 4TB disks in one 12-disk raid10 volume; two 1TB disks in raid 1 for the OS
- Dumpsdata1005 in eqiad, production xml/sql dumps nfs spare
  - Hardware/OS: PowerEdge R740XD, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
  - Disks: 12 4TB disks in one 12-disk raid10 volume; two 1TB disks in raid 1 for the OS
- Dumpsdata1006 in eqiad, production xml/sql dumps nfs primary
  - Hardware/OS: PowerEdge R740XD, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
  - Disks: 12 4TB disks in one 12-disk raid10 volume; two 1TB disks in raid 1 for the OS
- Dumpsdata1007 in eqiad, production xml/sql and misc dumps nfs fallback
  - Hardware/OS: PowerEdge R740XD, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
  - Disks: 12 4TB disks in one 12-disk raid10 volume; two 1TB disks in raid 1 for the OS
Services
The production host is nfs-mounted on the snapshot hosts; generated dumps are written there and rsynced from there to the web and rsync servers.
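A quick way to see where a given snapshot host is getting its data is to list the NFS mounts there; this is a minimal sketch using only standard tools (the exact mount point path varies with the role, so it is not hard-coded here), and the timer listing on the dumpsdata side is just a way to find the rsync units, whose names vary:
 # on a snapshot host: list all NFS mounts and check which dumpsdata host serves them
 findmnt -t nfs,nfs4
 # on the production dumpsdata host: the outbound copies are driven by systemd timers
 systemctl list-timers | grep -i rsync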
Deploying a new host
Now done by dc ops!
You'll need to set up the raid arrays by hand. The single LVM volume mounted on /data is ext4. Make sure your host has the right partman recipe in netboot.cfg (configured in modules/profile/data/profile/installserver/preseed.yaml); at this writing that is dumpsdata100X.cfg. This recipe sets up sda1 as /boot, one LVM volume as / and one as /data.
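One quick way to confirm the recipe assignment is to look the host up in the preseed data in an operations/puppet checkout; a sketch (the new host should be matched by, or added to, the existing dumpsdata entry):
 # in an operations/puppet checkout
 grep -n -B2 -A2 'dumpsdata' modules/profile/data/profile/installserver/preseed.yaml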
Install in the usual way (add to puppet, copying a pre-existing production dumpsdata host stanza, set up everything for PXE boot and go).
The initial role for the host, as handed over by the dc ops folks, can be dumps::generation::server::spare; we can then move it into production from there. This requires creating the file hieradata/hosts/dumpsdataXXXX.yaml (substitute in the right number), copying from any existing spare or fallback host. They will also need to add the new host to the profile::dumps::peer_hosts key in the hieradata/common/profile/dumps.yaml file.
Example patch (two patches because we make mistakes): https://gerrit.wikimedia.org/r/c/operations/puppet/+/893031 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/893519
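A sketch of the hieradata step, assuming a hypothetical new host dumpsdata1008 and using one of the existing spares as the template:
 # in an operations/puppet checkout
 cp hieradata/hosts/dumpsdata1005.yaml hieradata/hosts/dumpsdata1008.yaml
 # edit the copy as needed, then add dumpsdata1008 to the
 # profile::dumps::peer_hosts key in hieradata/common/profile/dumps.yaml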
If dc ops is uncomfortable with doing that, they may give it to us with the setup role applied, and we can set it up as dumps::generation::server::spare ourselves, doing the above.
The /data partition will likely be set up wrong, and you'll have to adjust the LVM setup, expanding the logical volume and the filesystem to a decent size; check df when you get the host and see how it looks first (see the sketch just below).
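A minimal sketch of that expansion, assuming (check with vgs/lvs first!) that both the volume group and the logical volume for /data are named "data":
 # see what's actually there before touching anything
 df -h /data
 sudo vgs
 sudo lvs
 # grow the LV into the remaining free space and resize the ext4 filesystem in one go;
 # the VG/LV names below are assumptions, substitute the real ones from lvs
 sudo lvextend -r -l +100%FREE /dev/data/data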
Doing the actual swap in of a spare dumpsdata host for a live one
This is extra fiddly because of NFS.
When you have the new dumpsdata host imaged and with the dumps::generation::server::spare role, you'll want to do roughly the following:
TL;DR version -- make a patch to move the live host to spare and the new one to the new role; turn off puppet on those and all snapshots that mount the live share; make sure any jobs running are gone (rsyncs on dumpsdata hosts, dumps on snapshots, etc); unmount the nfs share wherever it is mounted; run puppet on the live host to make it a spare, on the new host to make it the new role, enable puppet on a testbed snapshot and test; if all good, enable it on any other snapshots where the share was mounted and check them.
Long version:
- Run rpcinfo -p on the new host and compare the output to the old host; make sure the ports are the same (see the first sketch after this list). If not, you might reboot the new host; there is a race condition which sometimes causes the various nfs daemons to start before the changes to defaults have been initially applied via puppet.
- Make sure the current dumps run is complete, for whichever type of NFS share you are replacing (sql/xml dumps, or "misc" dumps). If you are replacing a fallback host that handles both, make sure the sql/xml dumps are complete.
- Do a final rsync of all dumps data from the live production dumpsdata host to the new one (see the second sketch after this list); if you are swapping in a new fallback host, it may be fallback for both sql/xml dumps and 'other' dumps, so you may need rsyncs of both directory trees from the current fallback host. A fallback host for the sql/xml dumps should also have the dumps-stats-sender systemd timer and job removed.
- Stop puppet on all snapshot hosts with it mounted
- If this is a new SQL/XML dumps primary NFS share, stop and disable the dumps monitor (systemd job) on the snapshot with the dumps::generation::worker::dumper_monitor role
- Unmount the live share from the snapshots where it is mounted
- Disable puppet on the live and the new NFS shares
- If this is a new SQL/XML dumps primary NFS share or a new "other dumps" primary NFS share, on the live host, stop and disable the systemd job that rsyncs the data elsewhere
- Update puppet manifests to make the live share a dumps::generation::server::spare host and apply the appropriate role to the new dumpsdata host
- On the new dumpsdata host, enable puppet and run it to make that the new live share
- On the old host, enable puppet and run it to make that a dumpsdata spare
- If this is a new SQL/XML dumps primary NFS share, on a snapshot testbed, enable and run puppet, check that the mount comes from the new host, and try a test run of dumps (using test config and test output dir)
- Enable puppet on each snapshot host where it was disabled, run puppet, check that the mount comes from the new location
- Update this page (the one you are reading right now) with the new info
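For the rpcinfo check in the first step of the list above, a comparison along these lines is enough (a sketch; the hostnames are just examples, substitute the actual old and new hosts):
 # the proto/port/service columns should match between the two hosts
 ssh dumpsdata1006.eqiad.wmnet rpcinfo -p | awk '{print $3, $4, $5}' | sort -u > /tmp/old-ports
 ssh dumpsdata1007.eqiad.wmnet rpcinfo -p | awk '{print $3, $4, $5}' | sort -u > /tmp/new-ports
 diff /tmp/old-ports /tmp/new-ports && echo "ports match"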
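For the final rsync, the shape of the command is roughly as follows, run from the live host; this is only a sketch, since in practice the existing rsync systemd units on the live host already know the right source and destination trees and are the safer thing to rely on (the target hostname and the bare /data/ path here are assumptions):
 # from the live dumpsdata host; dumpsdata1007 is an example target
 sudo rsync -aH --delete /data/ dumpsdata1007.eqiad.wmnet:/data/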
The puppet patch to swap in a new host
You will need to update the entry for the host in hieradata/hosts depending on what it is going to be doing; check the stanza for the existing live host.
If the server will be primary for SQL/XML dumps, it needs to replace the existing hostname for the dumps_nfs_server key in hieradata/common.yaml. If the server will be primary for the "misc" dumps, it should instead replace the hostname for the dumps_cron_nfs_server key in that file.
You'll also need to update the host role in manifests/site.pp.
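To find the exact lines to edit, a quick grep in an operations/puppet checkout does the job (paths as described above):
 grep -n -E 'dumps_nfs_server|dumps_cron_nfs_server' hieradata/common.yaml
 grep -n 'dumpsdata' manifests/site.pp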
If it is a fallback host, data must be rsynced to it from the primary on a regular basis. To make this happen, add the hostname to the profile::dumps::internal key in hieradata/common/profile/dumps.yaml.
If it is to be a primary host and the old primary is to go away, then when you are ready to make the switch you will need to change profile::dumps::generation::worker::common so that the dumpsdatamount resource mounts the new server's filesystem on the snapshot hosts instead of the old primary's.
An example of such a patch is here, swapping in a new host for the SQL/XML primary NFS share: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924949/
Reimaging an old host
This assumes you are following the Server Lifecycle/Reimage procedure.
You likely want to preserve all the data on the /data filesystem. To make this happen, in netboot.cfg set your host to use dumpsdata100X-no-data-format.cfg. This requires manual intervention during partitioning. Specifically, you'll need to select:
- /dev/sda1 to use as ext4, mount at /boot, format
- data vg to use as ext4, mount at /data, preserve all contents
- root vg to use as ext4, mount at /, format
Write the changes to disk and the install will carry on as desired.
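Once the reimaged host comes back up, it's worth a quick check that the /data contents really survived before putting it back into service; a minimal sketch:
 df -h /data    # usage should look like it did before the reimage, not a near-empty filesystem
 ls /data | head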