Dumps/Dumpsdata hosts


XML Dumpsdata hosts

Hardware

We have various hosts:

  • Dumpsdata1001 in eqiad, production xml/sql dumps NFS spare, to be decommed:
    Hardware/OS: PowerEdge R730xd, Debian 10 (buster), 32GB RAM, 1 quad-core Xeon E5-2623 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1002 in eqiad, production xml/sql dumps NFS spare, to be decommed:
    Hardware/OS: PowerEdge R730xd, Debian 10 (buster), 32GB RAM, 1 quad-core Xeon E5-2623 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1003 in eqiad, production misc dumps NFS:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 quad-core Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1004 in eqiad, production xml/sql dumps NFS spare:
    Hardware/OS: PowerEdge R740xd, Debian 10 (buster), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1005 in eqiad, production xml/sql dumps NFS spare:
    Hardware/OS: PowerEdge R740xd, Debian 10 (buster), 64GB RAM, 1 Intel Xeon Silver 4112 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1006 in eqiad, production xml/sql dumps NFS primary:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS
  • Dumpsdata1007 in eqiad, production xml/sql and misc dumps NFS fallback:
    Hardware/OS: PowerEdge R740xd, Debian 11 (bullseye), 64GB RAM, 1 Intel Xeon Gold 5220 cpu, HT enabled
    Disks: 12 4TB disks in one 12-disk RAID 10 volume; two 1TB disks in RAID 1 for the OS

Services

The production host is NFS-mounted on the snapshot hosts; generated dumps are written there and rsynced from there to the web and rsync servers.
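A quick way to confirm the mount on a snapshot host (a sketch only; the exact mount point and server hostname come from puppet):

    # list NFS mounts; expect an entry served by the current primary dumpsdata host
    findmnt -t nfs,nfs4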

Deploying a new host

Now done by dc ops!

You'll need to set up the raid arrays by hand. The single LVM volume mounted on /data is ext4.
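Some sanity checks once the arrays and volume exist (a sketch; exact device and volume names will vary):

    lsblk            # confirm the RAID 10 data volume and the RAID 1 OS disks
    sudo lvs         # confirm the single LVM volume backing /data
    df -T /data      # confirm /data is mounted and formatted ext4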

You'll want to make sure your host has the right partman recipe in netboot.cfg (configured in modules/profile/data/profile/installserver/preseed.yaml); at this writing that is dumpsdata100X.cfg. This sets up sda1 as /boot, one LVM volume as / and one as /data.
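To double-check which recipe a host is assigned, you can grep the preseed config in a puppet repo checkout; a minimal sketch:

    git grep -n 'dumpsdata' modules/profile/data/profile/installserver/preseed.yaml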

Install in the usual way (add to puppet, copying a pre-existing production dumpsdata host stanza, set up everything for PXE boot and go).

The initial role for the host can be dumps::generation::server::spare, applied by the dc ops folks, and we can then move it into production from there. This requires creating the file hieradata/hosts/dumpsdataXXXX.yaml (substitute in the right number), stealing from any existing spare or from a fallback host. They will also need to add the new host to the key profile::dumps::peer_hosts in hieradata/common/profile/dumps.yaml.
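A hedged sketch of those hiera changes (the hostnames here are invented for illustration):

    cd operations/puppet
    cp hieradata/hosts/dumpsdata1005.yaml hieradata/hosts/dumpsdata1008.yaml
    $EDITOR hieradata/hosts/dumpsdata1008.yaml    # adjust for the new host
    # then add dumpsdata1008 to the profile::dumps::peer_hosts key in
    # hieradata/common/profile/dumps.yaml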

Example patch (two patches because we make mistakes): https://gerrit.wikimedia.org/r/c/operations/puppet/+/893031 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/893519

If dc ops is uncomfortable with doing that, they may give it to us with the setup role applied, and we can set it up as dumps::generation::server::spare ourselves, doing the above.

The /data partition will likely be set up wrong, and you'll have to go fiddle with the LVM and expand it and the filesystem to a decent size; check the df when you get the host and see how it is first.
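A minimal sketch of the expansion, assuming the volume group and logical volume are both named "data" (check with lvs first; resize2fs can grow a mounted ext4 filesystem online):

    df -h /data                                  # see what you actually got
    sudo lvextend -l +100%FREE /dev/data/data    # grow the LV into the free space
    sudo resize2fs /dev/data/data                # grow the filesystem to match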

Doing the actual swap in of a spare dumpsdata host for a live one

This is extra fiddly because of NFS.

When you have the new dumpsdata host imaged and with the dumps::generation::server::spare role, you'll want to do roughly the following:

TL;DR version -- make a patch to move the live host to spare and the new one to the new role; turn off puppet on those and all snapshots that mount the live share; make sure any jobs running are gone (rsyncs on dumpsdata hosts, dumps on snapshots, etc); unmount the nfs share wherever it is mounted; run puppet on the live host to make it a spare, on the new host to make it the new role, enable puppet on a testbed snapshot and test; if all good, enable it on any other snapshots where the share was mounted and check them.

Long version:

  • Run rpcinfo -p on the new host and compare to the old host; make sure the ports are the same (see the command sketch after this list). If not, you might reboot the new host; there is a race condition that sometimes causes the various NFS daemons to start before the changes to defaults have been initially applied via puppet.
  • Make sure the current dumps run is complete, for whichever type of NFS share you are replacing (sql/xml dumps, or "misc" dumps). If you are replacing a fallback host that handles both, make sure the sql/xml dumps are complete.
  • Do a final rsync of all dumps data from the live production dumpsdata host to the new one; if you are swapping in a new fallback host, it may be fallback for both sql/xml dumps and 'other' dumps, so you may need rsyncs of both directory trees from the current fallback host. A fallback host for the sql/xml dumps should also have the dumps-stats-sender systemd timer and job removed.
  • Stop puppet on all snapshot hosts that have the share mounted
  • If this is a new SQL/XML dumps primary NFS share, stop and disable the dumps monitor (systemd job) on the snapshot with the dumps::generation::worker::dumper_monitor role
  • Unmount the live share from the snapshots where it is mounted
  • Disable puppet on the live and the new NFS share hosts
  • If this is a new SQL/XML dumps primary NFS share or a new "other dumps" primary NFS share, on the live host, stop and disable the systemd job that rsyncs the data elsewhere
  • Update puppet manifests to make the live share a dumps::generation::server::spare host and apply the appropriate role to the new dumpsdata host
  • On the new dumpsdata host, enable puppet and run it to make that the new live share
  • On the old host, enable puppet and run it to make that a dumpsdata spare
  • If this is a new SQL/XML dumps primary NFS share, on a snapshot testbed, enable and run puppet, check that the mount comes from the new host, and try a test run of dumps (using test config and test output dir)
  • Enable puppet on each snapshot host where it was disabled, run puppet, check that the mount comes from the new location
  • Update this page (the one you are reading right now) with the new info
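A hedged command sketch of the above (hostnames, the cumin host selector, the task ID, and the snapshot mount point are all placeholders; the disable-puppet/enable-puppet and run-puppet-agent wrappers are assumed to be available as on other production hosts):

    # compare NFS ports on the old and new hosts
    rpcinfo -p dumpsdata1006.eqiad.wmnet > /tmp/old.ports
    rpcinfo -p dumpsdata1008.eqiad.wmnet > /tmp/new.ports
    diff /tmp/old.ports /tmp/new.ports

    # disable puppet on the affected snapshots and on both dumpsdata hosts
    sudo cumin 'snapshot10*' 'disable-puppet "dumpsdata swap - T000000"'

    # on each affected snapshot, unmount the live share
    sudo umount /mnt/dumpsdata    # placeholder mount point

    # once the puppet patch is merged, re-enable and run puppet host by host
    sudo enable-puppet "dumpsdata swap - T000000" && sudo run-puppet-agent

    # verify the mount now comes from the new host
    findmnt -t nfs,nfs4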

The puppet patch to swap in a new host

You will need to update the entry for the host in hieradata/hosts depending on what it is going to be doing; check the stanza for the existing live host.

If the server will be primary for SQL/XML dumps, it needs to replace the existing hostname in hieradata/common.yaml for the dumps_nfs_server key. If the server will be primary for the "misc" dumps, it should replace the hostname in that file for the dumps_cron_nfs_server key instead.
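A hypothetical sketch of that edit for a new SQL/XML primary (hostname invented):

    sed -i 's/^dumps_nfs_server:.*/dumps_nfs_server: dumpsdata1008.eqiad.wmnet/' \
        hieradata/common.yaml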

You'll also need to update the host role in manifests/site.pp.

If it is a fallback host, data must be rsynced to it from the primary on a regular basis. To make this happen, add the hostname to the profile::dumps::internal key in hieradata/common/profile/dumps.yaml.

If it is to be a primary host and the old primary is to go away, when you are ready to make the switch you will need to change profile::dumps::generation::worker::common so that the dumpsdatamount resource mounts the new server's filesystem on the snapshot hosts instead of the old primary's.
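To locate the mount definition to update, a grep along these lines should work (the exact manifest path is an assumption):

    git grep -n 'dumpsdatamount' modules/profile/manifests/dumps/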

An example of such a patch is here, swapping in a new host for the SQL/XML primary NFS share: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924949/

Reimaging an old host

This assumes you are following the Server Lifecycle/Reimage procedure.
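The usual entry point is the sre.hosts.reimage cookbook, run from a cumin host; a sketch with placeholder host number and task ID:

    sudo cookbook sre.hosts.reimage --os bullseye -t T000000 dumpsdata100X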

You likely want to preserve all the data on the /data filesystem. To make this happen, set your host in netboot.cfg to use dumpsdata100X-no-data-format.cfg. This requires manual intervention during partitioning; specifically, you'll need to select:

  • /dev/sda1 to use as ext4, mount at /boot, format
  • data vg to use as ext4, mount at /data, preserve all contents
  • root vg to use as ext4, mount at /, format

Write the changes to disk and the install will carry on as desired.
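Once the host is back up, it is worth confirming that the data actually survived:

    df -hT /data    # should show the ext4 volume at its previous size
    ls /data        # the existing dumps trees should still be there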