Transfer.py

transfer.py is a Python 3 script for moving large files or directory trees between WMF production hosts efficiently. It was initially designed for database maintenance, but it can be used to move arbitrary files.

Context

Before there was a backup and provisioning workflow, new database hosts (or hosts that needed a rebuild, e.g. after a crash) were set up in different, completely manual ways, mostly using netcat.

Before cumin was available, work started on a way to automate that (task T156462), with the aim of adding a programmable API, consistency, a speedup and firewall handling, and of generally no longer doing things manually. It was used at first to perform cold database copies.

Once cumin was deployed, and logical backups were in place, transfer.py was extended to be used for hot database backups and recoveries, as a building block of the recovery system.

Technical details

The original spec for a transfer script called for multicast or BitTorrent protocols for fast recovery of multiple hosts. For the time being, however, cumin is used to run netcat on the source host and another netcat instance listening on the target host, and either a single file or a tar copy of a directory is piped through. Optionally, the stream can be compressed with pigz and encrypted with openssl. Pigz can greatly improve the speed of the transfer, as databases can compress by as much as a factor of 5, reducing the total bandwidth used. Data can also be checksummed, but that adds some overhead at the beginning of the transfer.
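The pipe described above can be illustrated with a minimal sketch that only builds the two shell pipelines as strings. This is not the code transfer.py actually uses: the function name build_pipelines, its parameters and the host name are hypothetical, the netcat flags (-l -p, -q 0) are an assumption that depends on the netcat variant installed, and the optional openssl encryption step is left out.

```python
import os
import shlex


def build_pipelines(source_path, target_dir, target_host, port, compress=True):
    """Return (receiver_cmd, sender_cmd) shell pipelines for copying one directory.

    Hypothetical helper for illustration only; flags are assumptions.
    """
    stripped = source_path.rstrip("/")
    parent = os.path.dirname(stripped) or "/"
    name = os.path.basename(stripped)

    # Sender: archive the directory to stdout, optionally compress, send with nc.
    sender = ["tar cf - -C {} {}".format(shlex.quote(parent), shlex.quote(name))]
    # Receiver: listen with nc, optionally decompress, unpack into target_dir.
    receiver = ["nc -l -p {}".format(int(port))]

    if compress:
        sender.append("pigz -c")
        receiver.append("pigz -dc")

    sender.append("nc -q 0 {} {}".format(shlex.quote(target_host), int(port)))
    receiver.append("tar xf - -C {}".format(shlex.quote(target_dir)))
    return " | ".join(receiver), " | ".join(sender)


if __name__ == "__main__":
    receiver_cmd, sender_cmd = build_pipelines(
        "/srv/sqldata", "/srv", "target.example", 4444
    )
    # The receiver has to be listening before the sender starts.
    print(receiver_cmd)
    print(sender_cmd)
```

In a real run, each string would be handed to the remote execution layer for the corresponding host, with the receiver started first.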

On the Wikimedia Foundation infrastructure, cumin is used as the remote execution framework, but others are also available and can be made to work. However, for things like MySQL transfers, certain details such as MySQL port assignment and paths are assumed to be in specific places.

For MySQL, two modes (in addition to the default "file") were added:

xtrabackup, where the source of the backup is not read from the filesystem, but produced by a mariabackup run with streaming (xbstream) from a running MariaDB server. In this mode, the MySQL server socket is used as the origin path. It can optionally stop replication in cases where that may speed up the copy/preparation process. It only transfers the files; it does not prepare or touch the generated files in any way (that is considered out of scope for this script, as one may want to delay that step for incremental backups or other reasons).
decompress, where the source is a precompressed tar.gz file containing a single directory, in the same format as the snapshot tarballs generated by the backup system (prepared xtrabackup datadirs), and the result is a decompressed directory (very similar to the "file" operation, but without having to compress on the source).

In addition to automating all commands (iptables, tar, openssl, pigz, nc, remote execution), transfer.py performs important sanity checks, such as making sure it does not overwrite existing files (it aborts before the transfer) and making sure there is enough disk space in the selected directory.
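The two sanity checks mentioned above can be sketched as follows. This is a local-only illustration using the Python standard library, not the actual implementation: transfer.py runs its checks on remote hosts, and the function name preflight_check and its parameters are hypothetical.

```python
import os
import shutil


def preflight_check(target_dir, name, required_bytes):
    """Raise before any data is sent if the copy would be unsafe.

    Hypothetical helper: checks the local filesystem only.
    """
    destination = os.path.join(target_dir, name)

    # Never overwrite: abort if something already exists at the destination.
    if os.path.exists(destination):
        raise FileExistsError(
            "{} already exists, refusing to overwrite".format(destination)
        )

    # Make sure the filesystem holding target_dir has enough free space.
    free_bytes = shutil.disk_usage(target_dir).free
    if free_bytes < required_bytes:
        raise OSError(
            "not enough space in {}: need {} bytes, {} free".format(
                target_dir, required_bytes, free_bytes
            )
        )
    return destination
```

Failing early like this means a bad target is detected before the firewall is touched or any data is streamed.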

Dependencies

transfer.py requires the following technologies:

Python 3, preferably 3.5 or later
- cumin python class if chosen as the transfer system
A remote execution system (ssh, paramiko, salt, etc.). If none is available, there is a LocalExecution class, but it only allows running commands locally (local transfers)
- For cumin, transfer.py must be installed on a cumin* host to be able to execute remote commands
Netcat (nc)
pigz for compression
tar for archiving before streaming
openssl for encryption
du, df to calculate used and available disk size
bash to pipe the different unix commands
wmf-mariadb package and an instance running for --type=xtrabackup
xtrabackup (mariabackup) installed locally on the mariadb hosts for --type=xtrabackup
mysql client, if replication needs to be stopped
iptables to manage the firewall hole during transfer

Note: transfer.py expects the user to have root privileges without the sudo prefix.

Usage

transfer.py is installed via the Debian package transferpy. This puts transfer.py on the PATH on WMF production infrastructure (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet). It has to be run as root/sudo (like cumin).

For an up-to-date list of options, go to https://doc.wikimedia.org/transferpy/master/usage.html

Obtaining

transfer.py is a utility developed in the operations/software/transferpy repo (Gerrit). HEAD should be stable enough, but releases are published to the Wikimedia apt repository (currently, only for Buster).

What's New? (GSoC-2020)

The transfer framework has now moved to its own module, named transferpy. The framework has 3 basic modules plus a RemoteExecution module:

Transferrer: The Transferrer class is responsible for acting on the user arguments and making the send/receive possible.
Firewall: The Firewall class opens and closes the ports in iptables so that the receiving machines can receive the data.
MariaDB:
RemoteExecution: The RemoteExecution module is responsible for executing commands on the remote machines. The transfer framework mainly uses the Cumin execution.

Wishlist and known issues

The encryption has a very negative impact on performance, and it is not forward-secret. A low-penalty alternative with forward secrecy should be used instead.
Sizes are calculated with du, which is known to produce different results on different hosts even if the copy has been accurate. This is why the size check gives only a warning if it shows a difference on source and target hosts. A different, more reliable method could be used, but may take more resources.
Checksumming happens in a separate step before the transfer. It would be nice to run it in parallel with the transfer, so that it does not add latency and is not routinely disabled.
Configurable compression using other algorithms depending on the data (e.g. lz if compression speed is not the limiting factor, etc.). Zstd in particular looks promising, if claims of superior speed/compression are verified.
Multicast, torrent or another solution should be set up to allow parallel transmission of data to multiple hosts in an efficient manner
Better logging and error checking
In general, more flexibility (e.g. level of parallelism, etc.), as long as it uses or autodetects sane defaults so that usage does not become much more difficult
Firewall hole opening should be optional
It should check that a port is available before binding to it (race condition)
It should also wait until the port is fully open (polling), instead of just waiting 3 seconds
More tests for "bad input argument" combinations
The kill_job function in CuminExecution kills the subprocess on the machine running transferpy. Instead, it should kill the actual process on the remote machine.
It cannot currently transfer files between the analytics and production VLANs, since this would require opening a hole in the firewall on the network devices themselves.
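As an illustration of the two port-related wishlist items above (checking that a port is free, and polling instead of sleeping for a fixed 3 seconds), here is a minimal sketch. Both function names are hypothetical and this is not how transferpy currently behaves. port_is_free only tests the local machine, and a race remains between the check and the moment the listener actually binds. wait_until takes an arbitrary check callable because a naive test connection could be consumed by a single-shot netcat listener; how the listening state is verified on the remote host is left to the caller.

```python
import socket
import time


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port locally."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def wait_until(check, timeout=30.0, interval=0.5):
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True on success, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With these, the fixed wait could become something like wait_until(listener_is_up, timeout=10), where listener_is_up is a caller-supplied function that asks the target host whether the port is already listening.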