Obsolete:Zfs replication

This page contains historical information. It may be outdated or unreliable.

We use Zfs replication to copy data from one host with Zfs to another. In particular this is how we keep replicated copies of our image data (and soon of our thumbnails).

The basic idea of Zfs replication is that one creates a snapshot of the filesystem one wants to send initially to a secondary host. The command zfs send is invoked on the first server, zfs recv on the host getting the copy, and then the data is streamed over block by block, rather than relying on a process that must walk through directory trees.

Remember that a snapshot is not actually a literal copy of the filesystem at a point in time; it is just a set of reference pointers to blocks. If files are removed from the live filesystem, the blocks are still referenced in the snapshot and so they don't get put back in the free pool (and the snapshot grows in size). Zfs send walks through the blocks referenced by the snapshot given as a parameter and sends all the blocks to stdout. Zfs recv reads blocks from stdin and creates or updates a filesystem from them. See [1] for more about the innards.
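
For example, a quick way to see how much space snapshots are still holding on to (assuming the export/upload filesystem used in the examples below):

  zfs list -t snapshot -o name,used,referenced | grep export/upload

The USED column of a snapshot is the space referenced only by that snapshot, so it grows as the live filesystem diverges from it.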

In a nutshell

To get replication going between two hosts, do the following. (We use as an example the filesystem /export/upload.)

Send the initial snapshot and create the filesystem on the remote host

  1. make sure that the receiving host is running Solaris updated after October 2008
  2. make sure the filesystem you are replicating does not already exist on the receiving host
  3. zfs snapshot export/upload@zrep-00001 on the origin host
  4. zfs send export/upload@zrep-00001 | ssh otherservername "cat > /export/save/upload@zrep-00001"
  5. wait several days depending on the size of the dataset
  6. cat /export/save/upload@zrep-00001 | zfs recv export/upload

Send incremental covering the days that the first step took

  1. zfs snapshot export/upload@zrep-00002 on the origin host
  2. zfs send -i export/upload@zrep-00001 export/upload@zrep-00002 | ssh otherservername "cat > /export/save/upload@zrep-00002"
  3. cat /export/save/upload@zrep-00002 | zfs recv export/upload

Prep to automate the job

  • Repeat the above, incrementing the snapshot number each time, until a transfer takes less than a minute.
  • Create the file /etc/zrep-last-snapshot on the origin host with the number of the last snapshot you created and sent. Example: if you sent export/upload@zrep-18191, the file would contain 18191.

Add to cron and forget about it

  • Set up a cron job on the receiving host using one of the zfs-replicate scripts from /home/wikipedia/conf/zfs-tools/optlocalbin/ (see the README in that directory). Edit the script for your host and filesystem, and add your copy to that directory. Also see the crontabs in /home/wikipedia/conf/zfs-tools/crontab and add your host's crontab there.
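
For reference, a crontab entry for this has roughly the following shape (hypothetical schedule and log path; the real entries are the ones kept in the crontab directory above):

  # run the replication script every 20 minutes and keep a log
  0,20,40 * * * * /opt/local/bin/zfs-replicate.images >> /var/log/zfs-replicate.log 2>&1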

Trouble

Let's suppose that the cron job dies, or zfs craps on itself, or some other fun thing happens. What do you do then?

  • First, disable the cron replication job on the primary media server (at the time of this writing, ms7): /opt/local/bin/zfs-replication.sh should get commented out in the crontab.
  • Now, determine which is the last snapshot that made it over successfully to the replication host (at this writing, ms8). zfs list | tail -20 will show you the most recent snapshots; find the one with the most recent date/time in the name. Let's call it snapshotA.
  • Make sure that snapshot exists on the primary media host. You can give the command zfs list snapshotA; don't try a plain zfs list, which was last seen running for up to 30 minutes on the primary host without producing output. If the listing shows the snapshot there, you can skip the next step.
  • The snapshot wasn't on the primary? Find the last one that is.
    • You can't use a plain zfs list but you can use zpool history -il export | tail -20 (change 'export' to the pool name if this is no longer right). You want to look at the lines that look like 2012-04-24.07:30:01 zfs snapshot export/upload@partial-2012-04-24_07:30:00 [user root on ms7:global].
    • On the replication host look at zfs list | tail -20 and find the latest snapshot that is present on both the primary and the replication host. We'll call this one snapshotA now.
    • zfs list the snapshot on the primary to make sure it's really there and wasn't deleted later.
    • Roll back the replication host to that snapshot: zfs rollback snapshotA
    • Now continue with the normal steps.
  • Find the most recent snapshot on the primary; it might be a daily or a weekly or a partial. Use zfs list | tail -20 to find it. Let's call this one snapshotB.
  • Do zfs replication for this and all intermediate snapshots back to snapshotA: on the primary run zfs send -I snapshotA snapshotB | /usr/bin/ssh replication-hostname-here "/opt/wmf/bin/mbuffer -q -b 26214 | zfs receive -F export/upload" Do not leave out the double quotes!! Change export/upload to the right filesystem if things have been moved around in the meantime. (The whole sequence is also sketched after this list.)
  • When it completes, zfs list | tail -20 on the replication host to make sure you now see snapshotB in the list.
  • On the primary, edit the file /etc/partial-snapshot (?) which will consist of one line with a snapshot name in it. Put the name of snapshotB in there instead.
  • Edit cron to reactivate the replication cron job. It will now start from where you left off.
  • For bonus points, if you feel paranoid, check that it's running on schedule, and after an hour check that at least one new snapshot has made it over to the replication host.
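
Putting those steps together, the catch-up sequence looks roughly like this (a sketch only; snapshotA and snapshotB stand for the full snapshot names you identified above, e.g. export/upload@partial-2012-04-24_07:30:00, and the counter file name follows the uncertain note above):

  # on the replication host: roll back to the last snapshot both hosts have
  zfs rollback snapshotA
  # on the primary: send everything from snapshotA up to the newest snapshot
  zfs send -I snapshotA snapshotB | /usr/bin/ssh replication-hostname-here "/opt/wmf/bin/mbuffer -q -b 26214 | zfs receive -F export/upload"
  # on the primary: record where the cron job should pick up from, then re-enable it in cron
  echo snapshotB > /etc/partial-snapshot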

Some timing statistics

Initial filesystem transfer

Note that these tests were done around the time we hit the 5 million file mark on Commons, so that gives you a vague idea of the number of files we were working with. (Order of magnitude.)

Zfs replication can be pretty slow. Here's how it's supposed to work:

  1. Create the first snapshot
    zfs snapshot export/upload@replicateme-1
  2. Send it to the other host
    zfs send export/upload@replicateme-1 | ssh otherservername "zfs recv export/upload@replicateme-1"
  3. Profit!

In reality, no. The snapshot gets created almost instantly. But the send and recv are both time-consuming. Zfs recv needs to write out an exact copy of the data it gets, so while it's futzing about on the disk, ssh is saying it doesn't have enough buffer to read any more incoming data. So it tells zfs send to wait up. Zfs send is also futzing around on the disks reading blocks, but there are periods where it finds chunks of data it could toss down the tubes quickly, only it's thwarted because zfs recv backs up. So... running the command this way gets you around 5 GB a day, if you are lucky. This is between two hosts in the same rack!

A program called "mbuffer" can be put right behind ssh: it can hold a lot of input while waiting for zfs recv to catch up. But even that doesn't get us a lot of gain. Running the command like this:

  • zfs send export/upload@replicateme-1 | ssh otherservername "/opt/ts/bin/mbuffer -q -b 26214 2>/dev/null | zfs recv export/upload@replicateme-1"

gets us about 1 GB an hour. When you have 6T, that's going to take... about 250 days :-P

So what we do for these huge initial transfers is

  • zfs send export/upload@replicateme-1 | ssh otherservername "cat > /export/save/upload@replicateme-1"

Notice that we put it in another directory. (Or we could name it something else.) That's because when the filesystem eventually gets written out on the new host, it's going to include export/upload@replicateme-1 as a snapshot. This just avoids any issues.

Doing it this way we were able to complete a transfer of 6.1T in 10 days. (From ms1 (Sun Fire X4500) to ms5 (Sun Fire X4540), both hosts in the same rack.)
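
For scale, 6.1T in 10 days works out to roughly 7 MB/s sustained, versus the roughly 0.3 MB/s (1 GB an hour) we were getting through the straight ssh + mbuffer pipe.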

Unpacking the data

Once the snapshot stream transfer has completed, you can

  1. cat /export/save/upload@replicateme-1 | zfs recv export/upload

This completed in less than a day for us on ms5 with 6.1T of data.

First incremental transfer

In order to send over an incremental, one wants to do the following:

  1. Create the next snapshot (this will allow us to send the difference between this one and the previous one)
    zfs snapshot export/upload@replicateme-2
  2. Send the incremental (-i for zfs send means diff between the two snapshots, -F for zfs recv means to rollback to the state just after the last imported snapshot).
    zfs send -i export/upload@replicateme-1 export/upload@replicateme-2 | ssh otherservername "/opt/ts/bin/mbuffer -q -b 26214 2>/dev/null | zfs receive -F export/upload@replicateme-2"

However...

After the 10 days it took to get the 6.1T over, we had to get the next 10 days' worth of stuff transferred. The above statistics still apply to incremental transfers: the transfer with mbuffer was looking like about 1 GB an hour, as usual. So instead we ran

  • zfs send -i export/upload@replicateme-1 export/upload@replicateme-2 | ssh otherservername "cat > /export/save/upload@replicateme-2"

which completed in 5.5 hours for 120GB of data transferred.

The data unpacking time was about an hour and ten minutes (120GB of data, existing filesystem 6.05T).
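
The unpack step for a saved incremental is the same as for the full transfer, except that zfs recv needs -F so it can first roll the filesystem back to the last imported snapshot (a sketch, using the snapshot names from above):

  cat /export/save/upload@replicateme-2 | zfs recv -F export/upload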

Further incrementals

1.5G of data (covering the 1 hour and 15 minutes for the previous unpacking) transferred using the above method in about 3 minutes, maybe less.

Unpacking this 1.5GB took about 4 minutes. This was using the above method (cat to file, zfs recv afterwards).

Once you are relatively caught up you can switch to one-stage transfers. This is what we want, because this runs out of cron once every ten minutes, so that in case of catastrophe the most data we lose is ten minutes' worth of uploads. (That is between 35 and 90 images, as of 4 September 2009.)

The replication job in cron runs /opt/local/bin/zfs-replicate.images, which creates a snapshot (zfs snapshot) with a number on the end. We use this number to keep track of which one is next in the sequence. It then sends the incremental the way it was supposed to work:

  • zfs send -i export/upload@zrep-nnn export/upload@zrep-nnn+1 | ssh servername "/opt/ts/bin/mbuffer -q -b 26214 2>/dev/null | zfs receive -F export/upload@zrep-nnn+1"

When it arrives, the script does a zfs destroy of the old snapshot export/upload@zrep-nnn on both sides, and we're done.

These ten-minute incrementals appear to run in under a minute.
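
For reference, here is a rough sketch of what such a replication script does. This is an outline only, not the actual /opt/local/bin/zfs-replicate.images; it assumes the /etc/zrep-last-snapshot counter file described earlier and a bash-capable host.

  #!/bin/bash
  # sequence number of the last snapshot that was sent successfully
  last=$(cat /etc/zrep-last-snapshot)
  next=$(printf '%05d' $(( 10#$last + 1 )))   # force base 10 so leading zeros don't trip it up
  # take the new snapshot and send the difference since the previous one
  zfs snapshot export/upload@zrep-$next
  if zfs send -i export/upload@zrep-$last export/upload@zrep-$next | \
     ssh servername "/opt/ts/bin/mbuffer -q -b 26214 2>/dev/null | zfs receive -F export/upload@zrep-$next"; then
      # on success, drop the old snapshot on both sides and record the new number
      zfs destroy export/upload@zrep-$last
      ssh servername "zfs destroy export/upload@zrep-$last"
      echo $next > /etc/zrep-last-snapshot
  fi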

We should keep an eye on them; I wonder what happens the first time someone uploads a 100MB file on a fast connection. We should have some stats on the number of uploads and size of uploads in 10 minute intervals and track it for a month (really, track it forever; it would be useful to have this data). I'd also like to see this data aggregated for daily and weekly reports.

NOTE: replication currently runs once every 20 minutes; the ten minute intervals stopped being long enough.

Other notes

  • Because Zfs utilizes copy-on-write, each snapshot consists of a list of blocks that have changed since the time the snapshot was taken. The list gets added to as more changes are made to the filesystem. The list of changed blocks is not limited just to new data in the files. Every time a file is accessed, its atime changes; this is a change to the metadata and will get added to the snapshot. So for example, if I were to do a du on some chunk of the commons image directories every day in order to track growth, all the metadata from the change in access times would go into the snapshots and be sent over as part of the incrementals to our system with the replicated copy. Another example: if we rsync or send a tar of these files somewhere for an off-site backup, metadata on each file and directory gets touched... and that goes into the next snapshot. Fix this by turning off atime on the zfs pool (see the example after this list).
  • For versions of Solaris earlier than 10/08, when an incremental stream is being processed by a host (zfs recv), the filesystem that is being updated is unmounted and is inaccessible during the time of the stream processing. For this reason you should ensure that any host receiving snapshots has been updated; cat /etc/release to check.
  • When an incremental stream is being processed by a host (zfs recv), the filesystem must be "rolled back" so that it looks just like the copy in the earlier snapshot. This means that access times etc. all get rolled back. If the filesystem is not created via a zfs snapshot in the first place, there is no way to sync it with its original from which it was copied, using incrementals; there is no way for the filesystem with the copy to be rolled back to a previous known state such that zfs can apply changes.
  • Zfs send streams can be saved as files, and the files can be used to restore from, as long as a contiguous series of incremental stream files is preserved. (There is always a risk that these files wind up with some corruption, but that's a separate issue.) If one or more of the incrementals is unavailable and cannot be regenerated from the snapshots on the filesystem itself (say, the snapshots were deleted), then restoration is impossible. If the snapshot from which the full was generated still resides on the filesystem to be copied, a new snap can be taken and an incremental generated from that. But there is no way to get a "diff" from two different full zfs send streams, even of the same filesystem, once the snapshots corresponding to those streams have been removed.
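
A minimal example of the atime fix mentioned in the first note above (assuming the filesystem is still export/upload):

  zfs set atime=off export/upload
  zfs get atime export/upload    # confirm the property took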