The Wake-on-LAN, ZFS-mirror, rsnapshot Backup

This is one of the more complicated software-hardware constructs in my house, and also one of the more complex I've attempted to describe in this blog. Its value to others may be limited, or only aspirational, or - I hope - practical, even if you don't implement all of it.

Here's the outline: my main machine (a stationary laptop) backs itself up via a cron job to a server with a 6TB mirrored ZFS drive, using Wake-on-LAN (aka "WoL"), SSH, and rsnapshot, and finally sleeps the server again when the backup is done (if the server was initially asleep).

You can pick and choose parts of this to implement: maybe you have an always-on server (that would be good - I consider the WoL part the likeliest point of failure). I don't, because this server runs stinkin' hot and pushes up my cooling bills in the summer (thus WoL). Maybe you'll choose to back up with rsnapshot to a single drive rather than a mirror: your call, your choice of backup security level.

There are multiple assumptions, possibly the biggest being that you use SSH keys for access to your machines and that you have direct root access with those keys.
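
If you haven't set that up before, the usual incantation is something like this (the server name is a placeholder, and the server has to permit key-based root logins):

# generate a key pair if you don't already have one
ssh-keygen -t ed25519
# install the public key in the backup server's root account
ssh-copy-id root@backupserver.local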

Some explanation: I already do full backups of my main computer/laptop to external USB HDs, including a pair of drives that I trade off-site with a friend. So this is overkill ... although I've heard it argued there's no such thing as "overkill" when it comes to backups.

Setting it up

I had this big workstation box sitting in a corner. I knew this machine worked with WoL: most wired network cards do these days, but it's worth testing repeatedly, because I have one or two cards that don't, and a couple of machines that wake a few times ... and then stop responding. So check carefully. Also, WoL doesn't work with USB network dongles and, while WoL exists for WiFi, it's apparently quite unreliable.
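
If you want to check in advance, ethtool (run as root, with your interface name instead of enp3s0) will tell you whether the card supports WoL and whether it's currently enabled:

# "Supports Wake-on" shows the card's capabilities,
# "Wake-on: g" means magic-packet wake is enabled
ethtool enp3s0 | grep Wake-on
# enable magic-packet wake if it isn't already
ethtool -s enp3s0 wol g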

I bought two 6 TB spinning disks from Amazon - I deliberately bought drives from two different manufacturers, because with same-manufacturer you're likely to get the same batch of drives, and they tend to fail close together. This is not desirable when building a NAS/mirror/RAID. Better to have drives that are the same size but otherwise different. These were put into the machine (a process made mildly hellish by Dell's proprietary drive trays, but eventually achieved), and then formatted with ZFS as mirrored drives. The backup volume is encrypted, and so requires a passphrase at some point after boot to mount it.
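
For reference, the ZFS part went roughly like this - the pool and dataset names ("backup", "backup/rsnapshot") are placeholders, and you'd use your own drives' /dev/disk/by-id/ names:

# create a mirrored pool across the two drives
zpool create backup mirror /dev/disk/by-id/<drive1> /dev/disk/by-id/<drive2>
# create an encrypted dataset; ZFS prompts for a passphrase
zfs create -o encryption=on -o keyformat=passphrase backup/rsnapshot
# after a reboot, the key has to be loaded and the dataset mounted by hand
zfs load-key backup/rsnapshot
zfs mount backup/rsnapshot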

Getting SSH Agent Access

One of the first things I discovered is that my backup script on the client, run from cron, didn't automatically have access to the ssh-agent running in my X session (yes, I'm still running X, not Wayland). It's not barred from having access, it just doesn't have it automatically. So I wrote a script to find the agent's SSH socket ...:

#!/usr/bin/env bash
# name: sshfindsock
# find an ssh-agent socket belonging to the current user and print
# shell commands that export it (meant to be eval'ed by the caller)
sock="$(find /tmp/ -wholename '*ssh-X*' -user "$(id --user)" 2> /dev/null | tail -n 1)"
echo "SSH_AUTH_SOCK=\"${sock}\""
echo "export SSH_AUTH_SOCK"

Keep in mind this is naively written and works fine on a single-user system: it would probably work on a multi-user or "complicated" system, but all I'm sure of is that it's been working reasonably well in my context. As is often done with ssh-agent itself, the script is run as eval "$(sshfindsock)" because the variable has to be set in the current shell (not a subshell).
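
A quick sanity check from a shell that isn't a child of your X session (ssh-add -l lists the keys the agent is currently holding):

eval "$(sshfindsock)"
ssh-add -l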

Understanding rsnapshot

The above is used in the next step / script, which wakes the remote and then uses SSH and rsnapshot to do a backup. I find I haven't blogged about rsnapshot. It's a Perl layer on top of rsync, but that description belittles one of the best backup tools available. From the man page:

rsnapshot is a filesystem snapshot utility. It can take incremental
snapshots of local and remote filesystems for any number of machines.

rsnapshot saves much more disk space than you might imagine. The amount
of space required is roughly the size of one full backup, plus a copy
of each additional file that is changed. rsnapshot makes extensive use
of hard links, so if the file doesn't change, the next snapshot is
simply a hard link to the exact same file.

Read that last bit several times. rsnapshot creates a series of folders for each machine backed up, usually named daily.{0..6}, weekly.{0..4}, and monthly.{0..11} (or at least that's how I do it - you can experiment with other arrangements). (I'm using a Bashism to portray sequences of numbers: it should be readable even if you're not familiar with the idiom ...) The more hardcore among us might even have hourly.{0..23} or yearly.{0..?}. Again, your call. The behaviour is governed by the /etc/rsnapshot.conf file on the backup server, which I'll address shortly.

That part about the hard links is really important: files that are unchanged since the previous backup are simply hard-linked to it. But the structure under any given day looks like the source machine, so it remains easy to find files. This also conserves an immense amount of space. It does have the potential to burn through your filesystem's maximum inode count if you have A) a lot of files and B) a lot of backups, but that's a story for another day.
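
You can see the hard-linking at work yourself - du counts shared blocks only once, and stat shows the same inode number (and a link count above one) for an unchanged file in consecutive snapshots. The paths below assume the placeholder pool layout from earlier:

# the second directory "costs" only its changed files
du -sh /backup/rsnapshot/daily.1 /backup/rsnapshot/daily.0
# an unchanged file shares an inode across snapshots
stat --format '%i %h %n' /backup/rsnapshot/daily.{0,1}/debianlaptop/home/giles/.bashrc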

Configuring rsnapshot

The next thing you have to get your head around is the /etc/rsnapshot.conf configuration file, which I found kind of a pain. For me, maintaining separate configs for each backup job makes more sense than one config for all jobs (although I can see advantages to both approaches).

retain      daily   7
retain      weekly  4
retain      monthly 12
rsync_long_args     --delete --numeric-ids --relative --delete-excluded --sparse
one_fs              1
backup      giles@debianlaptop.local:/home/         debianlaptop/

This is NOT the whole file, just some of the important settings. You shouldn't copy my config: read the docs and understand what's going on. And remember, those aren't spaces between the fields - rsnapshot.conf REQUIRES Tabs. As I mentioned above, I retain a week of dailies, a month of once-a-week backups, and then a year of once-a-month backups. rsync_long_args is in there because it's the likely place to set any unusual rsync settings you use in your environment. And I found one_fs important, as this system was otherwise trying to back up remote-mounted shares.
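
rsnapshot can also check its own config, and -t does a dry run that prints the rsync commands it would execute - both are worth doing before the first real backup (the config filename here follows my naming, adjust to yours):

# validate the syntax (and those Tabs) in the config file
rsnapshot -c /root/rsnapshot.debianlaptop_back.conf configtest
# show what a daily run would do, without actually doing it
rsnapshot -t -c /root/rsnapshot.debianlaptop_back.conf daily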

The cron Job

The script on the client that's run daily:

#!/usr/bin/env bash
eval "$(sshfindsock)"
# set these for your environment: the backup server's hostname, and the
# name used in the rsnapshot config filename on that server
backupHost="backupserver.local"   # example value
backThisUp="debianlaptop"
# debug level can be:
#       "" (blank)
#       "-v"
#       "-V" - shows every file transferred, and failures to transfer
#       "-D" - called "a firehose of diagnostic information" by the man page
debug="-v"
log="${HOME}/tmp/$(basename "${0}").$(date +%Y%m%d%H%M.%S).log"

# if the backup server doesn't answer a ping, wake it and give it time to come up
if ! ping -c 1 "${backupHost}" > /dev/null 2>&1
then
    /home/giles/bin/gwol "${backupHost}"
    wokeRemote=true
    # 5 seconds for remote to wake was inadequate, trying 30
    sleep 30
else
    wokeRemote=false
fi

# run the daily every day ...
ssh -A "root@${backupHost}" "rsnapshot ${debug} -x -c /root/rsnapshot.${backThisUp}_back.conf daily" >> "${log}"

# on Sundays, run the weekly backup
if [ "$(date +%A)" == "Sunday" ]
then
    ssh -A "root@${backupHost}" "rsnapshot ${debug} -x -c /root/rsnapshot.${backThisUp}_back.conf weekly" >> "${log}"
fi

# on the first of the month, run the monthly
if [ "$(date +%d)" == "01" ]
then
    ssh -A "root@${backupHost}" "rsnapshot ${debug} -x -c /root/rsnapshot.${backThisUp}_back.conf monthly" >> "${log}"
fi

# if we woke the server, put it back to sleep
if ${wokeRemote}
then
    /home/giles/bin/sol "${backupHost}"
fi
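
The crontab entry that drives it is nothing special - the script name and time here are just examples:

# m h dom mon dow command
30 3 * * * /home/giles/bin/backup_to_server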

Two scripts that are important to this are gwol and sol - the first being "Giles's Wake-on-LAN," the second being "Sleep-on-LAN". I've written fairly complex code around each because I generalized them to wake and sleep multiple machines, but at its heart gwol is essentially ssh root@localhost "wakeonlan ${ether}". Every version of Linux seems to have a different WoL utility: this is the one Debian packages; you may also encounter ether-wake or wol.

sol is essentially ssh root@${server} "systemctl suspend -i".
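
Stripped of that multi-machine generalization, the pair might look something like this - the hostname-to-MAC mapping is obviously specific to your hardware, and the MAC shown is a placeholder:

#!/usr/bin/env bash
# gwol (minimal sketch): send a WoL magic packet for a known host
case "${1}" in
    backupserver.local) ether="aa:bb:cc:dd:ee:ff" ;;
    *) echo "unknown host: ${1}" >&2 ; exit 1 ;;
esac
ssh root@localhost "wakeonlan ${ether}"

#!/usr/bin/env bash
# sol (minimal sketch): suspend a remote machine over SSH
ssh "root@${1}" "systemctl suspend -i"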

You may also notice that I only wait 30 seconds for the remote server to wake if it's not already awake. For some servers, including mine, that isn't enough for a full boot - but the expectation is that we're waking from suspend. If this has to wake it from OFF, I'm already screwed: the HD is encrypted and a password has to be entered or the server won't boot (not a great behaviour for a server - but this is my house, not a server room). Waking from suspend, 30 seconds is enough.
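
If a fixed 30 seconds makes you nervous, a more patient variation polls until the host answers (or gives up after roughly two minutes) - a sketch, not what my script currently does:

# wait up to roughly two minutes for the remote to answer a ping
for i in {1..24}
do
    if ping -c 1 -W 2 "${backupHost}" > /dev/null 2>&1
    then
        break
    fi
    sleep 5
done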

Conclusion

This system has been working successfully for several months. There have been problems: a password is required to boot the server, and another password (well after boot) is required to mount the ZFS mirrored drives. This isn't a great setup, but I've had a LOT of problems entering ZFS passwords at boot, and I only reboot the machine about once a month. For the most part this system has, despite its complexity, been working remarkably well.