After losing my data a couple of times, I decided to get serious about backups and actually do them. I blogged about this loss a while back. I concluded that I must have the touch of death.

Strategy

A backup process can be as simple or as sophisticated as you want to make it. The easiest approach is to simply copy everything on your computer to an external backup drive. Of course, that takes a lot of space, may require special permissions, and is difficult to pick through when you are trying to recover. It also leaves you with only the most recent copy, so you cannot look back in time if you need a file you deleted a couple of weeks ago. Ideally, you want to back up exactly the files you need, maintain daily (incremental) snapshots of them, and be able to recover them easily. You may also have to consider multiple local computers and perhaps even some remote ones, like your web server.

Ah, so it gets complicated quickly. However, regardless of how sophisticated you make your backup process, you must keep the following things in mind:

  1. Backups should be automated (and able to catch up when the computer was off at the scheduled time)

  2. There should be no single point of failure (even for your incremental snapshots)

  3. Files should be easy to recover

Which Program?

I prefer rdiff-backup and duplicity over the other alternatives. Both tools use librsync, the library behind rsync's delta algorithm, to create incremental snapshots.

Originally, I did all my backups using rdiff-backup to a 40GB external USB hard drive, until it ran out of space. rdiff-backup is nice because it stores the snapshot as an exploded mirror and then stuffs the incremental changes inside of it in a special directory named rdiff-backup-data. Thus, you can navigate the backup without any special commands.
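
For reference, a plain local rdiff-backup run against such a drive looks like this (the paths here are hypothetical):

# mirror ~/documents onto the USB drive; increments are stored
# under /mnt/usb/documents/rdiff-backup-data automatically
rdiff-backup ~/documents /mnt/usb/documents

# recover a file as it existed two weeks ago
rdiff-backup -r 2W /mnt/usb/documents/notes.txt ~/notes.txt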

I then purchased a much larger (300GB) ethernet disk (NAS) from LaCie. Although it runs Linux, the only protocols it supports (that are useful under Linux) are SMB and FTP. This limitation rendered rdiff-backup useless, since the exploded snapshot would be a mess of botched permissions and file names. I then discovered duplicity, which stores the backups strictly as rdiffdir tarballs and signatures (essentially the contents of rdiff-backup-data). Since all the data is wrapped up in an archive, it is shielded from the storage medium. Hence, I now use FTP to push the changes onto the backup drive.
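
A duplicity run over FTP looks something like this (the host name and paths are made up; duplicity takes the password from the FTP_PASSWORD environment variable, and --no-encryption skips the GPG layer for brevity):

# credentials for the NAS
export FTP_PASSWORD=secret

# push an incremental snapshot over FTP
duplicity --no-encryption ~/documents ftp://backup@nas/documents

# pull the latest snapshot back out
duplicity restore --no-encryption ftp://backup@nas/documents ~/documents-restored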

Both programs work in much the same way, so porting the script from rdiff-backup to duplicity was fairly trivial. Of course, on top of that, I needed to add a little script of my own to configure the command line and to clean out old revisions, all based on an external configuration file.
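
That glue script is nothing fancy. A minimal sketch of the idea (the configuration layout, variable names, and retention period here are all hypothetical):

#!/bin/sh
# usage: backup.sh <profile>
# each profile is a file defining SOURCE, TARGET, and KEEP
. /etc/backup/"$1".conf

# push an incremental snapshot to the target
duplicity "$SOURCE" "$TARGET"

# prune snapshots older than the retention period (e.g. KEEP=1M)
duplicity remove-older-than "$KEEP" --force "$TARGET"

Invoked as backup.sh mail, it would read /etc/backup/mail.conf and handle both the backup and the cleanup.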

Remote Command Execution

The most fundamental part of rdiff-backup, and probably of any backup system, is serving the backup from the remote machine. It is convenient, for the sake of example, to assume that all machines are local and can mount each other, but that is often not the case. Mounts are also unreliable, since they don't often survive restarts gracefully (okay, and they also involve additional, tedious configuration).

This brings us to rsync over ssh. Of course, the phrase "rsync over ssh" is almost always followed immediately by "ssh keys". SSH is a nice, secure protocol, but out of the box it requires user interaction before any additional steps are taken. When folks create ssh keys for automation, the tendency is to create them with no passphrase. I use this technique too, but it is quite a large security hole, since anyone who gets hold of the key file gains access to the remote machine.
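
Generating such a dedicated, passphrase-less key takes one command (the file name is just my convention):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa_backup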

So what do we do? It just so happens that ssh keys have a feature that allows for a restricted environment, which is perfect for serving backups: an authorized key can be configured to execute only a single command. The syntax looks like:

from="10.10.10.10",command="~/bin/rdiff-backup-wrapper.sh" ssh-rsa ...

With this entry in place in the ~/.ssh/authorized_keys2 file (just authorized_keys on modern OpenSSH), any connection using this key automatically executes the listed command, ignoring whatever command was specified on the command line.
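
The effect is easy to verify (host name hypothetical):

# whatever command we request, the server runs the wrapper instead
ssh -i ~/.ssh/id_rsa_backup backuphost ls -l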

As I discovered when setting up my backups, the problem with this configuration is that two different backup profiles cannot run against the same machine with the same ssh key, because the same wrapper command gets executed either way (I needed more specificity).

SSH_ORIGINAL_COMMAND

Enter SSH_ORIGINAL_COMMAND. Again, ssh comes to our rescue. There happens to be a special environment variable that sshd sets when it runs a forced command: SSH_ORIGINAL_COMMAND contains everything after the hostname in the calling ssh command. We could execute its contents directly (which would defeat the whole purpose of the command restriction), or we can pass it as a parameter to a wrapper script, which can then interpret the value and run an appropriate command.

In the case of my backups, I use the following entry:

from="10.10.10.10",command="~/bin/rdiff-backup-wrapper.sh ${SSH_ORIGINAL_COMMAND:-}" ssh-rsa ...

In that wrapper, I interpret the command and serve the backup accordingly. Doing this allows me to use the same ssh key to serve different backup profiles.
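
The wrapper itself can be a simple dispatch. Here is a sketch of the idea (the profile names and paths are hypothetical; --restrict is rdiff-backup's option for confining the server to a single directory tree):

#!/bin/sh
# The client's original command line arrives as our arguments (via
# ${SSH_ORIGINAL_COMMAND:-} in the authorized_keys entry); the
# profile name is its last word.
for arg in "$@"; do profile="$arg"; done

case "$profile" in
    mail)
        exec rdiff-backup --server --restrict /var/mail
        ;;
    home)
        exec rdiff-backup --server --restrict /home
        ;;
    *)
        echo "unknown backup profile: $profile" >&2
        exit 1
        ;;
esac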

Calling rdiff-backup

Once that is configured, and I have some reasonable defaults for my rdiff-backup script, I use the following command as my remote schema for rdiff-backup.

ssh -C -i ~/.ssh/id_rsa_backup %s ~/bin/rdiff-backup-wrapper.sh mail

In this case, I am running my mail backup, using my backup identity (with no passphrase).
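
Putting it all together, the full client-side invocation comes out roughly like this (host and paths hypothetical):

rdiff-backup \
  --remote-schema 'ssh -C -i ~/.ssh/id_rsa_backup %s ~/bin/rdiff-backup-wrapper.sh mail' \
  mailhost::/var/mail /backups/mail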

Resources