Incremental Web-Backup with Data Aggregation

I searched for hours for an application that performs incremental backup to an online destination (say, SSH) and also uses data aggregation--uploading a single file containing all of the files that need to be backed up. Why did I want a backup utility with that exact design?

Because I learned through experience that, when transferring lots of small or average-sized files, most of the time is spent communicating the meta-data and file tree and waiting for responses from the destination server (as you may know, it takes much longer to upload 1000 2-byte files than it takes to upload a single 10 KB file). So how do we do it?

The Solution

The solution, which apparently nobody else has come up with, is to create and upload a single contiguous backup file. The process can be broken into three parts; the following describes what the backup utility will do.

Part One: Backing up for the first time:

  1. create a tarchive of the backup into a file like backup.tar
  2. compile a small database (using SQLite is fine) that contains the file meta-data (including the last-modified times and the hashed file contents) and the file-tree
  3. upload them to the destination.
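
The first two steps above can be sketched in Python roughly as follows. This is only a minimal sketch under my own assumptions: the function names, the `files` table layout, and the choice of SHA-256 are all hypothetical, it handles regular files only, and the upload step is left out.

```python
import hashlib
import os
import sqlite3
import tarfile


def file_sha256(path):
    """Hash a file's contents in 64 KiB reads so large files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def first_backup(source_dir, tar_path, db_path):
    """Write one tar archive plus a SQLite manifest describing source_dir."""
    db = sqlite3.connect(db_path)
    # Hypothetical schema: one row per file, keyed by its path inside the backup.
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, mtime REAL, sha256 TEXT)"
    )
    with tarfile.open(tar_path, "w") as archive:
        for root, _dirs, names in os.walk(source_dir):
            for name in sorted(names):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, source_dir)
                archive.add(full, arcname=rel)
                db.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (rel, os.path.getmtime(full), file_sha256(full)),
                )
    db.commit()
    db.close()
```

Calling something like `first_backup("/home/me/docs", "/tmp/backup.tar", "/tmp/backup.db")` would leave two files to upload. Keep the archive and the database outside the directory being backed up, or they will end up inside the backup themselves.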

Part Two: Subsequent backups:

  1. query the database to determine
    1. what files need to be removed
    2. what files are new
    3. what files have been updated (using last-modified times and the hash codes)
  2. update backup.tar by
    1. removing from backup.tar the files that have been removed or updated (server-side)
    2. creating another tarchive containing the new and updated files (local)
    3. uploading that tarchive in one-megabyte increments, hash-checking for file integrity along the way (server-side)
    4. merging the one-megabyte files with backup.tar (server-side)
  3. upload the updated database
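
Two pieces of the steps above are easy to sketch in Python: working out what changed, and cutting the new tarchive into hash-checked one-megabyte pieces. Everything here is an assumption of mine for illustration: the function names are hypothetical, the manifests are plain dicts of `path -> (mtime, sha256)` such as you might load from the database, and the actual upload and server-side merge are not shown.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # one megabyte per uploaded piece


def diff_manifests(old, new):
    """Compare two {path: (mtime, sha256)} dicts.

    Returns (removed, added, updated) as sorted lists of paths.
    """
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    updated = []
    for path in sorted(set(old) & set(new)):
        old_mtime, old_hash = old[path]
        new_mtime, new_hash = new[path]
        # A changed mtime only marks a candidate; the content hash decides.
        # A file whose content changed but whose mtime did not is missed.
        if old_mtime != new_mtime and old_hash != new_hash:
            updated.append(path)
    return removed, added, updated


def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (index, data, sha256_hex) for each piece of the file at path."""
    with open(path, "rb") as handle:
        index = 0
        while True:
            data = handle.read(chunk_size)
            if not data:
                break
            yield index, data, hashlib.sha256(data).hexdigest()
            index += 1


def verify_chunk(data, expected_sha256):
    """Receiver-side check: does the piece match the hash sent with it?"""
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

The sender would loop over `iter_chunks(...)` and transmit each piece together with its hash; the receiver would call `verify_chunk` and ask for a resend on mismatch, so a dropped connection costs at most one megabyte of rework.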

Part Three: Restoring backups:

Download the tarchive and extract it, maintaining the meta-data (like permissions, ownership, last-modified time, etc).
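
A restore along those lines could look like this minimal Python sketch. `restore_backup` is a hypothetical name, and the download is assumed to have happened already. Python's `tarfile` restores modification times and permission bits on extraction, but it only restores ownership when the script runs as root.

```python
import tarfile


def restore_backup(tar_path, target_dir):
    """Extract every member of the downloaded archive into target_dir."""
    with tarfile.open(tar_path, "r") as archive:
        if hasattr(tarfile, "tar_filter"):
            # Newer Pythons support extraction filters. The "tar" filter
            # rejects paths that escape target_dir and clears a few risky
            # permission bits, while keeping modification times.
            archive.extractall(target_dir, filter="tar")
        else:
            # Older Pythons have no filter argument.
            archive.extractall(target_dir)
```

If every permission bit has to come back exactly as it was, the `"fully_trusted"` filter is the alternative, at the cost of trusting the archive completely.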

Simple, quick, and not network-intensive. If merging zip files were fast enough, the utility could use zip instead of tar to save further on bandwidth and thus cut down on time. What do you think?

Comments

Security issues, perhaps? I would think that having a big blatant file called backup.zip would be a good target for anyone who piggybacks on the network.

But then again, I didn't understand a lot of that up there.

How about encryption?

Wouldn't that make it slower?

I'm sure when you're backing stuff up over the Internet most of the time is spent during the data transmission. Nowadays, with everyone having at least two cores on their system, safe (just-in-time) encryption should be a breeze.
