usatiuk / backup

A deduplicating backup tool

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Backup app

Key features

  • Deduplicated backups with additional compression and encryption

  • Files are split into chunks and stored in a repository as a list of pointers to these chunks

  • These chunks are reused for all files in the repository, based on their hash matching.

  • If when creating an archive, a file hasn’t changed it is reused from a previous archive

  • Content defined chunking algorithm based on a rolling hash

    • Unlike when splitting file in fixed size chunks, still works when there is an insertion/deletion of bytes in middle/beginning of the file
  • These Files, Chunks and Archives are stored in object storage Repository

    • Default implementation - FileRepository, stores these objects in the file system

    • Possible other implementations - database, cloud object storage…

  • Multithreading

  • Directories are ignored if they have a .nobackup file

  • Files are ignored in a directory recursively based on a .ignore file - one line contains one regex rule, which, if a directory entry matches, it is ignored

Quick start

Initialize a repository with compression and encryption (you can also specify other options like –from here, they will be written to the repository if it’s possible, so far changing options of an already existing repo is not implemented)

sembackup init --repo <target dir> --compression zlib --compression-level 4 --encryption aes --password <password> --salt <random salt>

Run a backup

sembackup run --from <source dir> --repo <target dir> --password <password>

List avaiable archives

sembackup list --repo <repo dir> --password <password>

List files in an archive

sembackup list-files --repo <repo dir> --password <password> --aid <archive id>

Restore a backup

sembackup restore --repo <repo dir> --password <password> --aid <archive id> --to <destination>

Compare source dir with latest archive

sembackup diff --repo <repo dir> --password <password> --from <source dir>

Compare source dir with an archive

sembackup diff --repo <repo dir> --password <password> --from <source dir> --aid <archive id>

Compare two archives

sembackup diff --repo <repo dir> --password <password> --from <source dir> --aid <archive id> --aid2 <archive id>

Compare subdirectory in the source dir with latest archive

sembackup diff --repo <repo dir> --password <password> --from <source dir> --prefix <subdir>

Data format

All data is represented as objects, stored in a repository.

An Archive object represents a snapshot of the file system in the moment of its creation, and consists of a list of pointers (Object ids) to File objectsю

File object consists of its basic metadata, and a list of chunks, identified by their ids, which can be shared between multiple files (and within the same file) if their MD5 hashes match.

Chunk object is a binary blob, identified by its MD5 hash.

These objects are children of Object, providing a getKey() method, (name for Archive, path for File, and MD5 hash for Chunk) which is used by Repository to make them easily accessible.

In default (and so far the only) repository implementation FileRepository these objects are grouped together into files of size approixmately repo-target MB, (by default 128), and there exists a key-value index of indexed objects written into index, and an offsets file recording the location and offset of each object in the file system.

About

A deduplicating backup tool


Languages

Language:C++ 86.7%Language:Shell 10.3%Language:CMake 2.8%Language:C 0.1%