cephgeorep

Ceph File System Remote Sync Daemon
For use with a distributed Ceph File System cluster to georeplicate files to a remote backup server.
This daemon takes advantage of Ceph's rctime (recursive ctime) directory attribute, which holds the most recent ctime of all files below a given directory tree. Using this attribute, it selectively recurses only into directory tree branches containing modified files, instead of wasting time traversing every branch.
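You can inspect this attribute yourself with getfattr; the path below is only an example, so substitute a directory inside your own CephFS mount:

getfattr -n ceph.dir.rctime /mnt/cephfs/some/directory   # prints the most recent change time in the subtree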

Prerequisites

You must have a Ceph file system. rsync, scp, or a similar tool must be installed on both the local system and the remote backup server. You must also set up passwordless SSH from your sender (local) to your receiver (remote backup) with a public/private key pair, so that rsync can send your files without prompting for a password. For compilation, the Boost development libraries are needed. The prebuilt binary is statically linked, so the server does not need Boost to run the daemon.
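A minimal way to set up the passwordless SSH link, assuming a hypothetical receiver backup.example.com and the root user (substitute your own host and user):

ssh-keygen -t ed25519                  # generate a key pair on the sender (accept the defaults)
ssh-copy-id root@backup.example.com    # install the public key on the receiver
ssh root@backup.example.com true       # verify that login no longer prompts for a password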

Runtime Dependencies

Since the binary is statically linked, no Boost runtime libraries are needed on the system; only rsync (or scp, etc.) needs to be installed.

Quick Start

  • Install
  • Initialize configuration file: cephfssyncd -d (This can be skipped if you installed from .rpm or .deb)
  • Edit according to Configuration: vim /etc/ceph/cephfssyncd.conf
  • Verify settings with dry run before seeding: cephfssyncd -s -d
  • Seed remote destination (this may take a while): cephfssyncd -s
  • Enable daemon: systemctl enable --now cephfssyncd

Installation

Current Release

CentOS 7

  • yum install https://github.com/45Drives/cephgeorep/releases/download/v1.1.2/cephgeorep-1.1.2-1.el7.x86_64.rpm

Ubuntu

  • wget https://github.com/45Drives/cephgeorep/releases/download/v1.1.2/cephgeorep_1.1.2-1_amd64.deb
  • dpkg -i cephgeorep_1.1.2-1_amd64.deb

Installing from Source

  • yum install make gcc gcc-c++ boost boost-devel rsync
  • git clone https://github.com/45drives/cephgeorep
  • cd cephgeorep
  • make -j8 or make -j8 static to statically link libraries
  • sudo make install

Uninstalling from Source

  • In the same directory as makefile: sudo make uninstall

If you get the following error after running make:

/usr/bin/ld: cannot find -l:libboost_system.a
/usr/bin/ld: cannot find -l:libboost_filesystem.a

then run sed -i "s/\\.a\\b/.so/g" makefile to switch from static linking to dynamic linking.

Configuration

Default config file generated by daemon: (/etc/ceph/cephfssyncd.conf)

# local backup settings
Source Directory =            # full path to directory to backup
Ignore Hidden = false         # ignore files beginning with "."
Ignore Windows Lock = true    # ignore files beginning with "~$"
Ignore Vim Swap = true        # ignore vim .swp files (.<filename>.swp)

# remote settings
Remote User =                 # user on remote backup machine (optional)
Remote Host =                 # remote backup machine address/host
Remote Directory =            # directory in remote backup

# daemon settings
Exec = rsync                  # program to use for syncing - rsync or scp
Flags = -a --relative         # execution flags for above program (space delim)
Metadata Directory = /var/lib/cephfssync/   # where the daemon stores the last sync timestamp
Sync Period = 10              # time in seconds between checks for changes
Propagation Delay = 100       # time in milliseconds between snapshot and sync
Processes = 4                 # number of parallel sync processes to launch
Threads = 8                   # number of worker threads to search for files
Log Level = 1
# 0 = minimum logging
# 1 = basic logging
# 2 = debug logging
# If Remote User is empty, the daemon will sync remotely as the executing user.
# Propagation Delay accounts for the time it takes Ceph to
# propagate a file's modification time all the way back to
# the root of the sync directory.

You can also specify a different config file with the command line argument -c or --config, i.e. cephfssyncd -c /alternate/path/to/config.conf. If you plan on running multiple instances of cephfssyncd with different config files, be sure to give each config a unique Metadata Directory path.
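For example, a second instance could be pointed at a hypothetical second config; the paths below are illustrative only:

# /etc/ceph/cephfssyncd-archive.conf would set its own Source Directory, Remote Host,
# and a unique Metadata Directory such as /var/lib/cephfssync-archive/
cephfssyncd -c /etc/ceph/cephfssyncd-archive.conf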

* The Ceph file system has a propagation delay for the recursive ctime to make its way from a changed file up to the top-level directory containing it. To account for this delay in deep directory trees, there is a user-defined delay to ensure no files are missed. This delay was greatly reduced in the Ceph Nautilus release, so a delay of 100 ms is the new default. In testing, this was enough to sync 1000 files of 1 MB each, randomly placed within 3905 directories, without missing any. If you find that some files are being missed, try increasing this delay.

Usage

Launch the daemon by running systemctl start cephfssyncd, and run systemctl enable cephfssyncd to enable launching at startup. To monitor the daemon's output, run journalctl -u cephfssyncd -f.

Arguments and Ad Hoc Commands

cephfssyncd usage:

cephfssyncd Copyright (C) 2019-2021 Josh Boudreau <jboudreau@45drives.com>
This program is released under the GNU General Public License v2.1.
See <https://www.gnu.org/licenses/> for more details.

Usage:
  cephfssyncd [ flags ]
Flags:
  -c --config </path/to/config> - pass alternate config path
                                  default config: /etc/ceph/cephfssyncd.conf
  -d --dry-run                  - print total files that would be synced
                                  when combined with -v, files will be listed
                                  exits after showing number of files
  -h --help                     - print this message
  -n --nproc <# of processes>   - number of sync processes to run in parallel
  -q --quiet                    - set log level to 0
  -s --seed                     - send all files to seed destination
  -t --threads <# of threads>   - number of worker threads to search for files
  -v --verbose                  - set log level to 2

Alternate configuration files can be specified using the -c --config flag, which is useful for running multiple instances of cephfssyncd on the same system. -n --nproc, -q --quiet, -t --threads, and -v --verbose override the corresponding options from the configuration file. -s --seed sends every file to the destination regardless of how old the file is. -d --dry-run runs the daemon without actually syncing any files, to give the user an idea of how many files would be synced in a real run. -d --dry-run combined with -v --verbose will also list every file that would be synced.
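For example, these flags can be combined for ad hoc runs; the config path below is only an illustration:

cephfssyncd -d -v -c /etc/ceph/cephfssyncd-archive.conf   # verbose dry run: list the files that would sync, then exit
cephfssyncd -s -n 8 -t 16                                  # seed the destination with 8 sync processes and 16 worker threads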

Usage with s3 Buckets

To back up to AWS S3 buckets, some special configuration is required. The wrapper script s3wrap.sh included with the binary release allows the daemon to work with s3cmd seamlessly. Ensure s3cmd is installed and configured on your system, and use the following example configuration file as a starting point:

# local backup settings
Source Directory = /mnt/cephfs           # full path to directory to backup
Ignore Hidden = false         # ignore files beginning with "."
Ignore Windows Lock = true    # ignore files beginning with "~$"
Ignore Vim Swap = true        # ignore vim .swp files (.<filename>.swp)

# remote settings
# the following settings *must* be left blank for use with s3wrap.sh
Remote User =                 # user on remote backup machine (optional)
Remote Host =                 # remote backup machine address/host
Remote Directory =            # directory in remote backup

# daemon settings
Exec = /opt/45drives/cephgeorep/s3wrap.sh   # full path to s3wrap.sh
Flags = sync_1                              # place only the name of the s3 bucket here

# the rest of settings can remain as default ##########
Metadata Directory = /var/lib/cephfssync/
Sync Period = 10              # time in seconds between checks for changes
Propagation Delay = 100       # time in milliseconds between snapshot and sync
Processes = 1                 # number of parallel sync processes to launch
Threads = 8                   # number of worker threads to search for files
Log Level = 1

With this setup, cephfssyncd will call the s3cmd wrapper script, which in turn calls s3cmd put ... for each new file passed to it by cephfssyncd, maintaining the directory tree hierarchy.
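Conceptually, for each changed file the wrapper ends up issuing something along the lines of the command below; the file path and bucket name are only an illustration of how the hierarchy under Source Directory is preserved:

s3cmd put /mnt/cephfs/projects/report.pdf s3://sync_1/projects/report.pdf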

Notes

  • If your backup server is down, cephfssyncd will try to launch rsync or scp and fail; however, it will keep retrying. Any new files created on the server while cephfssyncd is waiting for rsync or scp to succeed will be synced on the next cycle.
  • Windows does not update the mtime attribute when drag/dropping or copying a file, so files that are moved into a shared folder will not sync if their Last Modified time is earlier than the most recent sync.
  • When the daemon is killed with SIGINT, SIGTERM, or SIGQUIT, it saves the last sync timestamp to disk in the directory specified in the configuration file to pick up where it left off on the next launch. If the daemon is killed with SIGKILL or if power is lost to the system causing an abrupt shutdown, the daemon will resync all files modified since the previously saved timestamp.
  • If Remote User is set to a user that does not exist on the remote backup server, rsync or scp will prompt for that user's password. Since the user doesn't exist, SSH fails and the daemon acts as if the remote server is down, retrying rsync or scp periodically.

45Drives Logo
