hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Can't seem to get parsyncfp 1.63 or newer to work right

novosirj opened this issue · comments

Here's an example of what happens, tested with parsyncfp 1.65 with the -d flag so we could see what is going on. It doesn't correctly do the fpart section, and it seems like never generates a list based on my input list?

[root@quorum03 bin]# env | grep -ie homedir -ie snapshot -ie mountpoint -ie backuphost                                                            BACKUPHOST=nas1                                                                                                                                   SNAPSHOT=projectsc
MOUNTPOINT=projects
HOMEDIR=/gpfs/cache/home/root/policy
[root@quorum03 bin]# /usr/local/bin/parsyncfp-1.65 -d -i ens6 --checkperiod 1800 --nowait --altcache '/root/.parsyncfp-backups' -NP 12 --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' --maxload 96 --chunksize=10T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT
DEBUG[359]: FPARTSIZE set to: [10T]
FPARTSIZE_N set to [10995116277800]
Press [ENTER] to continue.

DEBUG[380]: Using network interface [ens6] with connection quality [ 100000Mb/s]
Press [ENTER] to continue.

DEBUG[385]: 1m load is [0.85] and the 1m Load:#CPU ratio is [0.0354166666666667] ( [24] CPU cores).
    OK to continue.
Press [ENTER] to continue.

WARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-backups/fpcache].
  Enter ^C to stop this.
  If you specified '--nowait', cache will be cleared in 3s regardless.
  Otherwise, hit [Enter] and I'll clear them. 

INFO: The fpart chunk files [/root/.parsyncfp-backups/fpcache/f*] are cleared .. continuing.
INFO: The TRIMPATH you specified exists, is a dir, and is readable.
INFO: Alternative file list is readable; converting list to chunks.
DEBUG[624]: fpart fork cmd:
[cd /projects/.snapshots/projectsc; fpart -v -L -s 10995116277800 -a  -i /tmp/pfp.tmp -o /root/.parsyncfp-backups/fpcache/f 2> /root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29 & echo "${!}" > /root/.parsyncfp-backups/fpcache/FP_PIDFILE00.41.53_2020-04-29]
Press [ENTER] to continue.
INFO: Forking fpart with PID = [6639].  Check [/root/.parsyncfp-backups/fpcache/fpart.log.00.41.53_2020-04-29] for errors if it hangs.
INFO: Waiting for fpart to be forked.
INFO: [/root/.parsyncfp-backups/fpcache/f.0] visible.
INFO: Waiting [1]s for chunk files to be written.
INFO: Starting the 1st [12] rsyncs ..
DEBUG[685]: Complete rsync cmd = [cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000  -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no" -a -s --log-file=/root/.parsyncfp-backups/rsync-logfile-00.41.53_2020-04-29_0 --files-from=/root/.parsyncfp-backups/fpcache/f.0  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc  & echo "${!}" >> /root/.parsyncfp-backups/fpcache/rsync-PIDs-00.41.53_2020-04-29]
Press [ENTER] to continue.

DEBUG[704]: OUT OF RSYNC STARTUP LOOP
Press [ENTER] to continue.

WARN: 
    The number of chunk files generated by fpart [1] < the # of rsync 
    processes you specified [12].  
    Did you check the dir tree / file list to make sure you're setting the chunk 
    size appropriately (--chunksize) ?  It's currently set to [10T].
     
        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-04-29]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
DEBUG[1105]: sPIDs string in get_rPIDs() = []
Press [ENTER] to continue.
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(638) [sender=3.1.2]

[root@quorum03 bin]#

I'm not sure why this happens. Here's what the fpart log looks like (this wasn't the exact same run, but this is the same as it looked like; I didn't save the output):

[root@quorum03 ~]# cat /root/.parsyncfp-backups/fpcache/fpart.log.00.40.37_2020-04-29
Examining filesystem...
error parsing input values:
Filled part #0: size = 0, 1 file(s)
1 file(s) found.

I'm not sure what's going on here, or how to debug. Originally, I didn't have stats and scut installed. An issue on my system is that there is a stats command that belongs to Confluent that's installed on the system and in the PATH. So I change the PATH with export PATH=... where PATH no longer contains the Confluent PATH with stats but does contain /usr/local/bin which is where I have stats, scut, and fpart that are required for parsyncfp. But I'm not sure if that's not working out somehow, or if something else is going wrong?

The only other thing I can think of is that our file specified by --fromlist= actually is formatted with a space between the filesize and the PATH instead of the that's specified in the documentation/usage info. This has never been a problem before, however.

For the moment, I've dropped back to 1.61 to continue my backups. Never tried 1.62.

Any help is welcome whenever it comes; no real rush for us. I won't be trying anything until next Tuesday most likely (we run this weekly at 23:00 EST/EDT on Tuesdays).

I'll have to have it try to do that transfer again when it wraps up work on that filesystem to see what the content of the fpart files is; the amount of source data changes from week to week, theoretically, so it might not happen next week. I never considered what might happen if the parts are broken up in 10TB chunks and there aren't more than a couple of them.

1.67 seems to have solved this problem.