hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

parsyncfp 1.67 and earlier sometimes exits before all rsync processes have completed

novosirj opened this issue · comments

Here's some output from our most recent run on one of our three campuses. I mentioned in Issue 34 that there seems to be a bug where PFP sometimes exits while rsync processes are still running, which is incorrect. You can see in the output below that PFP exits, and then there are these lines:

* TRANSFER projects parallel rsync wait BEGIN: 2020-05-12_23:41:59
* TRANSFER projects parallel rsync wait END: 2020-05-13_03:20:07

What is happening during that time is a pgrep -x rsync loop, added by my colleague Bill Abbott (probably to deal with this problem), that waits until no rsync processes are running. As you can see, it was more than three and a half hours before all of the rsync processes completed. Here's the full output:
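For readers who want to reproduce that kind of guard, here is a minimal sketch of a wait loop built on pgrep -x rsync. It is not the script used in this report; the function names, the 60-second poll interval, and the injectable `is_running` and `sleep` parameters are all assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable


def rsync_is_running(name: str = "rsync") -> bool:
    """Return True if `pgrep -x NAME` finds at least one exact-name match."""
    # pgrep exits with status 0 when a process matched and 1 when none did.
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_gone(
    is_running: Callable[[], bool] = rsync_is_running,
    poll_seconds: float = 60.0,  # hypothetical interval, not from the report
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until is_running() returns False; return how many polls had to wait."""
    waits = 0
    while is_running():
        waits += 1
        sleep(poll_seconds)
    return waits
```

Calling wait_until_gone() right after parsyncfp returns would hold the wrapper script until every process named exactly rsync has gone. Note that this matches any rsync on the host, not only the ones PFP started.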

* TRANSFER projects parallel BEGIN: 2020-05-12_23:01:04

^[[0m^[[1;35mWARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-backups-projectsc/fpcache].
  Enter ^C to stop this.
  If you specified '--nowait', cache will be cleared in 3s regardless.
  Otherwise, hit [Enter] and I'll clear them.
^[[0m^[[1;34mINFO: The fpart chunk files [/root/.parsyncfp-backups-projectsc/fpcache/f*] are cleared .. continuing.
^[[0m^[[1;34mINFO: The TRIMPATH you specified exists, is a dir, and is readable.
^[[0m^[[1;34mINFO: Alternative file list is readable; converting list to chunks.
^[[0m^[[1;34mINFO: Forking fpart with PID = [14688].  Check [/root/.parsyncfp-backups-projectsc/fpcache/fpart.log.23.01.04_2020-05-12] for errors if it hangs.
^[[0m^[[1;34mINFO: Starting the 1st [12] rsyncs ..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.0]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.1]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.2]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.3]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.4]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.5]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.6]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.7]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.8]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.9]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.10]..
^[[0m^[[1;34mINFO: Starting rsync for chunkfile [/root/.parsyncfp-backups-projectsc/fpcache/f.11]..
^[[0m        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-05-12]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
23.11.52   40.67     6.41     320.95 / 0.00            12    <>   0          [27] of [32]
^[[1;34mINFO: Done.  Please check the target to make sure
  expected files are where they're supposed to be.
^[[0m^[[1;34mINFO:
    The parsyncfp cache dir takes up [277M      /root/.parsyncfp-backups-projectsc]
    Don't forget to delete it, but wait until you are sure that your job
    completed correctly, so you don't need the log files anymore.

Reminder: check the parsyncfp log:
  [/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31]
and the fpart log:
  [/root/.parsyncfp-backups-projectsc/fpcache/fpart.log.23.01.04_2020-05-12]
for errors. Use '--verbose=1' for less output.
You transferred [1.28184e+14 bytes = 116.58267 TB] via all [12] rsyncs.

      Thanks for using parsyncfp. Tell me how to make it better.
                       <hjmangalam@gmail.com>

^[[0m
* TRANSFER projects parallel END: 2020-05-12_23:41:54

* TRANSFER projects parallel rsync wait BEGIN: 2020-05-12_23:41:59
* TRANSFER projects parallel rsync wait END: 2020-05-13_03:20:07
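Two figures in the output above can be checked by hand. The reported total of 1.28184e+14 bytes only matches 116.58267 TB if "TB" means 2^40 bytes, and the gap between the two "rsync wait" timestamps gives the length of the extra wait. The helper names below are made up for this sketch.

```python
from datetime import datetime, timedelta


def bytes_to_tib(n_bytes: float) -> float:
    """Convert a byte count to units of 2**40 bytes."""
    return n_bytes / 2**40


def wait_duration(begin: str, end: str, fmt: str = "%Y-%m-%d_%H:%M:%S") -> timedelta:
    """Elapsed time between two timestamps in the wrapper's log format."""
    return datetime.strptime(end, fmt) - datetime.strptime(begin, fmt)


# Values copied from the output above.
print(f"{bytes_to_tib(1.28184e14):.5f}")
print(wait_duration("2020-05-12_23:41:59", "2020-05-13_03:20:07"))
```

The second value works out to 3 hours, 38 minutes and 8 seconds of rsync activity after PFP had already printed "Done".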

While this rsync loop was running, I checked for running rsync processes:

[root@quorum03 ~]# ps -ef | grep rsyc
root      7548  4342  0 00:09 pts/1    00:00:00 grep --color=auto rsyc
[root@quorum03 ~]# ps -ef | grep rsync
root      4151     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_27 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"
  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.27  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4153  4151 45 May12 ?        00:12:47 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_27 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-project
sc/fpcache/f.27 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4158  4153 33 May12 ?        00:09:17 ssh -x -c aes128-gcm@openssh.com -o Compression=no nas1 rsync --server -slogDtpRe.LsfxC
root      4164     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_28 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"
  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.28  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4166  4164 43 May12 ?        00:12:15 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_28 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-project
sc/fpcache/f.28 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4167  4166 31 May12 ?        00:08:44 ssh -x -c aes128-gcm@openssh.com -o Compression=no nas1 rsync --server -slogDtpRe.LsfxC
root      4177     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_29 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"
  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.29  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4179  4177 44 May12 ?        00:12:23 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_29 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-project
sc/fpcache/f.29 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4180  4179 30 May12 ?        00:08:38 ssh -x -c aes128-gcm@openssh.com -o Compression=no nas1 rsync --server -slogDtpRe.LsfxC
root      4189     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4191  4189 45 May12 ?        00:12:37 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4192  4191 29 May12 ?        00:08:15 ssh -x -c aes128-gcm@openssh.com -o Compression=no nas1 rsync --server -slogDtpRe.LsfxC
root      4201     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4203  4201 41 May12 ?        00:11:35 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4204  4203 25 May12 ?        00:07:13 ssh -x -c aes128-gcm@openssh.com -o Compression=no nas1 rsync --server -slogDtpRe.LsfxC
root      7584  4342  0 00:09 pts/1    00:00:00 grep --color=auto rsync
root     11262 11259  0 May12 ?        00:00:00 /bin/bash -x /usr/local/bin/asb-parsyncfp-0.9.sh
root     14280     1  0 May05 ?        00:00:29 perl /usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait --altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12 --rsyncopts -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no" --maxload 96 --chunksize=10T --fromlist=/gpfs/cache/home/root/policy/projectsc.list.allfiles.clean --trimpath=/projects/.snapshots/projectsc --trustme nas1:/zdata/gss/projectsc
root     25307  6185  0 May12 pts/0    00:00:00 tail -f /var/log/asb-parsyncfp-20200512.log /var/log/asb-parsyncfp-20200512.err
[root@quorum03 ~]# ps -ef | grep rsync | grep f\...
root      4151     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_27 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.27  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4153  4151 45 May12 ?        00:13:15 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_27 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.27 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4164     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_28 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.28  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4166  4164 43 May12 ?        00:12:42 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_28 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.28 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4177     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_29 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.29  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4179  4177 44 May12 ?        00:12:51 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_29 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.29 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4189     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4191  4189 45 May12 ?        00:13:05 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4201     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4203  4201 41 May12 ?        00:12:02 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root     11262 11259  0 May12 ?        00:00:00 /bin/bash -x /usr/local/bin/asb-parsyncfp-0.9.sh
root     14280     1  0 May05 ?        00:00:29 perl /usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait --altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12 --rsyncopts -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no" --maxload 96 --chunksize=10T --fromlist=/gpfs/cache/home/root/policy/projectsc.list.allfiles.clean --trimpath=/projects/.snapshots/projectsc --trustme nas1:/zdata/gss/projectsc
root     25307  6185  0 May12 pts/0    00:00:00 tail -f /var/log/asb-parsyncfp-20200512.log /var/log/asb-parsyncfp-20200512.err
[root@quorum03 ~]# ps -ef | grep rsync | grep f\.3.
root      4189     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4191  4189 45 May12 ?        00:13:09 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_30 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.30 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc
root      4201     1  0 May12 ?        00:00:00 sh -c cd /projects/.snapshots/projectsc && rsync  --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"  --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31  /projects/.snapshots/projectsc  nas1:/zdata/gss/projectsc & echo "${!}" >> /root/.parsyncfp-backups-projectsc/fpcache/rsync-PIDs-23.01.04_2020-05-12
root      4203  4201 41 May12 ?        00:12:05 rsync --bwlimit=1000000 -a -s --log-file=/root/.parsyncfp-backups-projectsc/rsync-logfile-23.01.04_2020-05-12_31 -a -e ssh -x -c aes128-gcm@openssh.com -o Compression=no --files-from=/root/.parsyncfp-backups-projectsc/fpcache/f.31 /projects/.snapshots/projectsc nas1:/zdata/gss/projectsc

As you can see, there are still quite a few: the rsyncs for chunks f.27 through f.31 are all still running.
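The ps output shows each launch appending a PID to an rsync-PIDs-... file in the fpcache directory, so the same check can be done without grepping ps. This is a minimal sketch that assumes the file holds one decimal PID per line; the function names are made up here, and it only tests whether each PID exists, not whether it is still an rsync.

```python
import os
from pathlib import Path
from typing import List


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def live_pids(pid_file: Path) -> List[int]:
    """Return the PIDs listed in pid_file that are still alive."""
    tokens = pid_file.read_text().split()
    pids = [int(tok) for tok in tokens if tok.isdigit()]
    return [pid for pid in pids if pid_is_alive(pid)]
```

A non-empty result from live_pids() after PFP has printed "Done" would be direct evidence of the early exit described here.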

We still have an error message in the fpart.log -- it's not clear to me whether it's related, or how I would go about figuring out what is upsetting it:

[root@quorum03 ~]# cat /root/.parsyncfp-backups-projectsc/fpcache/fpart.log.23.01.04_2020-05-12
Examining filesystem...
Filled part #0: size = 5522565562195, 39213 file(s)
Filled part #1: size = 5541427793660, 97253 file(s)
Filled part #2: size = 5545874130628, 60812 file(s)
Filled part #3: size = 5509769400795, 40104 file(s)
Filled part #4: size = 5499888680149, 56216 file(s)
Filled part #5: size = 5506487195394, 39990 file(s)
Filled part #6: size = 5506709107196, 30690 file(s)
Filled part #7: size = 5506735624280, 24304 file(s)
Filled part #8: size = 5499734697022, 49527 file(s)
Filled part #9: size = 5508541016486, 39374 file(s)
Filled part #10: size = 5497866496719, 55087 file(s)
Filled part #11: size = 6835297759526, 11085 file(s)
Filled part #12: size = 5497819066318, 28734 file(s)
Filled part #13: size = 5511722393336, 58762 file(s)
Filled part #14: size = 5505785371697, 19059 file(s)
Filled part #15: size = 5504441490693, 48614 file(s)
Filled part #16: size = 5512316116286, 39688 file(s)
Filled part #17: size = 5500649696382, 41383 file(s)
Filled part #18: size = 5512471694650, 52756 file(s)
Filled part #19: size = 5553281642311, 112159 file(s)
Filled part #20: size = 5501550103865, 54165 file(s)
Filled part #21: size = 5508397812239, 57320 file(s)
Filled part #22: size = 5577036109281, 23500 file(s)
Filled part #23: size = 5499498469383, 62866 file(s)
Filled part #24: size = 5538987318102, 21086 file(s)
Filled part #25: size = 5498235348656, 21008 file(s)
Filled part #26: size = 5498119523024, 29184 file(s)
Filled part #27: size = 5497635536928, 21893 file(s)
Filled part #28: size = 5501677782525, 44983 file(s)
Filled part #29: size = 5500261233014, 63159 file(s)
Filled part #30: size = 5502396364451, 52938 file(s)
error parsing input values: 
Filled part #31: size = 4576145248049, 96973 file(s)
1493885 file(s) found.

But we have all of the cache files from the run, so we can look them over and run tests in the interim (the transfer finished very fast this week so I have till Tuesday night to tinker).

Yes, the child rsyncs are absolutely children of the PFP process that's still running. We never run more than one PFP at a time on this system, and even on the system that can run more than one we have only been running one at a time. You can also see that the command lines for the rsync processes above correspond to the parsyncfp command line.

I'll be starting up a new round tomorrow night. It looks like the critical line here is 928, so if there's some instrumentation it would be helpful to add here, let me know.

Re: fpart, yes, we use the PFP option to read file sizes from a list. Here is our full command line for PFP:

/usr/local/bin/parsyncfp-1.67 -i ens6 --checkperiod 1800 --nowait --altcache /root/.parsyncfp-backups-$SNAPSHOT --dispose c -NP 12 --rsyncopts '-a -e "ssh -x -c aes128-gcm@openssh.com -o Compression=no"' --maxload 96 --chunksize=5T --fromlist=$HOMEDIR/$SNAPSHOT.list.allfiles.clean --trimpath=/$MOUNTPOINT/.snapshots/$SNAPSHOT --trustme $BACKUPHOST:/zdata/gss/$SNAPSHOT

(I notice --dispose c doesn't seem to work either, but maybe I'm not specifying that correctly?)

I do know that we have some filenames with junk in them -- mostly carriage returns at the end. I tried what you suggested on the fpart log and he's right:

[root@quorum03 .parsyncfp-backups-projectsc]# cat -bet fpcache/fpart.log.23.01.04_2020-05-12 | grep error
    33  error parsing input values: ^I$

Do you know which part should be the problem part? The output makes it unclear whether it's the one before or the one after the error, but neither f.30 nor f.31 seems to contain a ^I:

...
    32  Filled part #30: size = 5502396364451, 52938 file(s)$
    33  error parsing input values: ^I$
    34  Filled part #31: size = 4576145248049, 96973 file(s)$
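Since neither chunk file shows the tab, one option is to scan the original input list instead, on the assumption (not confirmed in this thread) that the rejected line never reached a chunk file. The sketch below flags lines that are empty or whitespace-only, such as a lone tab, and lines containing a carriage return, which are the two kinds of junk mentioned here. The function name is made up for this example.

```python
from pathlib import Path
from typing import List, Tuple


def suspicious_lines(list_path: Path) -> List[Tuple[int, bytes]]:
    """Return (1-based line number, raw line) for lines worth inspecting."""
    hits = []
    # Read as bytes: filenames are not guaranteed to be valid UTF-8.
    with list_path.open("rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            line = raw.rstrip(b"\n")
            whitespace_only = line.strip() == b""  # also true for empty lines
            has_carriage_return = b"\r" in line
            if whitespace_only or has_carriage_return:
                hits.append((lineno, line))
    return hits
```

Printing each hit with repr() makes tabs and carriage returns visible in much the same way cat -bet does.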

Thanks for your assistance!

I'll add prints of those PIDs, etc. before I run this again.
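To make clear what those prints are meant to catch, here is a sketch of the condition under which exiting would be safe. It is not parsyncfp's actual code, which is Perl. The comments map each field to a label from the debug output quoted later in this thread, and the meaning given to each label is my assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopState:
    running_pids: List[int] = field(default_factory=list)    # printed as "rPIDs"
    suspended_pids: List[int] = field(default_factory=list)  # printed as "sPIDs"
    chunks_started: int = 0      # printed as "CUR_FPI" (assumed meaning)
    chunks_available: int = 0    # printed as "nbr_cur_fpc_fles" (assumed meaning)
    fpart_running: bool = False  # printed as "FPART_RUNNING"


def safe_to_exit(state: LoopState) -> bool:
    """True only when nothing is running or suspended and no chunk is left to start."""
    return (
        not state.running_pids
        and not state.suspended_pids
        and not state.fpart_running
        and state.chunks_started >= state.chunks_available
    )
```

Logging the state every time this predicate is evaluated would show which term lets the loop end while rsyncs are still alive.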

When this has gone well, I typically run my script via at, with an & on the invocation of the script. On weeks where I'm manually waiting for last week's run to finish, or I'm trying to keep a closer eye on this, I'll run it like myscript.sh & and then run disown %% so that if my shell is dropped, the command is not. I suspect that may be similar to what you're seeing. I also notice that the rsync processes don't typically die if I kill PFP, so that seems to agree. We don't run PFP itself inside that script with an & or anything.

I know at one time there was a bug where PFP wasn't careful to confirm that one of the rsync PIDs wasn't reused by something else. Since this runs for several days on our system -- I believe this week Tuesday at 23:00 to sometime late Saturday -- there is more chance for that to happen. But I believe you already made changes in that area.
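As an illustration of the PID-reuse concern, here is a sketch of a check that a recorded PID still belongs to an rsync rather than to some unrelated process that inherited the number. It relies on Linux's /proc/PID/comm file. The function name and the `proc_root` parameter are made up for this example, and nothing here is taken from parsyncfp itself.

```python
from pathlib import Path


def pid_runs_command(pid: int, expected: str, proc_root: Path = Path("/proc")) -> bool:
    """Return True only if the PID exists and its command name equals `expected`."""
    try:
        # /proc/PID/comm holds the command name followed by a newline.
        comm = (proc_root / str(pid) / "comm").read_text().strip()
    except OSError:
        # No such PID, or no procfs on this platform.
        return False
    return comm == expected
```

On a multi-day run, a check such as pid_runs_command(pid, "rsync") before counting a PID as running would stop a recycled PID from being mistaken for a live transfer.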

We'll see what it prints out this go 'round.

Re: the fpart error, the filenames don't appear to crash rsync. Another interesting thing that happens in this area (probably better for another ticket -- I only raised the fpart problem at all in case it related to this early exit problem somehow) is that the rsync processes that run within PFP complain about problem filenames, whereas the final rsync that we run with --delete does not. I think the reason is that something upstream -- likely the fpart step -- is dropping special characters, so the file does not exist under the resulting name. The final rsync is silent about these files and does appear to transfer them.

This didn't happen on this run, though I ran it via at without the &. I may go back to the other way next week to see if that exposes whatever is going on.

Here's the output I did capture, at any rate:

Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
Time to debug weird exit:
rPIDs: 14315 22586
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
02.54.30   262.62     4.96      60.26 / 0.00             2    <>   0          [33] of [33]
Time to debug weird exit:
rPIDs:
sPIDs:
CUR_FPI: 33
nbr_cur_fpc_fles: 33
FPART_RUNNING: 0
^[[1;34mINFO: Done.  Please check the target to make sure
  expected files are where they're supposed to be.
^[[0m^[[1;34mINFO:
    The parsyncfp cache dir takes up [439M      /root/.parsyncfp-backups-projectsc]
    Don't forget to delete it, but wait until you are sure that your job
    completed correctly, so you don't need the log files anymore.