hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

fpart and parsyncfp mismatching parts

ADiEmme opened this issue · comments

Hi all,

We are facing a weird problem. Our parsyncfp job has been running for 18h (which might be OK, considering the amount of data we're trying to back up: 440TB).
Since it looked suspicious, I ran strace on the parsyncfp process, and this is what I noticed:
[root@filler002 fpcache]# strace -ff -tt -s 4096 -p 2858510
strace: Process 2858510 attached
09:05:02.861749 restart_syscall(<... resuming interrupted nanosleep ...>) = 0
09:05:04.483017 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:04.483408 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:04.483648 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:04.483803 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:04.483928 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:06.484248 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:06.484480 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:06.484641 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:06.484823 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:06.484931 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:08.485355 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:08.485584 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:08.485735 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:08.485849 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:08.485963 nanosleep({2, 0}, strace: Process 2858510 detached

So I thought, OK, maybe it's waiting for the parts to be generated. Then I checked for fpart processes and looked at the logs:
[root@filler002 fpcache]# pwd
/root/.parsyncfp-TMC/fpcache
[root@filler002 fpcache]# ps aux | grep -i fpart
root 2992692 0.0 0.0 112716 984 pts/1 S+ 09:05 0:00 grep --color=auto -i fpart
[root@filler002 fpcache]# ls -lah
total 35M
drwxr-xr-x 2 root root 4.0K Nov 11 14:05 .
drwxr-xr-x 3 root root 4.0K Nov 11 14:05 ..
-rw-r--r-- 1 root root 351K Nov 11 13:54 f.0
-rw-r--r-- 1 root root 930K Nov 11 13:54 f.1
-rw-r--r-- 1 root root 1.1M Nov 11 13:58 f.10
-rw-r--r-- 1 root root 2.4M Nov 11 13:59 f.11
-rw-r--r-- 1 root root 2.0M Nov 11 13:59 f.12
-rw-r--r-- 1 root root 1.6M Nov 11 13:59 f.13
-rw-r--r-- 1 root root 2.7M Nov 11 13:59 f.14
-rw-r--r-- 1 root root 1.9M Nov 11 14:00 f.15
-rw-r--r-- 1 root root 1.2M Nov 11 14:00 f.16
-rw-r--r-- 1 root root 3.0M Nov 11 14:00 f.17
-rw-r--r-- 1 root root 1.8M Nov 11 14:00 f.18
-rw-r--r-- 1 root root 803K Nov 11 14:01 f.19
-rw-r--r-- 1 root root 12K Nov 11 13:55 f.2
-rw-r--r-- 1 root root 1.9M Nov 11 14:02 f.20
-rw-r--r-- 1 root root 567K Nov 11 14:02 f.21
-rw-r--r-- 1 root root 1.8M Nov 11 14:03 f.22
-rw-r--r-- 1 root root 2.7M Nov 11 14:03 f.23
-rw-r--r-- 1 root root 1.1M Nov 11 14:04 f.24
-rw-r--r-- 1 root root 1.6M Nov 11 14:05 f.25
-rw-r--r-- 1 root root 1.2M Nov 11 13:56 f.3
-rw-r--r-- 1 root root 493K Nov 11 13:57 f.4
-rw-r--r-- 1 root root 721K Nov 11 13:57 f.5
-rw-r--r-- 1 root root 14K Nov 11 13:57 f.6
-rw-r--r-- 1 root root 64K Nov 11 13:58 f.7
-rw-r--r-- 1 root root 452K Nov 11 13:58 f.8
-rw-r--r-- 1 root root 2.6M Nov 11 13:58 f.9
-rw-r--r-- 1 root root 1.4K Nov 11 14:06 fpart.log.13.54.12_2019-11-11
-rw-r--r-- 1 root root 8 Nov 11 13:54 FP_PIDFILE13.54.12_2019-11-11
-rw-r--r-- 1 root root 208 Nov 11 14:05 rsync-PIDs-13.54.12_2019-11-11

[root@filler002 fpcache]# cat fpart.log.13.54.12_2019-11-11
Examining filesystem...
Filled part #0: size = 488118230349, 119714 file(s)
Filled part #1: size = 488126329044, 117869 file(s)
Filled part #2: size = 488119026333, 329636 file(s)
Filled part #3: size = 488195328885, 189742 file(s)
Filled part #4: size = 489718855986, 126551 file(s)
Filled part #5: size = 523602512613, 92303 file(s)
Filled part #6: size = 507470117001, 26582 file(s)
Filled part #7: size = 564502186104, 115597 file(s)
Filled part #8: size = 552879767274, 66996 file(s)
Filled part #9: size = 488970973281, 39277 file(s)
Filled part #10: size = 506558236575, 120349 file(s)
Filled part #11: size = 488124669226, 60410 file(s)
Filled part #12: size = 507594913183, 64456 file(s)
Filled part #13: size = 488121789469, 71924 file(s)
Filled part #14: size = 488117693970, 65408 file(s)
Filled part #15: size = 546032339031, 66658 file(s)
Filled part #16: size = 493759970422, 73142 file(s)
Filled part #17: size = 489034017302, 154668 file(s)
Filled part #18: size = 492438162214, 178504 file(s)
Filled part #19: size = 488127491874, 109397 file(s)
Filled part #20: size = 488210804913, 136003 file(s)
Filled part #21: size = 488125298794, 98261 file(s)
Filled part #22: size = 505097570795, 113938 file(s)
Filled part #23: size = 488117458189, 227673 file(s)
Filled part #24: size = 577134074354, 234131 file(s)
Filled part #25: size = 386946098762, 157991 file(s)
3157180 file(s) found.

It looks like there's a mismatch here: fpart produced chunkfiles f.0 through f.25 (parts #0 to #25), but parsyncfp is waiting for f.26, which is never going to be created.
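For reference, a quick shell check along these lines makes the mismatch explicit (a sketch only, run from the fpcache directory, with the log name taken from the listing above):

# Compare the parts fpart says it filled against the chunkfiles on disk.
grep -c '^Filled part' fpart.log.13.54.12_2019-11-11   # -> 26 (parts #0..#25)
ls f.* | wc -l                                         # -> 26 (f.0 .. f.25)
# Meanwhile strace shows parsyncfp polling for f.26, i.e. a 27th
# chunkfile that fpart never reported filling.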
Have you ever faced this kind of situation before?

Some details about our environment:
[root@filler002 fpcache]# ps aux | grep -i parsyncfp
root 2858510 0.0 0.0 137324 6100 ? S Nov11 0:04 perl /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aAX --chunksize=-488116876476 --altcache=/root/.parsyncfp-TMC --startdir=/mnt/beegfs/data TMC /mnt/cephfs/backup/data
root 2992869 0.0 0.0 112716 984 pts/1 S+ 09:06 0:00 grep --color=auto -i parsyncfp
[root@filler002 fpcache]# uname -r
3.10.0-1062.4.1.el7.x86_64
[root@filler002 fpcache]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@filler002 fpcache]# parsyncfp -V
Option v requires an argument

parsyncfp version 1.60 (Dullnig)
June 24, 2019
by Harry Mangalam hjmangalam@gmail.com

How do you suggest we move forward from this point?

Thank you in advance!

Hi Harry,

Thank you for your fast reply! :)

Getting back to the issue, I looked at the logs, but I don't see anything wrong:
[root@bezavrdat-master01 logs]# tac parsync-60784

+ exit 0
+ rm -rf /root/.parsyncfp-prius
Job finished well, and happy: cleaning up.
+ printf '%s\n' 'Job finished well, and happy: cleaning up.'
+ '[' 0 -eq 0 ']'
hjmangalam@gmail.com
Thanks for using parsyncfp. Tell me how to make it better.
if there were errors. Use '--verbose=1' for less output.
and the fpart log [/root/.parsyncfp-prius/fpcache/fpart.log.12.56.27_2019-11-12]
Reminder: check the parsyncfp log [/root/.parsyncfp-prius/rsync-logfile-12.56.27_2019-11-12_8]
completed correctly, so you don't need the log files anymore.
Don't forget to delete it, but wait until you are sure that your job
The parsyncfp cache dir takes up [4.2M /root/.parsyncfp-prius]
INFO:
expected files are where they're supposed to be.
INFO: Done. Please check the target to make sure
12.57.15 0.05 1.85 0.00 / 0.00 2 <> 0 [8] of [8]
Time | time(m) | Load | TCP / RDMA out | PIDs || PIDs | [UpTo] of [ToDo]
| Elapsed | 1m | [ enp3s0] MB/s | Running || Susp'd | Chunks [2019-11-12]
INFO: Starting the 1st [8] rsyncs ..
INFO: Forking fpart. Check [/root/.parsyncfp-prius/fpcache/fpart.log.12.56.27_2019-11-12] for errors if it hangs.
INFO: The fpart chunk files [/root/.parsyncfp-prius/fpcache/f*] are cleared .. continuing.
Otherwise, hit [Enter] and I'll clear them.
If you specified '--nowait', cache will be cleared in 3s regardless.
Enter ^C to stop this.
WARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-prius/fpcache].
+ /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aA --chunksize=-489178234094 --altcache=/root/.parsyncfp-prius --startdir=/mnt/beegfs/data prius /mnt/cephfs/backup/data
Found some left over files in /root/.parsyncfp-prius.
+ printf '%s\n' 'Found some left over files in /root/.parsyncfp-prius.'
+ '[' -d /root/.parsyncfp-prius ']'
+ dest_dir=data
+ base_dir=prius
++ basename /mnt/beegfs/data/prius
+ start_dir=/mnt/beegfs/data
++ dirname /mnt/beegfs/data/prius
+ chunck=489178234094
++ printf %.0f 489178234093.56
++ LC_NUMERIC=en_US.UTF-8
+++ echo 'scale=2;477713119232*1024/1000'
++++ tail -n1
++++ df --output=used /mnt/beegfs
+++ bc

(I used tac instead of cat, so the trace reads bottom-up; the log file was very big.)
You will see that in the end it completed; that's because I touched the missing f.x file by hand so it would quit the wait loop.
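In case it's useful to anyone else, this is the shape of that workaround (a sketch, not a fix; 26 is just the index from the strace above, so substitute whichever f.N your parsyncfp is stat()ing):

# Unblock a parsyncfp that is stuck waiting for a chunkfile that will
# never appear: create it empty so the wait loop can finish. The rsync
# fed from the empty chunkfile should then have nothing to transfer.
N=26                                    # example index; take yours from strace
touch /root/.parsyncfp-TMC/fpcache/f.$N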

Regarding the chunk size: some advice would be welcome. We have ~440TB of data to move around, so we divided the total size by 1000 (the number of parts we wanted). Do you think that's the right approach? At 2000 parts we get the warning, so is it better not to risk it?
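For context, this is how we derive the number we pass to --chunksize (reconstructed from the bash trace above, which reads bottom-up because of tac):

# Used space on the source filesystem, in 1K blocks:
USED_KB=$(df --output=used /mnt/beegfs | tail -n1)
# Convert to bytes and divide by the number of parts we want (1000 here):
CHUNK=$(printf '%.0f' "$(echo "scale=2; ${USED_KB}*1024/1000" | bc)")
echo "--chunksize=${CHUNK}"             # 489178234094 in the run above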

Another thing: we actually have both shares mounted locally, so no network is involved.

2 - if it's an active filesystem and something happened to the files during the fpart
recursive descent, fpart may have gotten confused and coughed up a digital hairball
instead of the final chunkfile. The files that were supposed to go in the final file were
deleted or moved as the file was being prepped. I have seen one case where this
(supposedly) happened. This isn't a bug per se, but a result of file jitter.
What can we do in this case?
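For next time, I guess we can at least check whether fpart died partway through instead of finishing its pass (a hand-rolled sketch; filenames are from the TMC run above, and I'm assuming FP_PIDFILE* holds the PID of the forked fpart, as its name suggests):

# A completed scan ends the log with "N file(s) found.", as in the logs above:
tail -n1 fpart.log.13.54.12_2019-11-11
# Is the fpart that parsyncfp forked still alive?
kill -0 "$(cat FP_PIDFILE13.54.12_2019-11-11)" 2>/dev/null \
  && echo "fpart still running" || echo "fpart has exited"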

Meanwhile, I am trying another run using version 1.61; I hope this works!

Short update:
I just restarted using 1.61 and the outcome didn't change.
But changing the chunk size to ~440TB / 2000 instead of 1000 seems to have improved the situation.
Any idea why this could happen?

I'm seeing something a little similar, and it appears to have started only in the last couple of weeks. I can't recall whether I upgraded recently; I should have kept track. Here's my fpart.log:

[root@quorum03 fpcache]# cat fpart.log.23.01.14_2020-04-21
Examining filesystem...
Filled part #0: size = 11033969843194, 137378 file(s)
Filled part #1: size = 11010341095020, 96042 file(s)
Filled part #2: size = 10995456710513, 99478 file(s)
Filled part #3: size = 10997042249668, 57445 file(s)
Filled part #4: size = 10996205514131, 83592 file(s)
Filled part #5: size = 11235699669681, 72848 file(s)
Filled part #6: size = 10999193108455, 78821 file(s)
Filled part #7: size = 10999539432427, 35181 file(s)
Filled part #8: size = 11006485913550, 81898 file(s)
Filled part #9: size = 10995307950334, 109058 file(s)
Filled part #10: size = 11003985383714, 151601 file(s)
Filled part #11: size = 11123773319829, 99136 file(s)
Filled part #12: size = 10999398939451, 63104 file(s)
Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values:
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.

I'm not sure I checked this file the previous times, so I can't tell you whether the problem correlates with the above error. It seems to me I've seen the latter without the former, but I'm not sure.

Then I'm seeing this in the parsyncfp output:

        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-04-21]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
05.34.05   422.72     1.72       5.00 / 0.00             3    <>   0          [16] of [16]
 INFO: Waiting [31502]s for next chunkfile..

In my experience, it will never finish waiting; the counter just increments forever. I've been killing off the parsyncfp script at the end, once there are no more rsyncs running, but obviously it would be better to avoid that.
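Roughly what I've been doing to clean up, as a sketch (it assumes the rsync-PIDs-* file in fpcache lists one rsync PID per line, which is what it looks like here):

# Wait for the rsyncs parsyncfp already launched to drain, then kill the
# wrapper itself. Note: finished PIDs can be reused by unrelated
# processes, so this is only good enough for interactive use.
cd /root/.parsyncfp-backups/fpcache
PIDS=$(cat rsync-PIDs-* | tr -s ' \n' ',' | sed 's/,$//')
while ps -p "$PIDS" > /dev/null 2>&1; do
    sleep 60                            # at least one rsync still running
done
pkill -f 'perl /usr/bin/parsyncfp'      # the stuck wrapper is all that's left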

Thanks, Harry -- I hadn't considered that a new file could be causing the problem. It really does seem to be parsyncfp that's having the problem, though. This counter keeps incrementing:

INFO: Waiting [239372]s for next chunkfile..

...and fpart is no longer running:

[root@quorum03 fpcache]# ps -ef | grep fpart
root     25927  4342  0 00:39 pts/1    00:00:00 grep --color=auto fpart

...or am I misunderstanding what you were saying?

Thank you! We use this software a lot (Bill Abbott is a colleague of mine -- I know the two of you worked on reading from a list vs. walking the filesystem).

I'm still running the copy that I wrote about (this is a large transfer), so I can check whatever else you want without waiting until next time. There are still rsyncs running, so it doesn't hold me up any. As I mentioned above, nothing even containing 'fpart' is running. I assume $FPART_PID can also be found in the file called FP_PIDFILE*?

If so:

total 129464
drwxr-xr-x 2 root root      300 Apr 21 23:42 .
drwxr-xr-x 3 root root     4096 Apr 21 23:42 ..
-rw-r--r-- 1 root root 12754535 Apr 21 23:01 f.0
-rw-r--r-- 1 root root  8882982 Apr 21 23:01 f.1
-rw-r--r-- 1 root root 13988256 Apr 21 23:01 f.10
-rw-r--r-- 1 root root  9335152 Apr 21 23:01 f.11
-rw-r--r-- 1 root root  5728710 Apr 21 23:42 f.12
-rw-r--r-- 1 root root  4113466 Apr 21 23:42 f.13
-rw-r--r-- 1 root root 10384947 Apr 21 23:42 f.14
-rw-r--r-- 1 root root  9944264 Apr 21 23:42 f.15
-rw-r--r-- 1 root root  9261208 Apr 21 23:01 f.2
-rw-r--r-- 1 root root  5349997 Apr 21 23:01 f.3
-rw-r--r-- 1 root root  7775218 Apr 21 23:01 f.4
-rw-r--r-- 1 root root  6749460 Apr 21 23:01 f.5
-rw-r--r-- 1 root root  7301965 Apr 21 23:01 f.6
-rw-r--r-- 1 root root  3237238 Apr 21 23:01 f.7
-rw-r--r-- 1 root root  7605602 Apr 21 23:01 f.8
-rw-r--r-- 1 root root 10107389 Apr 21 23:01 f.9
-rw-r--r-- 1 root root      935 Apr 21 23:01 fpart.log.23.01.14_2020-04-21
-rw-r--r-- 1 root root        5 Apr 21 23:01 FP_PIDFILE23.01.14_2020-04-21
-rw-r--r-- 1 root root       84 Apr 21 23:42 rsync-PIDs-23.01.14_2020-04-21
[root@quorum03 fpcache]# ps ux | grep fpar[t] | grep $(cat FP_PIDFILE23.01.14_2020-04-21) | wc -l
0
[root@quorum03 fpcache]# ps aux | grep fpar
root     15263  0.0  0.0 112708   976 pts/1    S+   18:57   0:00 grep --color=auto fpar
[root@quorum03 fpcache]# ps aux | grep fpar[t]
[root@quorum03 fpcache]#

Just for fun, I looked at what parsyncfp is doing:

strace: Process 5439 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Looks like about what you'd expect, but fpart only wrote f.0 through f.15, so why parsyncfp thinks there's going to be an f.16 at this point, I don't know.

I haven't had a chance to try the newer version yet, which I should be able to do for tomorrow's backup.

FYI, this week running 1.61, all three campuses eventually wound up in the endless "waiting for chunkfile" state:

02.40.06   6569.20     1.03      42.09 / 0.00            10    <>   0          [43] of [43]
03.24.38   6613.73     1.59      28.85 / 0.00            10    <>   0          [43] of [43]
04.09.12   6658.30     1.15      48.02 / 0.00            10    <>   0          [43] of [43]
04.53.54   6703.00     1.74      40.93 / 0.00            10    <>   0          [43] of [43]
05.38.39   6747.75     1.58      30.90 / 0.00            10    <>   0          [43] of [43]
06.23.00   6792.10     1.15      28.70 / 0.00            10    <>   0          [43] of [43]

==> /var/log/asb-parsyncfp-20200429.err <==
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]

==> /var/log/asb-parsyncfp-20200429.log <==
 INFO: Waiting [16872]s for next chunkfile..

As you can see, this one ran for a really long time before that happened. I think this problem appeared between 1.55 and 1.61, but it's hard to be sure, since the content being copied could have changed instead and caused the problem (maybe a little less likely given that it happened on all three campuses).