hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

fpart and parsyncfp mismatching parts

ADiEmme opened this issue · comments

Hi all,

We are facing a weird problem. Our parsyncfp job has been running for 18h (which might be OK, considering the amount of data we're trying to back up: 440TB).
Since it looked suspicious, I ran strace on the parsyncfp process, and this is what I noticed:
[root@filler002 fpcache]# strace -ff -tt -s 4096 -p 2858510
strace: Process 2858510 attached
09:05:02.861749 restart_syscall(<... resuming interrupted nanosleep ...>) = 0
09:05:04.483017 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:04.483408 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:04.483648 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:04.483803 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:04.483928 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:06.484248 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:06.484480 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:06.484641 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:06.484823 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:06.484931 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:08.485355 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:08.485584 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:08.485735 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:08.485849 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:08.485963 nanosleep({2, 0}, strace: Process 2858510 detached

So I thought, OK, maybe it's waiting for the parts to be generated. Then I checked for fpart processes and looked at the logs:
[root@filler002 fpcache]# pwd
/root/.parsyncfp-TMC/fpcache
[root@filler002 fpcache]# ps aux | grep -i fpart
root 2992692 0.0 0.0 112716 984 pts/1 S+ 09:05 0:00 grep --color=auto -i fpart
[root@filler002 fpcache]# ls -lah
total 35M
drwxr-xr-x 2 root root 4.0K Nov 11 14:05 .
drwxr-xr-x 3 root root 4.0K Nov 11 14:05 ..
-rw-r--r-- 1 root root 351K Nov 11 13:54 f.0
-rw-r--r-- 1 root root 930K Nov 11 13:54 f.1
-rw-r--r-- 1 root root 1.1M Nov 11 13:58 f.10
-rw-r--r-- 1 root root 2.4M Nov 11 13:59 f.11
-rw-r--r-- 1 root root 2.0M Nov 11 13:59 f.12
-rw-r--r-- 1 root root 1.6M Nov 11 13:59 f.13
-rw-r--r-- 1 root root 2.7M Nov 11 13:59 f.14
-rw-r--r-- 1 root root 1.9M Nov 11 14:00 f.15
-rw-r--r-- 1 root root 1.2M Nov 11 14:00 f.16
-rw-r--r-- 1 root root 3.0M Nov 11 14:00 f.17
-rw-r--r-- 1 root root 1.8M Nov 11 14:00 f.18
-rw-r--r-- 1 root root 803K Nov 11 14:01 f.19
-rw-r--r-- 1 root root 12K Nov 11 13:55 f.2
-rw-r--r-- 1 root root 1.9M Nov 11 14:02 f.20
-rw-r--r-- 1 root root 567K Nov 11 14:02 f.21
-rw-r--r-- 1 root root 1.8M Nov 11 14:03 f.22
-rw-r--r-- 1 root root 2.7M Nov 11 14:03 f.23
-rw-r--r-- 1 root root 1.1M Nov 11 14:04 f.24
-rw-r--r-- 1 root root 1.6M Nov 11 14:05 f.25
-rw-r--r-- 1 root root 1.2M Nov 11 13:56 f.3
-rw-r--r-- 1 root root 493K Nov 11 13:57 f.4
-rw-r--r-- 1 root root 721K Nov 11 13:57 f.5
-rw-r--r-- 1 root root 14K Nov 11 13:57 f.6
-rw-r--r-- 1 root root 64K Nov 11 13:58 f.7
-rw-r--r-- 1 root root 452K Nov 11 13:58 f.8
-rw-r--r-- 1 root root 2.6M Nov 11 13:58 f.9
-rw-r--r-- 1 root root 1.4K Nov 11 14:06 fpart.log.13.54.12_2019-11-11
-rw-r--r-- 1 root root 8 Nov 11 13:54 FP_PIDFILE13.54.12_2019-11-11
-rw-r--r-- 1 root root 208 Nov 11 14:05 rsync-PIDs-13.54.12_2019-11-11

[root@filler002 fpcache]# cat fpart.log.13.54.12_2019-11-11
Examining filesystem...
Filled part #0: size = 488118230349, 119714 file(s)
Filled part #1: size = 488126329044, 117869 file(s)
Filled part #2: size = 488119026333, 329636 file(s)
Filled part #3: size = 488195328885, 189742 file(s)
Filled part #4: size = 489718855986, 126551 file(s)
Filled part #5: size = 523602512613, 92303 file(s)
Filled part #6: size = 507470117001, 26582 file(s)
Filled part #7: size = 564502186104, 115597 file(s)
Filled part #8: size = 552879767274, 66996 file(s)
Filled part #9: size = 488970973281, 39277 file(s)
Filled part #10: size = 506558236575, 120349 file(s)
Filled part #11: size = 488124669226, 60410 file(s)
Filled part #12: size = 507594913183, 64456 file(s)
Filled part #13: size = 488121789469, 71924 file(s)
Filled part #14: size = 488117693970, 65408 file(s)
Filled part #15: size = 546032339031, 66658 file(s)
Filled part #16: size = 493759970422, 73142 file(s)
Filled part #17: size = 489034017302, 154668 file(s)
Filled part #18: size = 492438162214, 178504 file(s)
Filled part #19: size = 488127491874, 109397 file(s)
Filled part #20: size = 488210804913, 136003 file(s)
Filled part #21: size = 488125298794, 98261 file(s)
Filled part #22: size = 505097570795, 113938 file(s)
Filled part #23: size = 488117458189, 227673 file(s)
Filled part #24: size = 577134074354, 234131 file(s)
Filled part #25: size = 386946098762, 157991 file(s)
3157180 file(s) found.

It looks like there's a mismatch here: fpart produced chunkfiles f.0 through f.25 (parts #0 to #25), but parsyncfp is waiting for f.26, which is never going to be created.
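For reference, a quick shell check along these lines makes the mismatch explicit (a sketch only, run from the fpcache directory, with the log name taken from the listing above):

# Compare the parts fpart says it filled against the chunkfiles on disk.
grep -c '^Filled part' fpart.log.13.54.12_2019-11-11   # -> 26 (parts #0..#25)
ls f.* | wc -l                                         # -> 26 (f.0 .. f.25)
# Meanwhile strace shows parsyncfp polling for f.26, i.e. a 27th
# chunkfile that fpart never reported filling.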
Have you ever faced this kind of situation before?

Some details about our environment:
[root@filler002 fpcache]# ps aux | grep -i parsyncfp
root 2858510 0.0 0.0 137324 6100 ? S Nov11 0:04 perl /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aAX --chunksize=-488116876476 --altcache=/root/.parsyncfp-TMC --startdir=/mnt/beegfs/data TMC /mnt/cephfs/backup/data
root 2992869 0.0 0.0 112716 984 pts/1 S+ 09:06 0:00 grep --color=auto -i parsyncfp
[root@filler002 fpcache]# uname -r
3.10.0-1062.4.1.el7.x86_64
[root@filler002 fpcache]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@filler002 fpcache]# parsyncfp -V
Option v requires an argument

parsyncfp version 1.60 (Dullnig)
June 24, 2019
by Harry Mangalam hjmangalam@gmail.com

How do you suggest we move forward from this point?

Thank you in advance!

Hi Harry,

Thank you for your fast reply! :)

Getting back to the issue, I looked at the logs, but I don't see anything wrong:
[root@bezavrdat-master01 logs]# tac parsync-60784

+ exit 0
+ rm -rf /root/.parsyncfp-prius
Job finished well, and happy: cleaning up.
+ printf '%s\n' 'Job finished well, and happy: cleaning up.'
+ '[' 0 -eq 0 ']'
hjmangalam@gmail.com
Thanks for using parsyncfp. Tell me how to make it better.
if there were errors. Use '--verbose=1' for less output.
and the fpart log [/root/.parsyncfp-prius/fpcache/fpart.log.12.56.27_2019-11-12]
Reminder: check the parsyncfp log [/root/.parsyncfp-prius/rsync-logfile-12.56.27_2019-11-12_8]
completed correctly, so you don't need the log files anymore.
Don't forget to delete it, but wait until you are sure that your job
The parsyncfp cache dir takes up [4.2M /root/.parsyncfp-prius]
INFO:
expected files are where they're supposed to be.
INFO: Done. Please check the target to make sure
12.57.15 0.05 1.85 0.00 / 0.00 2 <> 0 [8] of [8]
Time | time(m) | Load | TCP / RDMA out | PIDs || PIDs | [UpTo] of [ToDo]
| Elapsed | 1m | [ enp3s0] MB/s | Running || Susp'd | Chunks [2019-11-12]
INFO: Starting the 1st [8] rsyncs ..
INFO: Forking fpart. Check [/root/.parsyncfp-prius/fpcache/fpart.log.12.56.27_2019-11-12] for errors if it hangs.
INFO: The fpart chunk files [/root/.parsyncfp-prius/fpcache/f*] are cleared .. continuing.
Otherwise, hit [Enter] and I'll clear them.
If you specified '--nowait', cache will be cleared in 3s regardless.
Enter ^C to stop this.
WARN: About to remove all the old cached chunkfiles from [/root/.parsyncfp-prius/fpcache].
+ /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aA --chunksize=-489178234094 --altcache=/root/.parsyncfp-prius --startdir=/mnt/beegfs/data prius /mnt/cephfs/backup/data
Found some left over files in /root/.parsyncfp-prius.
+ printf '%s\n' 'Found some left over files in /root/.parsyncfp-prius.'
+ '[' -d /root/.parsyncfp-prius ']'
+ dest_dir=data
+ base_dir=prius
++ basename /mnt/beegfs/data/prius
+ start_dir=/mnt/beegfs/data
++ dirname /mnt/beegfs/data/prius
+ chunck=489178234094
++ printf %.0f 489178234093.56
++ LC_NUMERIC=en_US.UTF-8
+++ echo 'scale=2;477713119232*1024/1000'
++++ tail -n1
++++ df --output=used /mnt/beegfs
+++ bc

(I used tac instead of cat, so the trace reads bottom-up; the log file was very big.)
You will see that in the end it completed; that's because I touched the missing f.x file by hand so it would quit the wait loop.
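In case it's useful to anyone else, this is the shape of that workaround (a sketch, not a fix; 26 is just the index from the strace above, so substitute whichever f.N your parsyncfp is stat()ing):

# Unblock a parsyncfp that is stuck waiting for a chunkfile that will
# never appear: create it empty so the wait loop can finish. The rsync
# fed from the empty chunkfile should then have nothing to transfer.
N=26                                    # example index; take yours from strace
touch /root/.parsyncfp-TMC/fpcache/f.$N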

Regarding the chunk size: some advice would be welcome. We have ~440TB of data to move around, so we divided the total size by 1000 (the number of parts we wanted). Do you think that's the right approach? At 2000 parts we get the warning, so is it better not to risk it?
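For context, this is how we derive the number we pass to --chunksize (reconstructed from the bash trace above, which reads bottom-up because of tac):

# Used space on the source filesystem, in 1K blocks:
USED_KB=$(df --output=used /mnt/beegfs | tail -n1)
# Convert to bytes and divide by the number of parts we want (1000 here):
CHUNK=$(printf '%.0f' "$(echo "scale=2; ${USED_KB}*1024/1000" | bc)")
echo "--chunksize=${CHUNK}"             # 489178234094 in the run above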

Another thing: we actually have both shares mounted locally, so no network is involved.

2 - if it's an active filesystem and something happened to the files during the fpart
recursive descent, fpart may have gotten confused and coughed up a digital hairball
instead of the final chunkfile. The files that were supposed to go in the final file were
deleted or moved as the file was being prepped. I have seen one case where this
(supposedly) happened. This isn't a bug per se, but a result of file jitter.
What can we do in this case?
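For next time, I guess we can at least check whether fpart died partway through instead of finishing its pass (a hand-rolled sketch; filenames are from the TMC run above, and I'm assuming FP_PIDFILE* holds the PID of the forked fpart, as its name suggests):

# A completed scan ends the log with "N file(s) found.", as in the logs above:
tail -n1 fpart.log.13.54.12_2019-11-11
# Is the fpart that parsyncfp forked still alive?
kill -0 "$(cat FP_PIDFILE13.54.12_2019-11-11)" 2>/dev/null \
  && echo "fpart still running" || echo "fpart has exited"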

Meanwhile, I am trying another run using version 1.61; I hope this works!

Short update:
I just restarted using 1.61 and the outcome didn't change.
But changing the chunk size to ~440TB / 2000 instead of 1000 seems to have improved the situation.
Any idea why this could happen?

I'm seeing something a little similar, and it appears to have started only in the last couple of weeks. I can't recall whether I upgraded recently; I should have kept track. Here's my fpart.log:

[root@quorum03 fpcache]# cat fpart.log.23.01.14_2020-04-21
Examining filesystem...
Filled part #0: size = 11033969843194, 137378 file(s)
Filled part #1: size = 11010341095020, 96042 file(s)
Filled part #2: size = 10995456710513, 99478 file(s)
Filled part #3: size = 10997042249668, 57445 file(s)
Filled part #4: size = 10996205514131, 83592 file(s)
Filled part #5: size = 11235699669681, 72848 file(s)
Filled part #6: size = 10999193108455, 78821 file(s)
Filled part #7: size = 10999539432427, 35181 file(s)
Filled part #8: size = 11006485913550, 81898 file(s)
Filled part #9: size = 10995307950334, 109058 file(s)
Filled part #10: size = 11003985383714, 151601 file(s)
Filled part #11: size = 11123773319829, 99136 file(s)
Filled part #12: size = 10999398939451, 63104 file(s)
Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values:
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.

I'm not sure I checked this file the previous times, so I can't tell you whether the problem correlates with the above error. It seems to me I've seen the latter without the former, but I'm not sure.

Then I'm seeing this in the parsyncfp output:

        | Elapsed |   1m   |    [   ens6]   MB/s  | Running || Susp'd  |      Chunks       [2020-04-21]
  Time  | time(m) |  Load  |     TCP / RDMA  out  |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
05.34.05   422.72     1.72       5.00 / 0.00             3    <>   0          [16] of [16]
 INFO: Waiting [31502]s for next chunkfile..

In my experience, it will never finish waiting; the counter just increments forever. I've been killing off the parsyncfp script at the end, once there are no more rsyncs running, but obviously it would be better to avoid that.
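Roughly what I've been doing to clean up, as a sketch (it assumes the rsync-PIDs-* file in fpcache lists one rsync PID per line, which is what it looks like here):

# Wait for the rsyncs parsyncfp already launched to drain, then kill the
# wrapper itself. Note: finished PIDs can be reused by unrelated
# processes, so this is only good enough for interactive use.
cd /root/.parsyncfp-backups/fpcache
PIDS=$(cat rsync-PIDs-* | tr -s ' \n' ',' | sed 's/,$//')
while ps -p "$PIDS" > /dev/null 2>&1; do
    sleep 60                            # at least one rsync still running
done
pkill -f 'perl /usr/bin/parsyncfp'      # the stuck wrapper is all that's left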

Thanks, Harry -- I hadn't considered that a new file could be causing the problem. It really does seem to be parsyncfp that's having the problem, though. This counter keeps incrementing:

INFO: Waiting [239372]s for next chunkfile..

...and fpart is no longer running:

[root@quorum03 fpcache]# ps -ef | grep fpart
root     25927  4342  0 00:39 pts/1    00:00:00 grep --color=auto fpart

...or am I misunderstanding what you were saying?

Thank you! We use this software a lot (Bill Abbott is a colleague of mine -- I know the two of you worked on reading from a list vs. walking the filesystem).

I'm still running the copy that I wrote about (this is a large transfer), so I can check whatever else you want without waiting until next time. There are still rsyncs running, so it doesn't hold me up any. As I mentioned above, nothing even containing 'fpart' is running. I assume $FPART_PID can also be found in the file called FP_PIDFILE*?

If so:

total 129464
drwxr-xr-x 2 root root      300 Apr 21 23:42 .
drwxr-xr-x 3 root root     4096 Apr 21 23:42 ..
-rw-r--r-- 1 root root 12754535 Apr 21 23:01 f.0
-rw-r--r-- 1 root root  8882982 Apr 21 23:01 f.1
-rw-r--r-- 1 root root 13988256 Apr 21 23:01 f.10
-rw-r--r-- 1 root root  9335152 Apr 21 23:01 f.11
-rw-r--r-- 1 root root  5728710 Apr 21 23:42 f.12
-rw-r--r-- 1 root root  4113466 Apr 21 23:42 f.13
-rw-r--r-- 1 root root 10384947 Apr 21 23:42 f.14
-rw-r--r-- 1 root root  9944264 Apr 21 23:42 f.15
-rw-r--r-- 1 root root  9261208 Apr 21 23:01 f.2
-rw-r--r-- 1 root root  5349997 Apr 21 23:01 f.3
-rw-r--r-- 1 root root  7775218 Apr 21 23:01 f.4
-rw-r--r-- 1 root root  6749460 Apr 21 23:01 f.5
-rw-r--r-- 1 root root  7301965 Apr 21 23:01 f.6
-rw-r--r-- 1 root root  3237238 Apr 21 23:01 f.7
-rw-r--r-- 1 root root  7605602 Apr 21 23:01 f.8
-rw-r--r-- 1 root root 10107389 Apr 21 23:01 f.9
-rw-r--r-- 1 root root      935 Apr 21 23:01 fpart.log.23.01.14_2020-04-21
-rw-r--r-- 1 root root        5 Apr 21 23:01 FP_PIDFILE23.01.14_2020-04-21
-rw-r--r-- 1 root root       84 Apr 21 23:42 rsync-PIDs-23.01.14_2020-04-21
[root@quorum03 fpcache]# ps ux | grep fpar[t] | grep $(cat FP_PIDFILE23.01.14_2020-04-21) | wc -l
0
[root@quorum03 fpcache]# ps aux | grep fpar
root     15263  0.0  0.0 112708   976 pts/1    S+   18:57   0:00 grep --color=auto fpar
[root@quorum03 fpcache]# ps aux | grep fpar[t]
[root@quorum03 fpcache]#

Just for fun, I looked at what parsyncfp is doing:

strace: Process 5439 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460)       = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Looks like about what you'd expect, but fpart only wrote f.0 through f.15, so why parsyncfp thinks there's going to be an f.16 at this point, I don't know.

I haven't had a chance to try the newer version yet, which I should be able to do for tomorrow's backup.

FYI, this week running 1.61, all three campuses eventually wound up in the endless "waiting for chunkfile" state:

02.40.06   6569.20     1.03      42.09 / 0.00            10    <>   0          [43] of [43]
03.24.38   6613.73     1.59      28.85 / 0.00            10    <>   0          [43] of [43]
04.09.12   6658.30     1.15      48.02 / 0.00            10    <>   0          [43] of [43]
04.53.54   6703.00     1.74      40.93 / 0.00            10    <>   0          [43] of [43]
05.38.39   6747.75     1.58      30.90 / 0.00            10    <>   0          [43] of [43]
06.23.00   6792.10     1.15      28.70 / 0.00            10    <>   0          [43] of [43]

==> /var/log/asb-parsyncfp-20200429.err <==
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]

==> /var/log/asb-parsyncfp-20200429.log <==
 INFO: Waiting [16872]s for next chunkfile..

As you can see, this one ran for a really long time before that happened. I think this problem appeared between 1.55 and 1.61, but it's hard to be sure, since the content being copied could have changed instead and caused the problem (maybe a little less likely given that it happened on all three campuses).