hjmangalam / parsyncfp

follow-on to parsync (parallel rsync) with better startup perf

A couple of possibly related problems with parsyncfp 1.52

novosirj opened this issue · comments

Have noticed a little bit of strange behavior recently with parsyncfp 1.52:

  1. We use parsyncfp on a very large filesystem (585TB) and run the following command:

/usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts '--stats -e "ssh -T -c arcfour -o Compression=no -x"' --maxbw=1250000 --startdir '/gpfs/home/.snapshots' home nas1:/zdata/gss

File sizes on this FS vary widely, and this arrangement typically yields in the neighborhood of 2500 chunks, containing anywhere between 2 and 675347 files apiece. However, I've noticed that the number of chunks reported by parsyncfp is larger than the number of chunks that actually exist, at least during this run and a number of runs before it. From the output:

[root@quorum fpcache]# grep -B23 "INFO: Done" /var/log/asb.log
        | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-21]
  Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
16.41.47   3037.82     0.11       0.00        1    <>   0          [2472] of [2472]
16.41.51   3037.88     0.11       0.00        1    <>   0          [2472] of [2472]
16.41.54   3037.93     0.10       0.00        1    <>   0          [2472] of [2472]
16.41.57   3037.98     0.09       0.00        1    <>   0          [2472] of [2472]
16.42.00   3038.03     0.09       0.00        1    <>   0          [2472] of [2472]
16.42.03   3038.08     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.06   3038.13     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.10   3038.20     0.08       0.00        1    <>   0          [2472] of [2472]
16.42.13   3038.25     0.07       0.00        1    <>   0          [2472] of [2472]
16.42.16   3038.30     0.07       0.00        1    <>   0          [2472] of [2472]
16.42.19   3038.35     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.22   3038.40     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.25   3038.45     0.06       0.00        1    <>   0          [2472] of [2472]
16.42.29   3038.52     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.32   3038.57     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.35   3038.62     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.38   3038.67     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
16.42.48   3038.83     0.04       0.00        0    <>   0          [2472] of [2472]

INFO: Done.  Please check the target to make sure 

...but then the fpart.log says there aren't that many:

[root@quorum fpcache]# tail -5 fpart.log.13.49.48_2018-06-19 
Filled part #2468: size = 107551197313, 231318 file(s)
Filled part #2469: size = 107514041195, 393 file(s)
Filled part #2470: size = 117863687441, 853224 file(s)
Filled part #2471: size = 50532682569, 132844 file(s)
141925735 file(s) found.
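
For what it's worth, a quick way to cross-check the two counts is to count the "Filled part" lines in the fpart log directly rather than going by the last part number (just a sketch -- the filename is the one from the run above, and whether fpart numbers its parts from #0 or #1 obviously changes how the last part number maps to a total):

# count how many chunks fpart actually reported creating, then compare
# against the [ToDo] value in the progress display
grep -c '^Filled part' fpart.log.13.49.48_2018-06-19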

In at least one instance, I've found that it was actually an earlier chunk that it was still working on -- for example, I discovered that one chunk (#2462) contained a 95T sparse file, but the final output before I terminated the rsync said:

         | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-07]
   Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
16.44.37   3219.03     1.33       0.00        1    <>   0          [2483] of [2483]
 
INFO: Done.  Please check the target to make sure

  2. Today, parsyncfp appears to have been running for most of the day -- since 09:06 this morning -- without there actually having been any rsync processes associated with it:
        | Elapsed |   1m   |    MB/s   | Running || Susp'd  |      Chunks       [2018-06-21]
  Time  | time(m) |  Load  | [  bond0] |   PIDs  ||  PIDs   | [UpTo] of [ToDo]
09.06.34   2582.60     0.39     101.50        2    <>   0          [2472] of [2472]
09.06.37   2582.65     0.44     207.66        2    <>   0          [2472] of [2472]
09.06.40   2582.70     0.44     232.15        2    <>   0          [2472] of [2472]
09.06.43   2582.75     0.49     138.49        2    <>   0          [2472] of [2472]

Number of files: 873431
Number of files transferred: 266
Total file size: 117863698434 bytes
Total transferred file size: 975369047 bytes
Literal data: 847069151 bytes
Matched data: 128299896 bytes
File list size: 17157260
File list generation time: 1301.935 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 864491626
Total bytes received: 388752

sent 864491626 bytes  received 388752 bytes  618434.31 bytes/sec
total size is 117863698434  speedup is 136.28
09.06.47   2582.82     0.53      25.52        2    <>   0          [2472] of [2472]
09.06.50   2582.87     0.53       0.00        1    <>   0          [2472] of [2472]
09.06.53   2582.92     0.49       0.00        1    <>   0          [2472] of [2472]
09.06.56   2582.97     0.49       0.00        1    <>   0          [2472] of [2472]
09.06.59   2583.02     0.45       0.00        1    <>   0          [2472] of [2472]
09.07.02   2583.07     0.41       0.00        1    <>   0          [2472] of [2472]
09.07.06   2583.13     0.41       0.00        1    <>   0          [2472] of [2472]
09.07.09   2583.18     0.38       0.00        1    <>   0          [2472] of [2472]
09.07.12   2583.23     0.35       0.00        1    <>   0          [2472] of [2472]
09.07.15   2583.28     0.35       0.00        1    <>   0          [2472] of [2472]
09.07.18   2583.33     0.32       0.00        1    <>   0          [2472] of [2472]
09.07.21   2583.38     0.37       0.00        1    <>   0          [2472] of [2472]
09.07.25   2583.45     0.37       0.00        1    <>   0          [2472] of [2472]
09.07.28   2583.50     0.50       0.00        1    <>   0          [2472] of [2472]
09.07.31   2583.55     0.50       0.00        1    <>   0          [2472] of [2472]
09.07.34   2583.60     0.46       0.00        1    <>   0          [2472] of [2472]
09.07.37   2583.65     0.43       0.00        1    <>   0          [2472] of [2472]

...

16.42.38   3038.67     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
16.42.48   3038.83     0.04       0.00        0    <>   0          [2472] of [2472] 

I started looking into this sometime around 16:40, and there were no rsyncs running despite what this output said. I'm not entirely sure what caused it to stop at 16:42. I looked through the log for the period during that chunk when only 1 rsync was reported running; only 3 lines show above 0.00 MB/s:

[root@quorum fpcache]# grep "\\[2472\\] of \\[2472\\]" /var/log/asb.log | grep "1    <>" | sort -n -k4 | tail -5
16.42.41   3038.72     0.05       0.00        1    <>   0          [2472] of [2472]
16.42.44   3038.77     0.04       0.00        1    <>   0          [2472] of [2472]
12.57.57   2813.98     0.09       0.04        1    <>   0          [2472] of [2472]
12.58.00   2814.03     0.09       0.04        1    <>   0          [2472] of [2472]
12.13.13   2769.25     0.09       0.25        1    <>   0          [2472] of [2472]

Am I onto something here, either as two separate problems or one related issue with a couple of symptoms?

Just want to end by saying thank you for your great tool!

I'm experiencing the problem right now: parsyncfp sits around waiting for the last rsync to exit, but there isn't a single one running, and it also reports that it's working on a chunk that doesn't exist:

[root@quorum ~]# ps -ef | grep rsync
root      6906  6904  3 Jul18 pts/0    01:23:03 perl /usr/local/bin/parsyncfp -NP 4 --chunksize=100G --rsyncopts --stats -e "ssh -T -c arcfour -o Compression=no -x" --maxbw=1250000 --startdir /gpfs/home/.snapshots home nas1:/zdata/gss
root     22357  9236  0 22:32 pts/6    00:00:00 grep --color=auto rsync
[root@quorum ~]# 

Any idea? What additional info might be helpful?
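
In the meantime, here's the kind of snapshot I can grab while it's wedged (a sketch -- the fpcache path below is an assumption; it's wherever this run keeps its cache, the default under ~/.parsyncfp or an alternate cache dir if one was given):

# any actual rsync processes (exact name match)
pgrep -x rsync

# the PIDs parsyncfp most recently recorded for its rsyncs
# (directory assumed; the filename is the one from this run)
tail -4 ~/.parsyncfp/fpcache/rsync-PIDs-02.30.22_2018-07-18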

Looks like maybe a couple of PIDs in the rsync PID list either got reused or were incorrectly picked up for inclusion in the list?

[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root     24309   930  0 22:43 ?        00:00:00 sleep 60
root     20472     2  0 21:46 ?        00:00:00 [kworker/8:2]
root     24848     2  0 22:11 ?        00:00:00 [kworker/9:1]
[root@quorum ~]# ps -fp 930
UID        PID  PPID  C STIME TTY          TIME CMD
root       930     1  0 Jun27 ?        00:00:36 /bin/bash /usr/sbin/ksmtuned

Looks more like the former. A few minutes later, the list has changed some:

[root@quorum ~]# cat rsync-PIDs-02.30.22_2018-07-18 | while read pid; do ps -fp $pid | grep -v UID.*PID ; done
root      9954     2  0 22:47 ?        00:00:00 [kworker/1:1]
root     20472     2  0 21:46 ?        00:00:00 [kworker/8:2]
root     24848     2  0 22:11 ?        00:00:00 [kworker/9:1]
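
Here's roughly the check I've been doing by hand to separate real rsyncs from these reused PIDs -- only counting a recorded PID if its current command name is actually rsync (a sketch, and presumably not how parsyncfp itself does the bookkeeping):

# print only the recorded PIDs whose current command is really "rsync";
# reused PIDs like the sleep and kworker entries above get filtered out
while read -r pid; do
    [ "$(ps -o comm= -p "$pid" 2>/dev/null)" = "rsync" ] && echo "$pid"
done < rsync-PIDs-02.30.22_2018-07-18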

I've still got this condition going on right now, so if there's any more data that would be helpful for you to have, I can provide it while it's still stuck. Something always seems to manage to start a new process whose PID collides with one in the list.

If there's no information you can use, any tips on how to get it to successfully complete? :-D

In our case, it's really no big deal. Since the transfer itself actually completed and it's acting on a snapshot, it makes no real difference whether parsyncfp exits cleanly (in fact, it's hard to tell the difference, looking at the logfile now after I killed it), and you really need to run an rsync --delete outside of parsyncfp to pick up deletions anyway.

Thanks for your help and for this great tool!

I have found a new way that this PID situation causes problems: right now, there are only 2 rsync processes -- fewer than the allowed 4:

[root@quorum ~]# pgrep rsync
20803
20814

However, parsyncfp reports 6:

00.06.35   2176.58     2.07       0.01        6    <>   0          [2717] of [2822]

...which I assume is hurting the transfer speed (I generally allow for 4 of them).

I suspect that this is related:

[root@quorum ~]# wc -l ~/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28 
2717 /root/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28

I haven't read that part of the source code carefully, but is it really checking to see if any of those 2717 PIDs are running? If so, I bet that's going to be pretty frequently true.
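
To put a rough number on it, something like this would show how many of those 2717 recorded PIDs currently exist at all versus how many still belong to an actual rsync -- on a box with the default pid_max of 32768, I'd expect quite a few of them to be alive as some unrelated process (again just a sketch, using the same cache file as above):

# compare "PID exists" against "PID is really an rsync" for every
# PID recorded so far in this run
alive=0; rsyncs=0
while read -r pid; do
    if ps -p "$pid" > /dev/null 2>&1; then
        alive=$((alive + 1))
        [ "$(ps -o comm= -p "$pid")" = "rsync" ] && rsyncs=$((rsyncs + 1))
    fi
done < ~/.parsyncfp-backups/fpcache/rsync-PIDs-11.36.53_2018-08-28
echo "alive: $alive   actually rsync: $rsyncs"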