pkolaczk / fclones

Efficient Duplicate File Finder

Why does running a group after a dedupe hash everything again?

patrickwolf opened this issue

Running group on 130 TB takes ~2 days
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Running it again takes 15 minutes
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Doing deduplication takes 10 minutes
fclones dedupe --path '/ex2/Reviews/**' -o /ex2/_Data/fclones_dd.txt --priority least-recently-modified < /ex2/_Data/fclones.json

**Running group again takes 1+ day**
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Why do the files need to be re-hashed after a dedupe? Running dedupe should have reduced the number of files that are not identical, not increased it, right?

Environment is Synology, BTRFS, 130TB RAID 5

On Linux, fclones did not restore the timestamps of deduped files. This means the cache, which among other information checks the mtime, was invalidated for these entries. This should be fixed with #194.
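
To illustrate why the timestamps matter: a metadata-based hash cache can only reuse a result while the file looks unchanged. The sketch below is a simplified illustration of such a check, not fclones' actual cache code; the struct and field names are made up.

```rust
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical cache entry recorded when a file was last hashed.
struct CacheEntry {
    len: u64,
    mtime: SystemTime,
    hash: [u8; 16],
}

/// Reuse the cached hash only if size and mtime still match;
/// otherwise the file has to be read and hashed from scratch.
fn cached_hash(entry: &CacheEntry, path: &Path) -> io::Result<Option<[u8; 16]>> {
    let meta = std::fs::metadata(path)?;
    let unchanged = meta.len() == entry.len && meta.modified()? == entry.mtime;
    Ok(if unchanged { Some(entry.hash) } else { None })
}
```

Because the dedupe step rewrote the duplicate files without restoring their original mtime, every deduped file failed such a check, so the next group run had to hash them all over again.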

Great, thank you @th1000s! I imagine your internal code change is the same as using the -P option on cp, i.e. "cp --reflink=always -P"?

It is more like --preserve=timestamps / -p, or, more practically, -a (archive mode).
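
Conceptually, preserving the timestamp just means recording the original mtime before a file is replaced and setting it back afterwards, similar to what cp's -p / --preserve=timestamps does. Here is a minimal sketch using the filetime crate (an assumption for illustration; the real change in #194 may be implemented differently):

```rust
use std::io;
use std::path::Path;
use filetime::{set_file_mtime, FileTime};

/// Replace `dest` with a deduplicated (e.g. reflinked) copy while keeping
/// its original modification time, so mtime-based caches stay valid.
/// `replace_file` stands in for the actual dedup operation.
fn dedupe_preserving_mtime(
    dest: &Path,
    replace_file: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<()> {
    // Remember the mtime before the file gets replaced.
    let original = FileTime::from_last_modification_time(&std::fs::metadata(dest)?);
    // Perform the replacement (reflink copy of a duplicate over `dest`).
    replace_file(dest)?;
    // Restore the recorded mtime so the file looks unmodified afterwards.
    set_file_mtime(dest, original)
}
```

With the mtime restored, a subsequent group run sees the same metadata it cached before and can skip re-hashing the deduped files.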

This should be fixed now in 0.31.0.