pkolaczk / fclones

Efficient Duplicate File Finder

Limit deduplication to files that have not been deduplicated yet

patrickwolf opened this issue · comments

This is a feature request for the fclones dedupe feature.

Currently, on each run it creates new reflinks for files even if they have already been deduplicated. This also means that the estimates of how much space is wasted are off.

Seems like there are at least two solutions:

  1. Record in the cache that a file has been deduplicated and don't attempt it again (this could also fix the storage estimate)
  2. Check the extents of each file to verify whether they have already been deduplicated, and only attempt it again if they aren't fully deduplicated yet

For solution 2), here is an example of how it could work:

root@ubuntu1:/ex2/_Data# filefrag -v fclones.json fclones2.json
Filesystem type is: 9123683e
File size of fclones.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones.json: 2 extents found
File size of fclones2.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones2.json: 2 extents found
root@ubuntu1:/ex2/_Data#
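
As a rough sketch of how solution 2 could be automated (the function name and the parsing heuristic below are mine, not anything in fclones, and it simply shells out to `filefrag -v` rather than calling the FIEMAP ioctl directly): a file whose extents all carry the `shared` flag can be treated as already reflinked and skipped.

```rust
use std::path::Path;
use std::process::Command;

/// Returns true when every extent reported by `filefrag -v` carries the
/// `shared` flag, i.e. the file's blocks are already referenced by another
/// file and creating a new reflink would not reclaim any space.
fn all_extents_shared(path: &Path) -> std::io::Result<bool> {
    let out = Command::new("filefrag").arg("-v").arg(path).output()?;
    let text = String::from_utf8_lossy(&out.stdout);
    let mut extent_count = 0;
    for line in text.lines() {
        // Extent rows look like "   0:   0..65535: ...: 65536: shared";
        // the filesystem/size/header/summary lines don't start with a numeric index.
        let Some((idx, _)) = line.trim_start().split_once(':') else {
            continue;
        };
        if idx.trim().parse::<u64>().is_err() {
            continue;
        }
        extent_count += 1;
        if !line.contains("shared") {
            return Ok(false);
        }
    }
    Ok(extent_count > 0)
}

fn main() -> std::io::Result<()> {
    let path = Path::new("fclones.json");
    println!("{}: fully shared = {}", path.display(), all_extents_shared(path)?);
    Ok(())
}
```

One caveat: `shared` only says that some other file references the same blocks, not which one, so this check is an approximation; the cache route would give more precise accounting.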

The cache might be easier to start with, while checking the extents would be cooler :) and more future-proof.

Thanks for considering it

Using the (existing) cache is the practical approach, since it would not require adding more low-level Linux syscalls.
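
As a rough illustration of what that could look like (the names below are hypothetical, not fclones' actual cache schema), the cached per-file record would just grow a flag that is set after a successful reflink and invalidated whenever the file changes:

```rust
use std::time::SystemTime;

/// Hypothetical cache record; the real cache layout may differ.
#[derive(Debug, Clone)]
struct CachedFileInfo {
    modified: SystemTime, // used to detect that the file changed since caching
    len: u64,             // ditto: a size mismatch invalidates the entry
    hash: [u8; 16],       // content hash the cache already stores
    deduplicated: bool,   // NEW: set after a successful reflink by `fclones dedupe`
}

/// Only pass a file to the dedupe step when its cache entry is missing,
/// stale, or not yet marked as deduplicated.
fn needs_dedupe(cached: Option<&CachedFileInfo>, modified: SystemTime, len: u64) -> bool {
    match cached {
        Some(c) => c.modified != modified || c.len != len || !c.deduplicated,
        None => true,
    }
}
```

Files skipped this way could then be counted as already reclaimed in the report, which would also fix the storage estimate mentioned above.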

@pkolaczk what do you think of adding deduplication information to the cache?