pkolaczk / fclones

Efficient Duplicate File Finder

Limit deduplication to files that have not been deduplicated yet

patrickwolf opened this issue · comments

This is a feature request for the fclones dedupe feature.

Currently, on each run it creates new reflinks for files even if they have already been deduplicated. This also means that the estimates of how much space is wasted are off.

Seems like there are at least two solutions:

  1. Record in the cache that a file has been deduplicated and don't attempt it again (this could also fix the storage estimate)
  2. Check the extents of each file to verify whether they have already been deduplicated, and only attempt it again if they aren't fully deduplicated yet

For solution 2), here is an example of how it could work:

root@ubuntu1:/ex2/_Data# filefrag -v fclones.json fclones2.json
Filesystem type is: 9123683e
File size of fclones.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones.json: 2 extents found
File size of fclones2.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones2.json: 2 extents found
root@ubuntu1:/ex2/_Data#
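
As a rough sketch of how solution 2 could be automated (the function name and the parsing heuristic below are mine, not anything in fclones, and it simply shells out to `filefrag -v` rather than calling the FIEMAP ioctl directly): a file whose extents all carry the `shared` flag can be treated as already reflinked and skipped.

```rust
use std::path::Path;
use std::process::Command;

/// Returns true when every extent reported by `filefrag -v` carries the
/// `shared` flag, i.e. the file's blocks are already referenced by another
/// file and creating a new reflink would not reclaim any space.
fn all_extents_shared(path: &Path) -> std::io::Result<bool> {
    let out = Command::new("filefrag").arg("-v").arg(path).output()?;
    let text = String::from_utf8_lossy(&out.stdout);
    let mut extent_count = 0;
    for line in text.lines() {
        // Extent rows look like "   0:   0..65535: ...: 65536: shared";
        // the filesystem/size/header/summary lines don't start with a numeric index.
        let Some((idx, _)) = line.trim_start().split_once(':') else {
            continue;
        };
        if idx.trim().parse::<u64>().is_err() {
            continue;
        }
        extent_count += 1;
        if !line.contains("shared") {
            return Ok(false);
        }
    }
    Ok(extent_count > 0)
}

fn main() -> std::io::Result<()> {
    let path = Path::new("fclones.json");
    println!("{}: fully shared = {}", path.display(), all_extents_shared(path)?);
    Ok(())
}
```

One caveat: `shared` only says that some other file references the same blocks, not which one, so this check is an approximation; the cache route would give more precise accounting.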

The cache might be easier to start with, while checking the extents would be cooler :) and more future-proof.

Thanks for considering it

Using the (existing) cache is the practical approach, since it would not require adding more low-level Linux syscalls.
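
As a rough illustration of what that could look like (the names below are hypothetical, not fclones' actual cache schema), the cached per-file record would just grow a flag that is set after a successful reflink and invalidated whenever the file changes:

```rust
use std::time::SystemTime;

/// Hypothetical cache record; the real cache layout may differ.
#[derive(Debug, Clone)]
struct CachedFileInfo {
    modified: SystemTime, // used to detect that the file changed since caching
    len: u64,             // ditto: a size mismatch invalidates the entry
    hash: [u8; 16],       // content hash the cache already stores
    deduplicated: bool,   // NEW: set after a successful reflink by `fclones dedupe`
}

/// Only pass a file to the dedupe step when its cache entry is missing,
/// stale, or not yet marked as deduplicated.
fn needs_dedupe(cached: Option<&CachedFileInfo>, modified: SystemTime, len: u64) -> bool {
    match cached {
        Some(c) => c.modified != modified || c.len != len || !c.deduplicated,
        None => true,
    }
}
```

Files skipped this way could then be counted as already reclaimed in the report, which would also fix the storage estimate mentioned above.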

@pkolaczk what do you think of adding deduplication information to the cache?