pkolaczk / fclones

Efficient Duplicate File Finder

Sort file chunks instead of hashing

verhovsky opened this issue · comments

My suggestion is to try a different algorithm. After you've grouped the files by size (step 3), instead of hashing the entire file in each group, read all the files in a group CHUNK_SIZE bytes at a time into an array and essentially sort that array. If all the chunks are the same, that's just a linear pass of comparisons over the bytes. If they're not the same, you split into new groups at that very chunk; and if a group has only one file left after this sorting + re-grouping step, you don't need to process that file any further, since it has no duplicates.

In fact you don't need a full sort; you just need identical chunks to cluster together, but computationally this is equivalent to sorting.

CHUNK_SIZE would have to be set dynamically depending on the number of files in the group and the amount of available memory. If there are too many files in a group then I guess you have to hash each file or you'll run out of memory.
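Here is a minimal sketch of that regrouping step in Rust. It clusters identical chunks with a hash map rather than literally sorting, re-opens and seeks each file per chunk for brevity, and uses a fixed CHUNK_SIZE; all the names and constants are mine, not fclones'.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

const CHUNK_SIZE: u64 = 4096; // fixed here; would be chosen dynamically

/// Read up to CHUNK_SIZE bytes of `path` starting at `offset`.
fn read_chunk(path: &PathBuf, offset: u64) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::with_capacity(CHUNK_SIZE as usize);
    file.take(CHUNK_SIZE).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Split one group of same-sized files into sub-groups whose bytes in
/// [offset, offset + CHUNK_SIZE) are identical, dropping sub-groups with a
/// single member (those files have no duplicate and need no further reads).
fn regroup_by_chunk(group: Vec<PathBuf>, offset: u64) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut by_chunk: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
    for path in group {
        let chunk = read_chunk(&path, offset)?;
        by_chunk.entry(chunk).or_default().push(path);
    }
    Ok(by_chunk.into_values().filter(|g| g.len() > 1).collect())
}

/// Walk through the files chunk by chunk; groups that survive to the end of
/// the file are groups of identical files.
fn find_duplicates(group: Vec<PathBuf>, file_len: u64) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut groups = vec![group];
    let mut offset = 0;
    while offset < file_len && !groups.is_empty() {
        let mut next = Vec::new();
        for g in groups {
            next.extend(regroup_by_chunk(g, offset)?);
        }
        groups = next;
        offset += CHUNK_SIZE;
    }
    Ok(groups)
}
```

A real implementation would keep the file handles open, pick CHUNK_SIZE from the group size and available memory as described above, and read longer sequential runs per file.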

Why I think this will be faster:

  • surely most files that differ do so within the first 1000 bytes (at least for a lot of use cases). Early stopping means you don't have to read an entire file if you can already tell it's unique from the first few bytes
  • sorting is linear time if the array is already sorted (and if all the chunks are the same, it is sorted). Comparing two registers is about 2 CPU instructions (a cmp plus a conditional jump per comparison, where the jump would almost never be taken if your first file's chunk fits in a register), whereas hashing costs a lot more CPU instructions per byte

Why it might not be faster:

  • the access pattern of reading thousands of spread-out files a few bytes at a time might be harder for the drive/SSD to predict than sequentially reading one file into memory
  • in the worst case (thousands of pairs of giant duplicate files that differ only at the very end, or where each pair differs in every chunk) you end up doing a lot of comparisons... But if you're hashing, you have to run the hashing algorithm over every single byte (one hash block at a time) of every file anyway. Surely running a hashing algorithm over every byte costs more CPU instructions than sorting one element out of an array.
  • not as easy to parallelize

Finally, what we're really trying to do is find a chunk that proves the files are different in as few file reads as possible. You could have some sort of tiny machine learning model (or just check the first chunk, the last chunk and a bunch of random chunks) that looks at the first chunk of the file and the shared file size (and possibly the file name?) and generates 3-10 guesses about which chunk might contain a difference. If all of the guesses fail to find differing data, you're already statistically pretty sure it's a duplicate. You could even take advantage of that with a flag called something like --guess or --unsafe that returns after this step, for people whose data isn't really important, but that's probably a bad idea. You obviously still have to check the duplicate files in their entirety (by sorting each file chunk, as described above) if you want 100% certainty.
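A rough sketch of that probing step, again in Rust: it checks the first chunk, the last chunk and a few pseudo-random offsets (no ML), and everything here (CHUNK_SIZE, the xorshift probe picker, the function names) is made up for illustration.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

const CHUNK_SIZE: u64 = 4096;

/// Read up to CHUNK_SIZE bytes of `path` starting at `offset`.
fn read_chunk(path: &Path, offset: u64) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::with_capacity(CHUNK_SIZE as usize);
    file.take(CHUNK_SIZE).read_to_end(&mut buf)?;
    Ok(buf)
}

/// First chunk, last chunk, and a few pseudo-random probes in between.
/// A real implementation might pick these positions more cleverly.
fn probe_offsets(file_len: u64, random_probes: u32) -> Vec<u64> {
    let mut offsets = vec![0];
    if file_len > CHUNK_SIZE {
        offsets.push(file_len - CHUNK_SIZE); // last chunk
        let mut state = file_len | 1; // tiny xorshift PRNG, no external crates
        for _ in 0..random_probes {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            offsets.push(state % (file_len - CHUNK_SIZE));
        }
    }
    offsets.sort_unstable();
    offsets.dedup();
    offsets
}

/// Compare two same-sized files only at the probe offsets. `true` means no
/// probe found a difference; a full comparison is still needed for certainty.
fn probably_equal(a: &Path, b: &Path, file_len: u64) -> io::Result<bool> {
    for offset in probe_offsets(file_len, 5) {
        if read_chunk(a, offset)? != read_chunk(b, offset)? {
            return Ok(false);
        }
    }
    Ok(true)
}
```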

I haven't read any of your code or used your tool (I just use fdupes), so apologies if you already do early stopping.

Actually, reading https://news.ycombinator.com/item?id=28360866 and https://pkolaczk.github.io/disk-access-ordering/, you clearly know about this idea and have thought about it more than me. Still, I could imagine a system that does both efficient sequential reading and early stopping, if you checkpoint the hash state and then periodically compare the hashes. You'd keep track that at file chunk 23 the hash state of each file in the group was <whatever>, and whenever all files in the group have been read up to that chunk you could compare the hashes so far and prune the tree / stop reading unique files as necessary. That's more unnecessary comparisons when the files really are duplicates, but it would be worth it for large files.
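A minimal sketch of that checkpointing idea, assuming the group already contains only same-sized files and using std's DefaultHasher as a stand-in for whatever hash fclones actually uses (the constants and names are mine):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::PathBuf;

// How much of each file to read between checkpoints: large enough that the
// drive still sees long sequential reads, small enough that unique files can
// be dropped early. (Both constants are made up for the sketch.)
const CHECKPOINT_BYTES: u64 = 16 * 1024 * 1024;
const READ_BUF: usize = 1 << 20;

struct Candidate {
    path: PathBuf,
    file: File,
    hasher: DefaultHasher, // stands in for whichever hash fclones really uses
}

/// Read a group of same-sized files in rounds of CHECKPOINT_BYTES per file.
/// After each round, split the group by the running hash value and drop
/// sub-groups with a single member, so unique files are not read to the end.
fn find_duplicates(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut group = Vec::new();
    for path in paths {
        group.push(Candidate {
            file: File::open(&path)?,
            path,
            hasher: DefaultHasher::new(),
        });
    }
    let mut pending = vec![group];
    let mut duplicates = Vec::new();
    let mut buf = vec![0u8; READ_BUF];

    while let Some(group) = pending.pop() {
        let mut by_hash: HashMap<u64, Vec<Candidate>> = HashMap::new();
        let mut at_eof = false;
        for mut cand in group {
            // Hash the next CHECKPOINT_BYTES of this file sequentially.
            let mut remaining = CHECKPOINT_BYTES;
            while remaining > 0 {
                let want = remaining.min(buf.len() as u64) as usize;
                let n = cand.file.read(&mut buf[..want])?;
                if n == 0 {
                    at_eof = true; // same-sized files all end in the same round
                    break;
                }
                cand.hasher.write(&buf[..n]);
                remaining -= n as u64;
            }
            // finish() does not consume the hasher state, so this is a cheap
            // checkpoint of "the hash of everything read so far".
            by_hash.entry(cand.hasher.finish()).or_default().push(cand);
        }
        for (_, sub) in by_hash {
            if sub.len() < 2 {
                continue; // unique so far: stop reading this file
            } else if at_eof {
                duplicates.push(sub.into_iter().map(|c| c.path).collect());
            } else {
                pending.push(sub);
            }
        }
    }
    Ok(duplicates)
}
```

Since finish() on a std Hasher doesn't reset its state, each checkpoint is just the hash of everything read so far, and the pruning falls out of the regrouping after each round.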