Feature request: use sha or md5 to hash a pathlib.Path
saludes opened this issue · comments
Assume we have a lengthy operation that depends on the content of a big file.
Hashing the path (as in pathlib.Path) will give us only a hash of the file name.
What if the file grows over time through appends?
It would be useful to have the argument hashed automatically by sha/md5 on the content, or maybe on a combination of date/size, as suggested in https://stackoverflow.com/questions/1761607/what-is-the-fastest-hash-algorithm-to-check-if-two-files-are-equal
you can use hash_params
import hashlib

def md5(args, kwargs):
    """Hash the first positional argument (a file path) by file content.

    :return: md5 hex digest string
    """
    filepath = args[0]
    hash_md5 = hashlib.md5()
    # Read in 4 KiB chunks so big files are never loaded into memory whole.
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

@cachier(hash_params=md5)
def foo(filepath):
    return 111
Yes, I know; but I think it would be a fine feature to provide this as the default for Paths.
It would definitely need to be a VERY optional feature, and not a default one, because the assumption that if someone provides the wrapped function with a Path object you're allowed to just open it and read it through (which is what hashing file content requires) is EXTREMELY problematic. Functions can do a lot of things with Path objects other than reading file content, and other parts of the program or code might even assume that they do not read them, or perform ANY kind of I/O.
So what's missing here, before this specific feature, is a feature that defines "plugins", or some other way to customize argument handling that is more fine-grained than the existing hash_params feature. For most use cases, hash_params sounds like enough.
I'm closing this because I feel it is ill-suited as a feature, but users should feel free to reopen this issue, engage in discussion, and provide suggestions.