Feature request: use sha or md5 to hash a pathlib.Path
saludes opened this issue · comments
Assume we have a lengthy operation that depends on the content of a big file.
Hashing the path (as in pathlib.Path) will give us only a hash of the file name.
What if the file grows over time through appends?
It would be useful to have the argument hashed automatically by sha/md5 on the content, or maybe on a combination of date/size, as suggested in https://stackoverflow.com/questions/1761607/what-is-the-fastest-hash-algorithm-to-check-if-two-files-are-equal
you can use hash_params
import hashlib

def md5(args, kwargs):
    """Hash the first positional argument (a file path) by file content.

    :return: md5 hex digest string
    """
    filepath = args[0]
    hash_md5 = hashlib.md5()
    # Read in 4 KiB chunks so big files are never loaded into memory whole.
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

@cachier(hash_params=md5)
def foo(filepath):
    return 111
Yes, I know; but I think it would be a fine feature to provide this as the default for Paths.
It would definitely need to be a VERY optional feature, and not a default one, because the assumption that if someone provides the wrapped function with a Path object you're allowed to just open it and read it through (which is what hashing file content requires) is EXTREMELY problematic. Functions can do a lot of things with Path objects other than reading file content, and other parts of the program or code might even assume that they do not read them, or perform ANY kind of I/O.
So what's missing here, before this specific feature, is a feature that defines "plugins", or some other way to customize argument handling that is more fine-grained than the existing hash_params feature. For most use cases, hash_params sounds like enough.
I'm closing this because I feel it is ill-suited as a feature, but users should feel free to reopen this issue, engage in discussion, and provide suggestions.