How to detect if a file was actually saved?
cannatag opened this issue · comments
Hi,
how can I detect if a file was actually saved with the copy() method? I mean, if two files with different names have the same content the hash is the same, so the first file is actually saved, while the second gets the same path of the first, but the save operation is not performed. In either cases the copy() method returns the address id of the file. How can I detect this? This information is helpful for to know the ratio of duplicate files.
Thanks,
Giovanni
Currently, you'd have to pre-check whether the file exists before calling put
:
obj = # path or readable object
stream = hashfs.hashfs.Stream(obj) # if you need an iterable wrapper for object
file_id = fs.computehash(stream)
if fs.exists(file_id):
# Handle duplicate file
filepath = fs.idpath(file_id)
address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)
else:
# Handle unique file
address = fs.put(stream)
This is somewhat sub-optimal since the file hash is calculated twice if put
is used in the else
block. You could replace it with fs.copy
but you'd only get the filepath returned so if you wanted the HashAddress
in both cases, then this may work better:
file_id = fs.computehash(stream)
filepath = fs.idpath(file_id)
address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)
if fs.exists(file_id):
# Handle duplicate file
else:
# Handle unique file
fs.copy(stream, file_id)
However, neither of these is ideal since you have to utilize more of the internal implementation of HashFS
. I'm open to suggestions to make this less painful.
Hi Derrick, thanks for your response. Could it be possibile to return a tuple from the copy() method? It could return the id and a boolean indicating if the file was actually saved or not. Or if you don't want to modify the copy() method you could add another method (something like verify_copy() ) that does the check:
def verify_copy(self, stream, id, extension=None):
"""Copy the contents of `stream` onto disk with an optional file
extension appended. The copy process uses a temporary file to store the
initial contents and then moves this file to it's final location.
"""
filepath = self.idpath(id, extension)
# Only copy file if it doesn't already exist.
if not os.path.isfile(filepath):
with tmpfile(stream, self.fmode) as fname:
self.makepath(os.path.dirname(filepath))
shutil.copy(fname, filepath)
return filepath, True
else:
return filepath, False
Are you using copy
directly? Returning whether the file exists would also need to be returned by put
if copy
is modified since put
is meant for the public API while copy
is a more lower-level utility.
Changing the return of put
would mean a breaking change so I need to consider this carefully even though it's still pre-v1.
Another option would be to modify HashAddress
to contain an attribute that indicates whether the file existed previously but that value would only have meaning when calling put
. This would be a non-breaking change.
Seems like the options are:
- Modify return of
put
/copy
to indicate whether the file existed previously. - Create new methods for
put
/copy
(perhapsverify_put
/verify_copy
) that return whether the file existed previously. - Modify
HashAddress
to include an attribute that indicates previous file existence.
After some more thought, I am leaning towards #3
, modifying HashAddress
to include whether the file was a duplicate during a put
operation.
#3 looks the right solution to me because it should not break the interface contract. I will change my code to use the put() method instead of copy. May I suggest you to change the internal methods (those that are not supposed to be called directly by external code) to _method()? This the the standard way to indicate that a method is not public.
Thanks again for your HashFS project.
Giovanni
This has been released as v0.5.0.
BTW, the new attribute for HashAddress
is is_duplicate
:
address = fs.put(stream)
if address.is_duplicate:
# Handle duplicate file condition
Thanks. It works.
Giovanni