dgilland / hashfs

A content-addressable file management system for Python.

Home Page:http://hashfs.readthedocs.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

How to detect if a file was actually saved?

cannatag opened this issue · comments

Hi,
how can I detect if a file was actually saved with the copy() method? I mean, if two files with different names have the same content the hash is the same, so the first file is actually saved, while the second gets the same path of the first, but the save operation is not performed. In either cases the copy() method returns the address id of the file. How can I detect this? This information is helpful for to know the ratio of duplicate files.

Thanks,
Giovanni

Currently, you'd have to pre-check whether the file exists before calling put:

obj = # path or readable object
stream = hashfs.hashfs.Stream(obj)  # if you need an iterable wrapper for object
file_id = fs.computehash(stream)

if fs.exists(file_id):
    # Handle duplicate file
    filepath = fs.idpath(file_id)
    address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)
else:
    # Handle unique file
    address = fs.put(stream)

This is somewhat sub-optimal since the file hash is calculated twice if put is used in the else block. You could replace it with fs.copy but you'd only get the filepath returned so if you wanted the HashAddress in both cases, then this may work better:

file_id = fs.computehash(stream)
filepath = fs.idpath(file_id)
address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)

if fs.exists(file_id):
    # Handle duplicate file
else:
    # Handle unique file
    fs.copy(stream, file_id)

However, neither of these is ideal since you have to utilize more of the internal implementation of HashFS. I'm open to suggestions to make this less painful.

Hi Derrick, thanks for your response. Could it be possibile to return a tuple from the copy() method? It could return the id and a boolean indicating if the file was actually saved or not. Or if you don't want to modify the copy() method you could add another method (something like verify_copy() ) that does the check:

    def verify_copy(self, stream, id, extension=None):
        """Copy the contents of `stream` onto disk with an optional file
        extension appended. The copy process uses a temporary file to store the
        initial contents and then moves this file to it's final location.
        """
        filepath = self.idpath(id, extension)

        # Only copy file if it doesn't already exist.
        if not os.path.isfile(filepath):
            with tmpfile(stream, self.fmode) as fname:
                self.makepath(os.path.dirname(filepath))
                shutil.copy(fname, filepath)
            return filepath, True
        else:
            return filepath, False

Are you using copy directly? Returning whether the file exists would also need to be returned by put if copy is modified since put is meant for the public API while copy is a more lower-level utility.

Changing the return of put would mean a breaking change so I need to consider this carefully even though it's still pre-v1.

Another option would be to modify HashAddress to contain an attribute that indicates whether the file existed previously but that value would only have meaning when calling put. This would be a non-breaking change.

Seems like the options are:

  1. Modify return of put/copy to indicate whether the file existed previously.
  2. Create new methods for put/copy (perhaps verify_put/verify_copy) that return whether the file existed previously.
  3. Modify HashAddress to include an attribute that indicates previous file existence.

After some more thought, I am leaning towards #3, modifying HashAddress to include whether the file was a duplicate during a put operation.

#3 looks the right solution to me because it should not break the interface contract. I will change my code to use the put() method instead of copy. May I suggest you to change the internal methods (those that are not supposed to be called directly by external code) to _method()? This the the standard way to indicate that a method is not public.

Thanks again for your HashFS project.
Giovanni

This has been released as v0.5.0.

BTW, the new attribute for HashAddress is is_duplicate:

address = fs.put(stream)

if address.is_duplicate:
    # Handle duplicate file condition

Thanks. It works.

Giovanni