cboettig / contentid

:package: R package for working with Content Identifiers

Home Page: http://cboettig.github.io/contentid

Preserve original file name?

joelnitta opened this issue

Would it be possible to add an option to resolve() to preserve the original filename? Sometimes (for better or worse) there is useful metadata in the filename that one may want. My specific use-case is verifying the date of files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/, which have the date as part of the name.

I'd say that keeping track of the provenance of data (including their original filename and method used to retrieve the content) is key to helping to preserve the content, especially when the provenance logs are treated as content themselves. To me, the question "Give me content with content id hash://sha256/abc...?" goes hand in hand with the question "What is the origin of content with id hash://sha256/abc... ?"

Thanks @joelnitta , very sympathetic to this.

One wrinkle to keep in mind is that the same content can often have multiple filenames -- e.g. I often see this in cases where flatfile snapshots are produced periodically with timestamps in the filename. From one snapshot to the next, the data is sometimes unchanged, and thus the hash is unchanged even though the filename differs.
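To make the point concrete, here is a minimal Python sketch (not contentid's own code, which is R) that hashes two files with identical bytes but different timestamped names. The `content_id` and `demo` helpers and the snapshot filenames are hypothetical, made up for illustration; only the `hash://sha256/...` form comes from this thread.

```python
import hashlib
import tempfile
from pathlib import Path


def content_id(path: Path) -> str:
    """Return a hash URI of the form hash://sha256/<hex digest> for a file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return f"hash://sha256/{digest.hexdigest()}"


def demo() -> dict:
    """Write the same bytes under two hypothetical snapshot names and hash both."""
    ids = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("snapshot_2021-01-01.csv", "snapshot_2021-02-01.csv"):
            path = Path(tmp) / name
            path.write_bytes(b"id,name\n1,Homo sapiens\n")
            ids[name] = content_id(path)
    return ids


if __name__ == "__main__":
    for name, cid in demo().items():
        print(name, cid)
```

Both names map to the same identifier, so a resolver keyed only on the hash has no single "original" filename to hand back.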

I second @jhpoelen that this is really an issue of broader metadata management -- e.g. file size, created/modified timestamps, file content type, etc, are also often essential -- so it is unsurprising that POSIX filesystems, web headers, object stores, and common metadata formats all frequently try to capture this information too.

My current preferred solution is to write JSON metadata files, like https://github.com/ropensci/taxadb/blob/master/inst/extdata/schema.json or https://github.com/ropensci/rfishbase/blob/master/inst/prov/fb.prov, that record the association of a filename and a hash (along with any other metadata you might want to add) using the schema.org spec. Given that there are so many standardized metadata formats for this, with software ecosystems built around them, I'm reluctant to invent a new one in contentid that only we use. But I also know that not all users will like my current preference for schema.org (probably including me n-years in the future or n-years in the past!).
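As a rough illustration of that idea, the Python sketch below writes a small schema.org-style JSON record tying a filename to a content hash. It does not reproduce the layout of the linked `schema.json` or `fb.prov` files; the `describe` helper, the example filename, and the choice of `DataDownload` with `name`, `identifier`, and `contentSize` are assumptions made for this example.

```python
import hashlib
import json


def describe(file_name: str, content: bytes) -> dict:
    """Build a schema.org-style record linking a filename to a content hash.

    The property choices here are an assumption for illustration,
    not a layout prescribed by contentid.
    """
    digest = hashlib.sha256(content).hexdigest()
    return {
        "@context": "https://schema.org/",
        "@type": "DataDownload",
        "name": file_name,  # the original filename, kept as metadata
        "identifier": f"hash://sha256/{digest}",
        "contentSize": str(len(content)),  # size in bytes, as text
    }


# Hypothetical filename and payload, for illustration only.
record = describe("taxdmp_example.zip", b"example bytes")

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

Because the filename lives in a sidecar record rather than in the resolver, the same hash can appear in several records with different names.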

I fully agree that 1) we need this metadata to interpret things, and 2) we should reuse and leverage existing metadata schemes for it. When objects are registered with DataONE, we support many metadata dialects. We've discussed this before in other contentid issues, and I wrote up a summary of areas for improvement in contentid in the context of an intro tutorial to the concepts for researchers. One I proposed is supporting less opaque metadata about objects to make them easier to use, such as names and fileNames. I also show an example of how to generate a citation for a contentid object that is stored in DataONE. I'd love to work out standard approaches to that metadata access across systems.

Note that Preston uses the Provenance Ontology (PROV) and the Provenance, Authoring and Versioning (PAV) ontology to (automatically) keep track of the content origin. And the methods used to link provenance to content using . . . drumroll . . . content ids . . . can use any kind of metadata format as long as it is digital.

General concept described in pre-print

Elliott, M. J., Poelen, J. H., & Fortes, J. (2022, August 29). Signed Citations: Making Persistent and Verifiable Citations of Digital Scientific Content. https://doi.org/10.31222/osf.io/wycjn

and in the attached uncorrected proof of a Scientific Data paper.

I see data and metadata as one thing, and realize that they are linked, and should be linked in a verifiable way.

proof_41597_2023_2230_OnlinePDF.pdf

Curious to see how this scheme would integrate with the scheme you are proposing.

Coming back to this... sounds like a great paper @jhpoelen, I'll take a look.

I think the major challenge with file naming is that often there is more than one that is legit. Within DataONE, each contentid can be associated with more than one schema:Dataset, sometimes from different people, and have different metadata in each. We frequently find the same csv file included with different file names in different datasets -- we can tell they are the same due to the hash match, but it might be named and arranged very differently by different people. A rough model of the relationships we frequently see is:

```mermaid
classDiagram
    Dataset "*" --o "0..*" DataObject
    DataObject "*" --> "1" Sha256ContentId : has
    class DataObject {
        +String PID
        +String fileName
        +String filePath
    }
    DataObject "*" --> "1" Sha512ContentId : has
    DataObject "*" --> "1" MD5ContentId : has
    Sha256ContentId --|> ContentId : is a
    Sha512ContentId --|> ContentId : is a
    MD5ContentId --|> ContentId : is a
```

where the PID is an authority-based identifier (such as a UUID or DOI) and the fileName and path are frequently specific to a particular dataset arrangement.
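The relationships in the diagram above can be sketched in Python as follows. This is not DataONE's API or data model code; the class names mirror the diagram, while `names_by_content_id`, the PIDs, the filenames, and the placeholder hash are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataObject:
    pid: str            # authority-based identifier (e.g. a UUID or DOI)
    file_name: str      # name used within one dataset arrangement
    file_path: str      # path used within one dataset arrangement
    content_ids: tuple  # hash URIs (sha256, sha512, md5, ...) for the same bytes


@dataclass
class Dataset:
    title: str
    objects: list = field(default_factory=list)


def names_by_content_id(datasets) -> dict:
    """Map each content id to every filename it appears under."""
    index = defaultdict(set)
    for dataset in datasets:
        for obj in dataset.objects:
            for cid in obj.content_ids:
                index[cid].add(obj.file_name)
    return dict(index)


shared = "hash://sha256/abc"  # placeholder, not a real digest
a = Dataset("Dataset A", [DataObject("urn:uuid:1111", "sites.csv", "data/sites.csv", (shared,))])
b = Dataset("Dataset B", [DataObject("urn:uuid:2222", "site_list_v2.csv", "raw/site_list_v2.csv", (shared,))])
index = names_by_content_id([a, b])

if __name__ == "__main__":
    print(sorted(index[shared]))
```

The lookup returns a set of names per content id rather than a single name, which is the crux of the filename question in this issue.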

I wrote up a reproducible data access tutorial on this stuff for a course we taught in 2021 -- and included in it an approach that I could see being fruitful: providing metadata descriptions based on contentid values. The example I give in the tutorial is generating the citation for a dataset (e.g., for credit) from a specific contentid that was referenced in a script.