DigitalSlideArchive / digital_slide_archive

The official deployment of the Digital Slide Archive and HistomicsTK.

Home Page: https://digitalslidearchive.github.io

Repository from Github https://github.com/DigitalSlideArchive/digital_slide_archive

Assetstore Import Tracker / Repeater

manthey opened this issue · comments

This is a summary of a long-desired feature. Once a repo is created for such a feature, any issues related to it should be moved there (e.g., #193).

We'd like to have a Girder plugin that records when any Import action is done on an assetstore. This would record all of the options: path, destination, etc. for arbitrary assetstore types (probably by hooking the import endpoint event), plus the time that the import started.

We want to show a list of import actions, sorted most-recent first, with appropriate details and a button to repeat the import exactly as done before. This list would be accessible from a button somewhere on the assetstore list page and would probably need to be paged. For repeated imports with exactly the same options and assetstore, maybe instead of showing each import as a separate line, it would show a "number of times" and the most recent time? In the list, we want to show sensible names, not just Girder ids, for collections and folders.
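As a rough illustration of the "number of times" idea, here is a minimal Python sketch that collapses import records with the same assetstore and options into one row. The record layout (`assetstoreId`, `params`, `started`) and the function name are assumptions made for this example, not an existing Girder or plugin schema.

```python
import json
from datetime import datetime


def group_imports(records):
    """Collapse imports that share an assetstore and identical options.

    Each record is assumed (hypothetically) to look like
    {'assetstoreId': str, 'params': dict, 'started': datetime}.
    Returns one row per distinct (assetstore, options) pair, most recent first.
    """
    groups = {}
    for rec in records:
        # Serialize the options with sorted keys so that key order
        # does not make two identical imports look different.
        key = (rec['assetstoreId'], json.dumps(rec['params'], sort_keys=True))
        row = groups.get(key)
        if row is None:
            groups[key] = {
                'assetstoreId': rec['assetstoreId'],
                'params': rec['params'],
                'count': 1,
                'lastStarted': rec['started'],
            }
        else:
            row['count'] += 1
            row['lastStarted'] = max(row['lastStarted'], rec['started'])
    return sorted(groups.values(), key=lambda r: r['lastStarted'], reverse=True)


if __name__ == '__main__':
    demo = [
        {'assetstoreId': 'a1', 'params': {'importPath': '/data'},
         'started': datetime(2022, 3, 1)},
        {'assetstoreId': 'a1', 'params': {'importPath': '/data'},
         'started': datetime(2022, 3, 3)},
    ]
    for row in group_imports(demo):
        print(row['count'], row['lastStarted'].isoformat(), row['params'])
```

In a real plugin the grouping would more likely be done in the database query so that paging works on the grouped rows; the in-memory version above only shows the intended behavior.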

As a bonus, it would be great if when we went to an assetstore import page we showed the last few (10?) imports that were done for that assetstore, so that the user could redo them or see how they wanted to do something differently.

A further feature would be optionally modifying how repeated imports are done: currently, if a file doesn't exist in the expected target directory, it is created. We frequently import a directory tree of files, then organize them in Girder so they are no longer conceptually in the original directory tree. Reimporting makes duplicates of all of these files. It would be great if there were an option in import to "skip if file already is in Girder somewhere" -- this can be done by matching the import path. If the file size has changed, we would update the existing file. A more sophisticated method would be to use the computed hash and match on that -- the file might have been renamed either on the assetstore OR in Girder, and, if the hash matches, it would be nice to not have a duplicate. This would be slower, as the hash has to be computed.
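The decision described above could be sketched roughly as follows. This is only an outline of the logic under stated assumptions: `known_by_path` and `known_hashes` stand in for lookups against files already indexed in Girder, and all function names and the choice of SHA-512 are hypothetical, not part of any existing API.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute a SHA-512 hex digest by streaming the file in chunks."""
    digest = hashlib.sha512()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def plan_import(path, size, known_by_path, known_hashes=None, hasher=sha512_of_file):
    """Decide what a repeated import should do with one file.

    known_by_path: dict mapping import path -> {'size': int} for files
        already indexed (hypothetical stand-in for a database query).
    known_hashes: optional set of hashes of already indexed files; when
        given, a file whose hash matches is skipped even if it was renamed.
    Returns 'skip', 'update', or 'create'.
    """
    existing = known_by_path.get(path)
    if existing is not None:
        # Same import path: only touch the file if its size changed.
        return 'skip' if existing['size'] == size else 'update'
    if known_hashes is not None and hasher(path) in known_hashes:
        # Slower check: same content is already indexed under another name.
        return 'skip'
    return 'create'
```

The path check is tried first because it is cheap; the hash is only computed for files whose import path is not already known, which limits the extra cost to files that are new or have been renamed.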

It would be nice to have a feature to flag any file in Girder that is no longer available on an assetstore. For filesystem assetstores, this would confirm the path is reachable. For S3 assetstores, this would have to confirm the asset is still in the bucket (so would probably be slow). If we did this, we would probably want to show a list of such files (or only such files on a specific assetstore, or only such files from a specific import path) and then have an option to delete associated Girder items (and probably prune empty Girder folders, too).

@dgutman Did I miss anything in our desired feature list here? I recognize that you would like a cron-like task to repeat imports at some point. I think we need hash-matching for that to actually do what we want, and I think it is too risky to ever automate deleting missing items. If we ever cron imports, then we should probably cron checking for missing files and report that somewhere (next to the imports list, maybe?) so that the admin can decide what to do.

Ages ago I was involved in a project where we automatically added and removed files from a database as they came and went on NAS-like devices. Devices with intermittent availability (for instance, across any network) made auto removal very risky.

The import endpoint supports include/exclude RegEx. We don't expose that in the UI (we probably should).
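For readers unfamiliar with include/exclude filtering, here is a small Python sketch of the general idea. It does not call Girder, and the function name, parameter names, and the use of `re.search` (rather than anchored matching) are assumptions for illustration; the actual endpoint's matching semantics may differ.

```python
import re


def select_names(names, include=None, exclude=None):
    """Keep names that match `include` (if given) and do not match `exclude`."""
    selected = []
    for name in names:
        if include and not re.search(include, name):
            continue
        if exclude and re.search(exclude, name):
            continue
        selected.append(name)
    return selected


if __name__ == '__main__':
    files = ['a.svs', 'b.tmp', 'c.svs.bak', 'd.ndpi']
    print(select_names(files, include=r'\.(svs|ndpi)$'))
```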

It sounds like when we check for missing files, we would just add some chunk of metadata to the file (and possibly to its parent item) that we could remove again if the file comes back. Then showing missing ones could trivially be done by a virtual folder that matches on that metadata. Since the check for something being present/missing is likely to be stale when we actually try to access something, then any actions we take that expect that flag to be one way or another would have to check again.
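A minimal sketch of that flagging idea for a filesystem path, in Python. The metadata key `importTrackerMissing`, the document layout (`path`, `meta`), and the function name are all hypothetical; an S3 or GridFS check would need a different `exists` callable.

```python
import os
from datetime import datetime, timezone

# Hypothetical metadata key; a virtual folder could match on its presence.
MISSING_KEY = 'importTrackerMissing'


def update_missing_flag(file_doc, exists=os.path.exists):
    """Set or clear the 'missing' metadata on a file document.

    file_doc is assumed to be a dict with a 'path' and an optional 'meta'
    dict. Returns True if the file is currently missing, False otherwise.
    """
    meta = file_doc.setdefault('meta', {})
    if exists(file_doc['path']):
        # The file is back (or never left): remove any stale flag.
        meta.pop(MISSING_KEY, None)
        return False
    # Keep the original timestamp if the file was already flagged.
    meta.setdefault(MISSING_KEY, {'since': datetime.now(timezone.utc).isoformat()})
    return True
```

Because the stored flag may be stale by the time anything acts on it, a delete or prune action would call a check like this again immediately before proceeding and not rely on the saved metadata alone.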

Throwing errors when a file is missing is outside the scope of this plugin (and probably differs in the Girder interface versus the HistomicsUI interface). Let's address what we want to do about that in a different issue.

I don't know enough about the Girder implementation to know whether this is sufficiently relevant, but just in case it is ... rsync handles both the check hash and check file size options, and it can avoid re-transferring something that has been temporarily absent via its --link-dest flag. The command line from a source directory to its newest copy looks in several previous copies via something like:

rsync -a farway:MySource/ 2022-03-01/ --link-dest=2022-02-28/ --link-dest=2022-02-27/ --link-dest=2022-02-26/

Although 2022-03-01/ in this example should start as an empty directory, only the files that have changed will be copied there. The rest of the files will be there too but they will be hard links from one of the correspondingly located files in the directories listed via --link-dest, assuming they match the hash code and timestamp. Additionally, because these links are hard links, we can delete old dates without losing a file that is also present in a more modern date's directory.

N.B. the last I checked, which was about 10 years ago, there was a limit of, maybe, 20 --link-dest directories. Also, I don't recall what the defaults are for rsync checking both the timestamp and the hashcode; it may be necessary to turn on those checks explicitly.

@Leengit We aren't copying anything in this -- we are just indexing files that exist somewhere -- it could be a filesystem or an S3 bucket or a GridFS server, etc. "Import" is an indexing operation, not a copy operation.

@AlmightyYakob We should move the individual parts of this task to issues on https://github.com/DigitalSlideArchive/import-tracker.

I've moved all the details from this issue to separate issues in https://github.com/DigitalSlideArchive/import-tracker, so I'm closing this issue.