Datasets with local storage & changing output_uri
nfzd opened this issue
Proposal Summary
Make the base output_uri of Dataset artifacts configurable, so that it can be adjusted if the storage location changes.
One possibility would be to add a kwarg to Dataset.finalize() and Dataset.get() which can rename the artifact URIs. I could send a PR for that if you agree this is a good solution.
Motivation
We need to store our datasets on a network drive. We have Linux workers and Windows users.
The network drive is mounted at some location, say /mnt/data, on the workers. This path cannot be used on Windows, where it will be something like Z:\. (We tried some hacks with network paths on Windows, but did not find a working solution.)
Windows users should be able to both create datasets and use them locally. The Linux agents also need to be able to load them.
Proposal
The clean solution IMHO would be to store on the server the path that the agents will use. This would require something like:
- On Windows, create the dataset with output_uri='Z:\'
- Run Dataset.finalize() with an extended version which can rename Z:\ to /mnt/data
- Agents: will work just fine.
- Loading the dataset on Windows: run Dataset.get() with an extended version which can rename /mnt/data back to Z:\
The extension would in both cases be something like a kwarg
    def finalize(
        ...
        output_uri_renamer: Optional[Callable] = None,
        ...
    )
which (if passed) can rename the artifact paths before saving in finalize() and before loading in get(). You would call it, in our case, with:
    dataset.finalize(
        output_uri_renamer=lambda path: path.replace("\\", "/").replace("Z:", "/mnt/data")
    )
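To make the two directions of the rename concrete, here is a minimal standalone sketch of what such renamer callables could look like. The function names, the `drive` and `mount` parameters, and their defaults are hypothetical and only mirror the Z:\ and /mnt/data example above. Nothing here is existing ClearML API, and `output_uri_renamer` itself is only the kwarg proposed in this issue.

```python
def windows_to_linux(path: str, drive: str = "Z:", mount: str = "/mnt/data") -> str:
    """Rewrite a path on the Windows network drive to the agents' mount point.

    Hypothetical helper for the proposed `output_uri_renamer` kwarg.
    Paths that are not on `drive` are returned unchanged.
    """
    normalized = path.replace("\\", "/")
    if not normalized.lower().startswith(drive.lower()):
        return path
    rest = normalized[len(drive):].lstrip("/")
    return f"{mount}/{rest}" if rest else mount


def linux_to_windows(path: str, drive: str = "Z:", mount: str = "/mnt/data") -> str:
    """Inverse mapping, for loading on Windows.

    Only matches `mount` itself or paths below it, so a sibling such as
    /mnt/database is left alone.
    """
    if path != mount and not path.startswith(mount + "/"):
        return path
    rest = path[len(mount):].lstrip("/")
    return drive + "\\" + rest.replace("/", "\\")
```

With the proposed kwarg, the first function would be passed to finalize() on Windows and the second to get() on Windows, while the Linux agents would need no renamer at all.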
Related: #747
@nfzd I think this is exactly the scenario for which path substitution was introduced, is it not?
It should simply be configured for each consumer for which the original registered URL is inadequate.
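As an illustration of the per-consumer idea, the sketch below applies a list of (registered prefix, local prefix) rules to a registered URL at read time. This is not ClearML's implementation or its configuration format; the function name, the rule tuples, and the example paths are assumptions made only to show how the registered URL can stay fixed while each consumer maps it locally.

```python
from typing import Sequence, Tuple

# (registered_prefix, local_prefix): hypothetical rule shape for illustration.
Rule = Tuple[str, str]


def resolve_for_consumer(registered_url: str, rules: Sequence[Rule]) -> str:
    """Return the URL this consumer should open for a registered URL.

    The first rule whose registered prefix matches is applied; if none
    matches, the registered URL is used as-is.
    """
    for registered_prefix, local_prefix in rules:
        if registered_url.startswith(registered_prefix):
            return local_prefix + registered_url[len(registered_prefix):]
    return registered_url


# Trailing slashes keep /mnt/data/ from also matching /mnt/database/.
WINDOWS_RULES = [("/mnt/data/", "Z:/")]
LINUX_AGENT_RULES: list = []  # the registered path is already valid on the agents

windows_path = resolve_for_consumer("/mnt/data/datasets/v1", WINDOWS_RULES)
agent_path = resolve_for_consumer("/mnt/data/datasets/v1", LINUX_AGENT_RULES)
```

The difference from the renamer proposal is where the mapping lives: the registered URL is never rewritten, and only the machines that cannot use it carry a rule.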