allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Datasets with local storage & changing output_uri

nfzd opened this issue · comments

Proposal Summary

Make the base output_uri of Dataset artifacts configurable somehow if the storage location changes.

One possibility would be to add a kwarg to Dataset.finalize() and Dataset.get() which can rename the artifact URIs. I could send a PR for that if you agree this is a good solution?

Motivation

We need to store our datasets on a network drive. We also have Linux workers and users with Windows.

The network drive is mounted at some location on the workers, say /mnt/data. This path cannot be used on Windows, where the drive appears as something like Z:\. (We tried some hacks with network paths on Windows, but did not find a working solution.)

Windows users should be able to both create datasets and use them locally. The Linux agents also need to be able to load them.

Proposal

The clean solution IMHO would be to store on the server the path that the agents will use. This would require something like:

  1. On Windows, create the dataset, use output_uri='Z:\'
  2. Run Dataset.finalize() with an extended version which can rename Z:\ to /mnt/data
  3. Agents: will work just fine.
  4. Loading the dataset on windows: run Dataset.get() with an extended version which can rename /mnt/data back to Z:\

The extension would in both cases be something like a kwarg

def finalize(
    ...
    output_uri_renamer: Optional[Callable[[str], str]] = None,
    ...
)

which (if passed) can rename the artifact paths before saving in finalize() and before loading in get(). You would call it, in our case, with:

dataset.finalize(
    # Normalize backslashes first, then map the drive letter to the Linux mount point.
    output_uri_renamer=lambda path: path.replace("\\", "/").replace("Z:", "/mnt/data")
)
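The lambda above covers only the Windows-to-Linux direction used in finalize(). A minimal sketch of both directions, including the reverse mapping that get() would need on Windows, might look like the following. The function names and constants are hypothetical, and the output_uri_renamer kwarg itself is only the proposal from this issue, not an existing ClearML parameter.

```python
# Hypothetical helpers for the proposed renamer callables.
# Nothing here is part of ClearML; paths are the example values from this issue.

WINDOWS_ROOT = "Z:"       # drive letter of the network share on Windows
LINUX_ROOT = "/mnt/data"  # mount point of the same share on the Linux workers


def to_agent_path(path: str) -> str:
    """Rewrite a Windows path on the share to the path the Linux agents use."""
    posix = path.replace("\\", "/")
    # Naive prefix check: assumes every dataset path lives on the Z: share.
    if posix.startswith(WINDOWS_ROOT):
        posix = LINUX_ROOT + posix[len(WINDOWS_ROOT):]
    return posix


def to_windows_path(path: str) -> str:
    """Rewrite a Linux path on the share back to the Windows drive path."""
    # Naive prefix check: a sibling directory such as /mnt/database would also match.
    if path.startswith(LINUX_ROOT):
        path = WINDOWS_ROOT + path[len(LINUX_ROOT):]
    return path.replace("/", "\\")
```

Under the proposal, to_agent_path would be passed to finalize() and to_windows_path to get() on Windows machines, while the Linux agents would need no renamer at all.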

Related: #747

@nfzd I think this is exactly the scenario for which path substitution was introduced, is it not?

It should simply be configured for each consumer for which the original registered URL is inadequate.
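To illustrate the idea only: path substitution amounts to a per-consumer table that maps the prefix registered on the server to a prefix that is valid locally. The sketch below is plain Python and is not ClearML's implementation; the function name and the table are assumptions, and the actual setting lives in each consumer's ClearML configuration file, so the exact keys should be taken from the ClearML configuration reference.

```python
from typing import List, Tuple

# Illustrative (registered_prefix, local_prefix) pairs for one Windows consumer.
# The values are the example paths from this issue, not real configuration.
SUBSTITUTIONS: List[Tuple[str, str]] = [
    ("/mnt/data", "Z:"),
]


def resolve_local(registered_url: str,
                  substitutions: List[Tuple[str, str]] = SUBSTITUTIONS) -> str:
    """Return the locally usable form of a URL registered on the server."""
    for registered_prefix, local_prefix in substitutions:
        if registered_url.startswith(registered_prefix):
            # Swap only the prefix; the rest of the path is kept unchanged.
            return local_prefix + registered_url[len(registered_prefix):]
    # No rule matched: the registered URL is already adequate for this consumer.
    return registered_url
```

With this approach the dataset is registered once with the path the agents use, and only the Windows machines carry a substitution rule; the Linux agents need none.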

@ainoam Ah, nice. I was not aware of that.