allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Dataset creation with local storage: path substitution not working

nfzd opened this issue

Describe the bug

  • Created a dataset with a local folder as output_uri.
  • Changed the location of the folder.
  • Added a path substitution rule, but loading the dataset does not work.

To reproduce

Contents of clearml.conf:

api {
    (...)
}

All files below are in the folder /home/user/clearml_path_substitution.

Contents of file create.py:

from clearml import Dataset

dataset = Dataset.create(
    dataset_project="Test",
    dataset_name="Test-PathSubs",
    output_uri="/home/user/clearml_path_substitution/storage_1")

dataset.add_files(path="./data.xml")
dataset.upload()
dataset.finalize()

Contents of file load.py:

from clearml import Dataset

dataset = Dataset.get(
    dataset_project="Test",
    dataset_name="Test-PathSubs")

Create dataset:

$ mkdir storage_1
$ python3 create.py
ClearML results page: https://(...)/output/log
ClearML dataset page: https://(...)
Uploading dataset changes (1 files compressed to 125 B) to file:///home/user/clearml_path_substitution/storage_1
File compression and upload completed: total size 125 B, 1 chunk(s) stored (average size 125 B)

(Loading it at this point by running load.py works as expected.)

Move the storage location:

$ mv storage_1 storage_2

Add the path substitution rule:

$ (...)
$ cat ~/clearml.conf
api {
    (...)
}
sdk {
    storage {
        path_substitution = [
            # Replace registered links with local prefixes,
            # Solve mapping issues, and allow for external resource caching.
            {
                registered_prefix = "file:///home/user/clearml_path_substitution/storage_1"
                local_prefix = "file:///home/user/clearml_path_substitution/storage_2"
            }
        ]
    }
}

Try loading from the new location:

$ python3 load.py
Traceback (most recent call last):
  File "/home/user/clearml_path_substitution/load.py", line 3, in <module>
    dataset = Dataset.get(
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1778, in get
    instance = get_instance(dataset_id)
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1690, in get_instance
    raise ValueError("Could not load Dataset id={} state".format(task.id))
ValueError: Could not load Dataset id=(...) state

Expected behaviour

Loading should be possible from the new storage location using path substitution.

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.14.0
  • ClearML Server Version: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28
  • Python Version: 3.10
  • OS: Linux

Does anyone have an idea what the problem could be or how to debug the issue?

Hi @nfzd! It looks like the StorageHelper tries to access file:// links directly, without applying path substitution, so if the referenced file does not exist at the registered location, the program raises an error.
We will need to fix this on our side (or, if you wish to contribute, you could open a PR that handles path substitutions in get_direct_access(self, remote_path, **_)).
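To make the idea concrete, here is a minimal, standalone sketch of what applying the substitution rules before direct access could look like. It is not ClearML's implementation: the function names substitute_prefix and direct_access_path and the rule dictionaries are hypothetical, chosen to mirror the registered_prefix / local_prefix entries in clearml.conf, and the plain string-prefix match is a simplifying assumption.

```python
import os
from urllib.parse import urlparse, unquote


def substitute_prefix(url, rules):
    """Swap the first matching registered_prefix for its local_prefix.

    `rules` is a hypothetical list of dicts with the keys
    "registered_prefix" and "local_prefix" (same names as in clearml.conf).
    Matching is a plain string-prefix test, which is an assumption.
    """
    for rule in rules:
        registered = rule["registered_prefix"]
        if url.startswith(registered):
            return rule["local_prefix"] + url[len(registered):]
    # No rule matched: leave the URL untouched.
    return url


def direct_access_path(url, rules):
    """Return an existing local path for a file:// URL, or None.

    The substitution is applied first, so a file that was moved is looked
    up at its new location instead of the registered one.
    """
    local_url = substitute_prefix(url, rules)
    parsed = urlparse(local_url)
    if parsed.scheme != "file":
        return None
    path = unquote(parsed.path)
    return path if os.path.exists(path) else None
```

With the rule from this issue, a link under .../storage_1 would be rewritten to .../storage_2 before the existence check. Returning None for a missing file, rather than raising, matches the workaround's behaviour of signalling that direct access is unavailable.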

The only workaround I can think of is forcing get_direct_access to return None:

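from clearml import Dataset
from clearml.storage.helper import _FileStorageDriver

# Force direct access to report "not available" so file:// links are no
# longer opened directly. Keyword arguments are accepted as well, since
# get_direct_access is declared with **_.
_FileStorageDriver.get_direct_access = lambda *args, **kwargs: None

# Loading should now work
d = Dataset.get("d2412eff1f7f462fb6c81065e043cd8b")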
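If patching the class globally is undesirable, the same override can be scoped with a small context manager that restores the original method afterwards. This is only a sketch: direct_access_disabled is a hypothetical helper, not part of ClearML, and it assumes the driver class exposes get_direct_access as in the workaround above.

```python
from contextlib import contextmanager


@contextmanager
def direct_access_disabled(driver_cls):
    """Temporarily make driver_cls.get_direct_access return None.

    `driver_cls` is any class with a get_direct_access method, for example
    _FileStorageDriver from clearml.storage.helper (an assumption based on
    the workaround above).
    """
    original = driver_cls.get_direct_access
    driver_cls.get_direct_access = lambda *args, **kwargs: None
    try:
        yield
    finally:
        # Restore the original method even if the body raised.
        driver_cls.get_direct_access = original
```

It would be used as "with direct_access_disabled(_FileStorageDriver): d = Dataset.get(...)". Any later call that resolves file:// links (fetching a local copy, for instance) would presumably need to run inside the same block, since the original behaviour returns once the block exits.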