Dataset creation with local storage: path substitution not working
nfzd opened this issue
Describe the bug
- Created a dataset with a local folder as output_uri.
- Changed the location of the folder.
- Added a path substitution rule, but loading the dataset does not work.
To reproduce
Contents of clearml.conf:
api {
(...)
}
In folder /home/user/clearml_path_substitution.
Contents of file create.py:
from clearml import Dataset
dataset = Dataset.create(
    dataset_project="Test",
    dataset_name="Test-PathSubs",
    output_uri="/home/user/clearml_path_substitution/storage_1")
dataset.add_files(path="./data.xml")
dataset.upload()
dataset.finalize()
Contents of file load.py:
from clearml import Dataset
dataset = Dataset.get(
    dataset_project="Test",
    dataset_name="Test-PathSubs")
Create dataset:
$ mkdir storage_1
$ python3 create.py
ClearML results page: https://(...)/output/log
ClearML dataset page: https://(...)
Uploading dataset changes (1 files compressed to 125 B) to file:///home/user/clearml_path_substitution/storage_1
File compression and upload completed: total size 125 B, 1 chunk(s) stored (average size 125 B)
(Loading it at this point by running load.py works as expected.)
Move the storage location:
$ mv storage_1 storage_2
Add the path substitution rule:
$ (...)
$ cat ~/clearml.conf
api {
(...)
}
sdk {
    storage {
        path_substitution = [
            # Replace registered links with local prefixes,
            # Solve mapping issues, and allow for external resource caching.
            {
                registered_prefix = "file:///home/user/clearml_path_substitution/storage_1"
                local_prefix = "file:///home/user/clearml_path_substitution/storage_2"
            }
        ]
    }
}
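For illustration, the rule above is expected to behave roughly like the sketch below: any link starting with the registered prefix gets that prefix swapped for the local one. This is only a sketch of the expected semantics, not ClearML's implementation; `apply_substitution` and the file name `dataset.zip` are hypothetical.

```python
def apply_substitution(url, rules):
    """Rewrite url using the first rule whose registered prefix matches.

    rules is a list of (registered_prefix, local_prefix) pairs.
    Hypothetical helper, not part of the ClearML API.
    """
    for registered_prefix, local_prefix in rules:
        if url.startswith(registered_prefix):
            return local_prefix + url[len(registered_prefix):]
    # No rule matched: leave the link untouched
    return url


rules = [(
    "file:///home/user/clearml_path_substitution/storage_1",
    "file:///home/user/clearml_path_substitution/storage_2",
)]

# "dataset.zip" is a made-up file name used only for this example
registered = "file:///home/user/clearml_path_substitution/storage_1/dataset.zip"
print(apply_substitution(registered, rules))
```

With such a rewrite in place, a link registered under storage_1 would resolve to the same relative path under storage_2.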
Try loading from the new location:
$ python3 load.py
Traceback (most recent call last):
  File "/home/user/clearml_path_substitution/load.py", line 3, in <module>
    dataset = Dataset.get(
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1778, in get
    instance = get_instance(dataset_id)
  File "/home/user/miniconda3/envs/clearml/lib/python3.10/site-packages/clearml/datasets/dataset.py", line 1690, in get_instance
    raise ValueError("Could not load Dataset id={} state".format(task.id))
ValueError: Could not load Dataset id=(...) state
Expected behaviour
Loading should be possible from the new storage location using path substitution.
Environment
- Server type: self hosted
- ClearML SDK Version: 1.14.0
- ClearML Server Version: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28
- Python Version: 3.10
- OS: Linux
Does anyone have an idea what the problem could be or how to debug the issue?
Hi @nfzd! It looks like the StorageHelper tries to access file:// links directly, without applying path substitution. If the referenced file does not exist, the program raises an error.
We will need to fix this on our side (or, if you wish to contribute, you could open a PR that handles path substitutions in clearml/clearml/storage/helper.py, line 1817 in d4e1363).
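Roughly, the failure mode and one possible shape of a fix can be sketched as follows. This is a self-contained illustration under assumptions, not the actual code in helper.py: `direct_access` and `direct_access_with_substitution` are hypothetical names, and the real driver may behave differently in detail.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def direct_access(url):
    """Return the local path behind a file:// URL, or raise if it is missing."""
    path = url2pathname(urlparse(url).path)
    if not os.path.exists(path):
        raise ValueError("Could not access {}".format(url))
    return path


def direct_access_with_substitution(url, rules):
    """Apply the first matching (registered_prefix, local_prefix) rule first."""
    for registered_prefix, local_prefix in rules:
        if url.startswith(registered_prefix):
            url = local_prefix + url[len(registered_prefix):]
            break
    return direct_access(url)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Only the moved folder exists on disk; the link still names the old one
        new_dir = Path(root, "storage_2")
        new_dir.mkdir()
        (new_dir / "data.zip").write_bytes(b"")
        old_prefix = Path(root, "storage_1").as_uri()
        old_url = old_prefix + "/data.zip"
        rules = [(old_prefix, new_dir.as_uri())]
        try:
            direct_access(old_url)
        except ValueError as exc:
            print("without substitution:", exc)
        print("with substitution:", direct_access_with_substitution(old_url, rules))
```

The point of the sketch is only the ordering: the prefix rewrite has to happen before the existence check, otherwise the stale link fails first.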
The only workaround I can think of is forcing get_direct_access to return None:
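from clearml.storage.helper import _FileStorageDriver
# Force get_direct_access to return None so file:// links are not accessed directly
# (accepting **kwargs as well, in case it is called with keyword arguments)
_FileStorageDriver.get_direct_access = lambda *args, **kwargs: None
# Loading should now work
from clearml import Dataset
d = Dataset.get("d2412eff1f7f462fb6c81065e043cd8b")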
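Since the assignment above changes the class for the whole process, it may be preferable to limit the override to the loading call. A minimal sketch using the standard library's `unittest.mock.patch.object` is shown below. `FileDriver` is a hypothetical stand-in so the example runs without ClearML; with the real library the first argument would be `_FileStorageDriver`, and the assumption is that `get_direct_access` is a regular instance method.

```python
from unittest import mock


class FileDriver:
    """Hypothetical stand-in for clearml.storage.helper._FileStorageDriver."""

    def get_direct_access(self, remote_path, **_):
        return remote_path


def no_direct_access(*args, **kwargs):
    # Always report that no direct local access is available
    return None


driver = FileDriver()

# The override is only active inside the with-block
with mock.patch.object(FileDriver, "get_direct_access", no_direct_access):
    inside = driver.get_direct_access("file:///tmp/x")

# After the block, the original method is restored
outside = driver.get_direct_access("file:///tmp/x")
print(inside, outside)
```

In real use the `Dataset.get(...)` call would go inside the with-block, so other code in the same process keeps the default behaviour.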