Local data sync into clearml-data
nikiniki1 opened this issue · comments
Hi!
I'm going to use clearml data like this:
- I have a dataset of roughly 700 GB. When I want to solve a problem, I select a subsample from it and use that as train/test data, so I only feed in a txt file with the paths (data_path) of the subsample.
- So, when I use clearml, I have to initialize dataset = Dataset() and then call dataset.sync_folder(). But if I do it this way, clearml will chunk my data and upload it to file storage, so I end up with duplicates of the data.
- I don't want clearml to duplicate the data; I just want it to monitor the shared folder with all the data and show only the paths of the selected files.
How can I solve this problem?
@nikiniki1 `Dataset.sync_folder` is intended to do exactly that: synchronize data between two locations.
If your use case uses a single location, I think `Dataset.add_external_files` is what you need.
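A rough sketch of how that could look for a txt file of paths, in case it helps. The helper names (`read_subsample_urls`, `register_subsample`) are made up for illustration, and I'm assuming the txt file holds one local path per line and that the shared folder is mounted at the same location on every machine that reads the dataset. The files are registered as `file://` links, so nothing should be copied into file storage:

```python
from pathlib import Path


def read_subsample_urls(list_file):
    """Read one path per line from list_file and return them as file:// URLs."""
    urls = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            # resolve() makes the path absolute, which as_uri() requires
            urls.append(Path(line).resolve().as_uri())
    return urls


def register_subsample(list_file, dataset_name, dataset_project):
    """Create a dataset version that only links to the selected files."""
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=dataset_name,
        dataset_project=dataset_project,
    )
    # One call per link keeps this compatible with versions where source_url
    # must be a single string; newer versions may also accept a list here.
    for url in read_subsample_urls(list_file):
        dataset.add_external_files(source_url=url)
    dataset.upload()    # stores the dataset state; linked files stay in place
    dataset.finalize()  # closes this version so it can be consumed
    return dataset
```

You would then call something like `register_subsample("data_path.txt", "subsample-v1", "my-project")`, where the file name, dataset name, and project name are placeholders. Please double-check the `add_external_files` arguments against the SDK version you have installed.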
Does this help?