allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Local data sync into clearml-data

nikiniki1 opened this issue · comments

Hi!
I'm going to use clearml data like this:

  1. I have a dataset of around 700 GB. When I want to solve a problem, I select a subsample from it and use that as train/test data. I pass only a txt file with the paths (data_path) of the subsample.
  2. To use ClearML, I have to initialize a dataset (dataset = Dataset()) and then call dataset.sync_folder(). But if I do that, ClearML will chunk my data and upload it to file storage, so I end up with duplicates of the data.
  3. I don't want ClearML to duplicate the data. I just want it to monitor the shared folder with all the data and show only the paths of the selected files.
    How can I solve this problem?

@nikiniki1 Dataset.sync_folder is intended to do exactly that: synchronize data between two locations.
If your use case uses a single location, I think Dataset.add_external_files is what you need.
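
As a rough sketch of what that could look like for the txt-of-paths workflow described above (not tested against your setup): the helper names read_subsample_urls and register_subsample, and the project/dataset names in the usage note, are made up for illustration. It assumes the txt file holds one local path per line and that file:// links are acceptable for your shared folder.

```python
from pathlib import Path
from typing import List


def read_subsample_urls(data_path: str) -> List[str]:
    """Read a txt file with one file path per line and return file:// URLs."""
    urls = []
    for line in Path(data_path).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            # as_uri() needs an absolute path, so resolve it first
            urls.append(Path(line).resolve().as_uri())
    return urls


def register_subsample(data_path: str, dataset_name: str, dataset_project: str):
    """Create a dataset version that links to the listed files instead of copying them.

    Hypothetical helper: needs clearml installed and configured.
    """
    from clearml import Dataset  # imported lazily so the helper above works without clearml

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    for url in read_subsample_urls(data_path):
        # registers a link to the file; the file content itself is not uploaded
        dataset.add_external_files(source_url=url)
    dataset.upload()    # stores the dataset state (the links), not the files
    dataset.finalize()  # closes this dataset version
    return dataset
```

Usage would be something like register_subsample("subsample.txt", "my-subsample", "my-project"), where both names are placeholders. For a very large list of paths, calling add_external_files once per file may be slow, so it is worth checking the SDK reference for whether your clearml version accepts a list of URLs in a single call.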

Does this help?