allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Enhancement Request: Improved ClearML Data Management for Child Datasets

Heegreis opened this issue

Proposal Summary

I've been using ClearML Data and ran into two issues with child datasets:

  1. When I rename a file or change its path without changing its content, the "FILES CHANGED" log reports "Added 1" and "Removed 1". It would be more intuitive to record this as "Renamed", similar to Git's behavior. In addition, the fileserver keeps a duplicate copy of the file after the rename; this could be avoided by linking files in child datasets to the parent dataset's files via their SHA identifiers. For reference, this is how I renamed a file in a child dataset with ClearML Data: remove the file, then add the same file under the new name.

  2. If a file is removed and then the same file (same filename, path, and content) is added back, the "FILES CHANGED" log registers it as "Modified 1", even though the dataset content has not actually changed. The fileserver also stores the identical file a second time.
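To make the requested behavior concrete, here is a minimal sketch of hash-based change classification. It is not ClearML code and does not use the ClearML API: the function names `classify_changes` and `new_blobs`, the use of SHA-256, and the representation of a dataset as a mapping from path to content hash are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def sha256_of(content: bytes) -> str:
    """Content hash used as the file identity (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def classify_changes(parent: Dict[str, str], child: Dict[str, str]) -> Dict[str, list]:
    """Compare two hypothetical dataset manifests (path -> content hash).

    A path that disappears and whose hash reappears under a new path is
    reported as a rename; a path whose hash is identical in both manifests
    is reported as unchanged, even if it was removed and re-added.
    """
    unchanged = sorted(p for p in child if parent.get(p) == child[p])
    modified = sorted(p for p in child if p in parent and parent[p] != child[p])

    # Index the paths missing from the child by their content hash.
    removed_by_hash: Dict[str, List[str]] = {}
    for path in sorted(p for p in parent if p not in child):
        removed_by_hash.setdefault(parent[path], []).append(path)

    renamed: List[Tuple[str, str]] = []
    added: List[str] = []
    for path in sorted(p for p in child if p not in parent):
        candidates = removed_by_hash.get(child[path])
        if candidates:
            # Same content vanished elsewhere: treat it as a rename.
            renamed.append((candidates.pop(0), path))
        else:
            added.append(path)

    removed = sorted(p for paths in removed_by_hash.values() for p in paths)
    return {
        "unchanged": unchanged,
        "modified": modified,
        "renamed": renamed,
        "added": added,
        "removed": removed,
    }


def new_blobs(parent: Dict[str, str], child: Dict[str, str]) -> List[str]:
    """Hashes the child would need to upload if storage were keyed by hash."""
    return sorted(set(child.values()) - set(parent.values()))


if __name__ == "__main__":
    parent = {"a.txt": sha256_of(b"A"), "b.txt": sha256_of(b"B")}
    # b.txt renamed, a.txt removed and re-added with identical content.
    child = {"a.txt": sha256_of(b"A"), "moved/b.txt": sha256_of(b"B")}
    print(classify_changes(parent, child))
    print(new_blobs(parent, child))
```

With storage keyed by content hash, both scenarios above would require no new upload, since every hash in the child already exists in the parent.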

Motivation

By addressing these issues, I believe we can achieve better dataset state management and significantly reduce the fileserver's storage consumption.

Thanks for the proposal, @Heegreis.

We'll look into how this can be addressed in future versions.