Enhancement Request: Improved ClearML Data Management for Child Datasets

Question

Heegreis opened this issue a year ago

Proposal Summary

I've been using ClearML Data and encountered several issues with child datasets. Specifically:

When a file is renamed or moved without its content changing, the "FILES CHANGED" log reports "Added 1" and "Removed 1". It would be more intuitive to record this as "Renamed", similar to Git's behavior. The fileserver also keeps a duplicate copy of the file after the rename; this could be avoided by linking files in child datasets to the corresponding parent dataset files through their SHA identifiers. For reference, this is how I renamed files in a child dataset with ClearML Data: remove the file -> add the same file under a new name.
When a file is removed and then the same file (same filename and path) is added back, the "FILES CHANGED" log reports "Modified 1", even though the dataset content has not changed. The fileserver also stores the identical file (same filename, path, and content) a second time.
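
The requested behavior can be illustrated as a diff over content hashes rather than over file paths alone. The sketch below is not ClearML's implementation; `classify_changes` and the manifest layout (dataset path mapped to a SHA-256 hex digest) are hypothetical names chosen for illustration. It assumes each dataset version can be summarized as such a manifest.

```python
import hashlib
from collections import defaultdict


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def classify_changes(parent: dict[str, str], child: dict[str, str]) -> dict[str, list]:
    """Compare two manifests (path -> content hash) and classify each file.

    Hypothetical helper: a path present in both versions is "unchanged" or
    "modified" depending on its hash; a new path whose hash matches a
    removed path is reported as a rename instead of add + remove.
    """
    changes = {"unchanged": [], "modified": [], "renamed": [], "added": [], "removed": []}

    # Paths that exist in both versions: compare content hashes.
    for path in sorted(parent.keys() & child.keys()):
        kind = "unchanged" if parent[path] == child[path] else "modified"
        changes[kind].append(path)

    # Index paths that disappeared from the child by their content hash.
    removed_by_hash = defaultdict(list)
    for path in sorted(parent.keys() - child.keys()):
        removed_by_hash[parent[path]].append(path)

    # A new path with the same hash as a removed path is a rename.
    for path in sorted(child.keys() - parent.keys()):
        candidates = removed_by_hash.get(child[path])
        if candidates:
            changes["renamed"].append((candidates.pop(0), path))
        else:
            changes["added"].append(path)

    # Whatever was not matched to a new path is a genuine removal.
    for paths in removed_by_hash.values():
        changes["removed"].extend(paths)
    changes["removed"].sort()
    return changes
```

With this approach, removing a file and re-adding it unchanged under the same path leaves the manifest entry identical, so it is classified as "unchanged" rather than "Modified 1".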

By addressing these issues, I believe we can achieve better dataset state management and significantly reduce the fileserver's storage consumption.
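
The storage saving can be sketched with a content-addressed store, in which a file's bytes are kept once under their hash and each dataset version only records which hash every path points to. This is a minimal in-memory illustration under that assumption; `ContentStore` and `build_manifest` are hypothetical names and do not describe how the ClearML fileserver works.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory blob store keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data once per distinct content and return its digest."""
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing blob if this content was seen before.
        self.blobs.setdefault(digest, data)
        return digest

    def stored_bytes(self) -> int:
        """Total bytes physically held by the store."""
        return sum(len(blob) for blob in self.blobs.values())


def build_manifest(store: ContentStore, files: dict[str, bytes]) -> dict[str, str]:
    """Upload files and return a manifest mapping path -> content digest."""
    return {path: store.put(data) for path, data in files.items()}


if __name__ == "__main__":
    store = ContentStore()
    image = b"\x00" * 1000  # stand-in for real file content

    # Parent dataset holds the file; the child renames it without editing it.
    parent = build_manifest(store, {"images/cat.png": image})
    child = build_manifest(store, {"images/kitten.png": image})

    print(len(store.blobs), store.stored_bytes())
```

Because both manifests point at the same digest, the renamed file costs no extra storage; only the path-to-hash mapping differs between parent and child.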

Noam Wasersprung · Answer 1 · Sun Aug 13 2023 22:48:59 GMT+0800 (China Standard Time)

Thanks for the proposal, @Heegreis.

We'll look into how this can be addressed in future versions.