Bug/Enhancement: Slow dataset verification
charlienewey-odin opened this issue
Describe the bug
Dataset verification is slow for datasets made up of many small files, especially on network-mounted storage such as NFS drives.
To reproduce
Download a dataset, then download it again.
from clearml import Dataset
d = Dataset.get(dataset_id="abcdefg")
# Populate cache, verification happens here and is slow
d.get_local_copy()
# Verification on a pre-downloaded/cached dataset is also slow
d.get_local_copy()
Expected behaviour
Verification (i.e. file size checking) could in principle run in parallel, which should make it faster on certain storage types, especially network drives that keep multiple copies of stored data (e.g. Ceph, GlusterFS, or in my case, GCP Filestore).
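To illustrate the idea, here is a rough sketch of a parallel file size check using a thread pool. This is not how ClearML verifies datasets internally. The function name `find_size_mismatches`, the `expected_sizes` mapping (relative path to size in bytes) and the `max_workers` default are all hypothetical and only stand in for whatever manifest the SDK actually keeps.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def find_size_mismatches(expected_sizes, root, max_workers=16):
    """Return the relative paths whose on-disk size differs from the expected one.

    expected_sizes: dict of relative path -> expected size in bytes
    (hypothetical stand-in for the dataset's file manifest).
    """
    def check(item):
        rel_path, expected = item
        try:
            actual = os.path.getsize(os.path.join(root, rel_path))
        except OSError:
            actual = None  # a missing or unreadable file counts as a mismatch
        return rel_path, actual == expected

    # Each check mostly waits on a stat() call, so threads can overlap
    # the per-file round trips to a network filesystem.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, expected_sizes.items()))

    return sorted(path for path, ok in results if not ok)
```

An empty return value would mean every file matched its expected size. Whether this helps in practice depends on the storage backend, so the worker count would probably need to be configurable.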
Environment
- SDK version: 1.13.1
- ClearML Server Version: Enterprise
- OS/Python not relevant
Related Discussion
Slack thread: https://odin-vision.slack.com/archives/C055MNE258R/p1696591022780369