allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Bug/Enhancement: Slow dataset verification

charlienewey-odin opened this issue

Describe the bug

Dataset verification is slow when a dataset contains lots of small files. This is especially noticeable on network storage, e.g. NFS drives.

To reproduce

Download a dataset, then download it again.

from clearml import Dataset

d = Dataset.get(dataset_id="abcdefg")

# Populate cache, verification happens here and is slow
d.get_local_copy()

# Verification on a pre-downloaded/cached dataset is also slow
d.get_local_copy()

Expected behaviour

Verification (i.e. file size checking) could in principle run in parallel, which should help on certain storage types, especially network file systems that keep multiple copies of the stored data (e.g. Ceph, GlusterFS, or in my case, GCP Filestore).
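
To illustrate the idea, here is a minimal sketch of parallel file-size checking with a thread pool. It is not ClearML's implementation: the function name `find_size_mismatches`, the `expected_sizes` mapping (relative path to size in bytes), and the `max_workers` default are all hypothetical and chosen only for this example. Threads are a reasonable fit because each check is a single `os.stat` call that mostly waits on the file system.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_size_mismatches(root, expected_sizes, max_workers=16):
    """Return sorted relative paths whose on-disk size differs from expected.

    `expected_sizes` maps relative path -> expected size in bytes. This is a
    hypothetical manifest format, not ClearML's internal representation.
    """
    root = Path(root)

    def check(item):
        rel_path, expected = item
        try:
            actual = os.stat(root / rel_path).st_size
        except OSError:
            actual = None  # a missing or unreadable file counts as a mismatch
        return rel_path if actual != expected else None

    # Each stat call is I/O-bound, so the calls can overlap across threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(check, expected_sizes.items())
        return sorted(p for p in results if p is not None)
```

An empty result would mean every file matched its expected size. Whether this actually speeds things up, and what worker count works best, will depend on the storage backend, so it would need measuring on the target file system.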

Environment

  • SDK version: 1.13.1
  • ClearML Server Version: Enterprise
  • OS/Python not relevant

Related Discussion

Slack thread: https://odin-vision.slack.com/archives/C055MNE258R/p1696591022780369