allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

How to upload datasets to remote S3 without compressing them?

surya9teja opened this issue · comments

Hi, I have set up the open-source version of ClearML in a Kubernetes cluster and am doing some testing. I found that when I upload my local dataset to ClearML, it gets compressed into a zip archive. Is there any way to upload files without compressing them? Most of my dataset consists of images and PDFs.

from clearml import Dataset

# Create a new dataset whose files will be stored on S3
dataset = Dataset.create(
    dataset_name="sample",
    dataset_project="test",
    output_uri="s3://sssss/clearml",
    description="sample testing dataset",
)

# Queue all jpg files under the local folder for upload
dataset.add_files(
    path="sample_dataset",
    wildcard="*.jpg",
    recursive=True,
)

# Note: compression=None does not disable compression; it falls back to
# the default (ZIP_DEFLATED), so the files still end up in zip archives
dataset.upload(
    show_progress=True,
    verbose=True,
    compression=None,
    retries=3,
)

Also, can anyone point me to documentation for setting up ClearML on Kubernetes with external MongoDB and Redis instances instead of creating them in the cluster?
And does file uploading have an API endpoint that I can use from my current frontend setup?

Hi @surya9teja, currently bypassing compression is not supported, but it's a good idea, and we will add it in the next version 🙂
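
In the meantime, one possible workaround (a sketch only, not an official recommendation, and it depends on your clearml version supporting `add_external_files`) is to copy the raw files to S3 yourself and register them as external files, so the dataset tracks links instead of zipped archives. The bucket path below reuses the one from your snippet; everything else is illustrative:

from pathlib import Path

from clearml import Dataset, StorageManager

# Copy the raw files to S3 ourselves, so nothing is ever zipped
local_root = Path("sample_dataset")
remote_root = "s3://sssss/clearml/sample_dataset"
for local_file in local_root.rglob("*.jpg"):
    StorageManager.upload_file(
        local_file=str(local_file),
        remote_url=f"{remote_root}/{local_file.relative_to(local_root)}",
    )

# Register the uploaded files as external links; the dataset then stores
# only the S3 URLs and never compresses or re-uploads the files themselves
dataset = Dataset.create(dataset_name="sample", dataset_project="test")
dataset.add_external_files(source_url=remote_root, recursive=True)
dataset.upload()   # uploads only the dataset metadata
dataset.finalize()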

As for your other questions, see here for where to provide connection strings for external databases (instead of the ones automatically deployed by the ClearML chart).
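
For illustration only, a values override along these lines is the general shape the chart expects; the exact key names vary between chart versions, so treat this as a hypothetical example and verify against the values.yaml of the chart release you deploy:

# values-override.yaml (hypothetical; check key names for your chart version)
mongodb:
  enabled: false        # do not deploy MongoDB in-cluster
redis:
  enabled: false        # do not deploy Redis in-cluster

externalServices:
  mongodbConnectionStringAuth: "mongodb://user:pass@mongo.example.com:27017/auth_db"
  mongodbConnectionStringBackend: "mongodb://user:pass@mongo.example.com:27017/backend_db"
  redisHost: "redis.example.com"
  redisPort: 6379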

Regarding the file uploading, where in your frontend would you like to use it? The ClearML fileserver uses a simple HTTP form upload with multipart encoding.
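
As a minimal sketch of what such an upload can look like with Python's requests library (the URL and paths are placeholders, assuming a default fileserver listening on port 8081 that stores each uploaded file under the path given in its multipart filename):

import requests

# Placeholder fileserver address; replace with your deployment's URL
FILESERVER_URL = "http://localhost:8081"

# The filename field of the multipart form determines where the file is
# stored, e.g. "test/sample/image_01.jpg" would be served back from
# <fileserver>/test/sample/image_01.jpg
with open("image_01.jpg", "rb") as f:
    response = requests.post(
        FILESERVER_URL,
        files={"file": ("test/sample/image_01.jpg", f, "image/jpeg")},
    )
response.raise_for_status()
print(response.text)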

Additionally, compressing all the files in a large dataset made up of thousands of small files is extremely slow, so disabling compression could also solve that issue.