How to upload datasets to remote S3 without compressing them?
surya9teja opened this issue
Hi, I have set up the open-source version of ClearML in a Kubernetes cluster and am doing some testing. I found that when I upload my local dataset into ClearML, it is compressed into a ZIP archive. Is there any way I can upload files without compressing them? Most of my dataset consists of images and PDFs.
from clearml import Dataset

# Create the dataset, pointing output at the S3 bucket
dataset = Dataset.create(
    dataset_name="sample",
    dataset_project="test",
    output_uri="s3://sssss/clearml",
    description="sample testing dataset",
)

# Register all JPG files under the local folder
dataset.add_files(
    path="sample_dataset",
    wildcard="*.jpg",
    recursive=True,
)

# Upload to the remote storage
dataset.upload(
    show_progress=True,
    verbose=True,
    compression=None,
    retries=3,
)
Also, can anyone point me to documentation for running ClearML in Kubernetes with external MongoDB and Redis instances, instead of creating them in the cluster?
And does file uploading have an API endpoint that I can use in my current frontend setup?
Hi @surya9teja, currently bypassing compression is not supported, but it's a good idea, and we will add it in the next version 🙂
As for your other questions, see here for where to provide connection strings for external databases (instead of the ones automatically deployed by the clearml chart)
Regarding the file uploading, where in your frontend would you like to use it? The ClearML fileserver uses a simple HTTP form upload using multipart/form-data.
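A minimal sketch of what such a multipart form upload could look like from client code, using the `requests` library. The fileserver address and the `<project>/<task>/<filename>` path layout below are assumptions for illustration, not a documented ClearML endpoint; here the request is only prepared, not sent, so the multipart encoding can be inspected:

```python
# Hypothetical sketch: building a multipart/form-data upload request,
# the kind of plain HTTP form upload the ClearML fileserver accepts.
# FILESERVER_URL and the URL path layout are assumptions for illustration.
import requests

FILESERVER_URL = "http://localhost:8081"  # assumed fileserver address


def build_upload_request(project: str, task_id: str,
                         filename: str, data: bytes):
    """Prepare (without sending) a multipart file-upload request."""
    url = f"{FILESERVER_URL}/{project}/{task_id}/{filename}"
    req = requests.Request(
        "POST",
        url,
        files={filename: (filename, data)},  # encoded as multipart/form-data
    )
    return req.prepare()


prepared = build_upload_request("test", "abc123", "sample.jpg", b"\xff\xd8\xff")
# The prepared request carries a multipart/form-data content type:
print(prepared.headers["Content-Type"].split(";")[0])  # multipart/form-data
```

Sending it would then be a matter of `requests.Session().send(prepared)`; any HTTP client capable of multipart form posts works the same way from a frontend.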
Additionally, compressing all the files in large datasets made up of thousands of small files is extremely slow. Disabling compression could solve this issue.