svjack / img2dataset-pq2hf-transform-toolkit

A simple toolkit for converting data downloaded by img2dataset from parquet files into a Hugging Face dataset.


Brief introduction

img2dataset can easily turn large sets of image URLs into an image dataset. It can download, resize, and package 100M URLs in 20 hours on one machine, which makes it a simple and convenient tool for retrieving image datasets.

Unfortunately, it does not provide a tool for converting the downloaded dataset into the official Hugging Face dataset format, i.e. datasets.

Once the data is in the datasets format, img2dataset connects seamlessly to the many projects in the Hugging Face transformers ecosystem. This project provides a simple toolkit for converting data downloaded by img2dataset from parquet files into a Hugging Face dataset. It was tested on a sample of the Conceptual Captions (CC3M) dataset and also works on the full CC3M dataset: I tested it by training svjack/concept-caption-3m-sd-lora-en and svjack/concept-caption-3m-sd-lora-zh on 400000 images downloaded with this project.

Installation

pip install -r requirements.txt

Usage steps

  • 1 Call img2dataset from the console to download images with captions
#### run in shell
img2dataset --url_list data/cc3m_1000.tsv --input_format "tsv"\
         --url_col "url" --caption_col "caption" --output_format parquet\
           --output_folder data/cc3m_files_no --processes_count 16 --thread_count 64 --resize_mode no\
             --enable_wandb False

The command above downloads the dataset into data/cc3m_files_no in parquet format.
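The command in step 1 reads a tab-separated file whose header row names the columns passed as --url_col and --caption_col. As a minimal sketch, assuming you start from a list of (url, caption) pairs, such a file could be written with Python's csv module. The helper name write_url_tsv and the example row are hypothetical and not part of this toolkit.

```python
import csv
from pathlib import Path


def write_url_tsv(rows, path):
    """Write (url, caption) pairs to a TSV file with a `url`/`caption` header.

    Hypothetical helper for illustration; not part of the toolkit.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["url", "caption"])  # must match --url_col / --caption_col
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    # Placeholder row; replace with real image URLs and captions.
    write_url_tsv(
        [("https://example.com/cat.jpg", "a cat on a sofa")],
        "data/my_urls.tsv",
    )
```

The resulting file can then be passed to img2dataset through --url_list in place of data/cc3m_1000.tsv.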

  • 2 Follow main.py to process the dataset step by step. This collects all valid parquet files, keeps only the successfully downloaded images, and saves them into a single parquet file at "data/cc3m_tiny_no.parquet"
file_list = retrieve_all_valid_path("data/cc3m_files_no/")
target_parquet_path = "data/cc3m_tiny_no.parquet"
save_to_one_parquet_func(file_list, target_parquet_path)
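To give a rough idea of the file-collection part of step 2, here is a minimal sketch that gathers the parquet shards from the output folder. It is not the toolkit's retrieve_all_valid_path: the helper name list_parquet_shards is hypothetical, and treating "non-empty .parquet file" as the validity check is an assumption made for illustration.

```python
from pathlib import Path


def list_parquet_shards(folder):
    """Return the non-empty *.parquet files under `folder`, sorted by path.

    Hypothetical helper; the validity rule (file exists and is non-empty)
    is an assumption, not the toolkit's actual check.
    """
    return sorted(
        p
        for p in Path(folder).rglob("*.parquet")
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    for shard in list_parquet_shards("data/cc3m_files_no/"):
        print(shard)
```

Sorting keeps the shard order stable between runs, which makes the merged parquet file reproducible.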
  • 3 Transform "data/cc3m_tiny_no.parquet" into the Hugging Face dataset format

The implementation decodes the downloaded image bytes into PIL images, then saves them and reads them back to construct a Hugging Face dataset. The image path is constructed from sha256_column.

To build the path from a different column instead, set sha256_column = None and set sha256_gen_apply_column to the column you want to use. sha256_gen_apply_column can be a text column.

When both sha256_column = None and sha256_gen_apply_column = None, the program uses image_col as input and calls tobytes to generate the sha256.

ds = transform_to_hf_ds(target_parquet_path, "jpg", "caption",
    image_process_func = jpg_val_to_img,
    sha256_column = "sha256"
)

This produces a Hugging Face dataset instance as output.
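The three ways of choosing the image file name in step 3 (an existing sha256 column, a hash of a text column, or a hash of the image itself) can be sketched as follows. This is an illustration only, not the toolkit's code: the helper image_key is hypothetical, a row is modeled as a plain dict, and the image column is assumed to hold raw bytes, whereas the toolkit calls tobytes on the image.

```python
import hashlib


def image_key(row, image_col="jpg", sha256_column="sha256",
              sha256_gen_apply_column=None):
    """Pick the hex digest used as the image file name for one row.

    Hypothetical helper; `row` is a dict and `row[image_col]` is assumed
    to be bytes.
    """
    if sha256_column is not None:
        # Case 1: reuse the digest already stored in the row.
        return row[sha256_column]
    if sha256_gen_apply_column is not None:
        # Case 2: hash a text column, e.g. the caption or the URL.
        text = str(row[sha256_gen_apply_column])
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Case 3: fall back to hashing the image bytes themselves.
    return hashlib.sha256(bytes(row[image_col])).hexdigest()


if __name__ == "__main__":
    row = {"jpg": b"\xff\xd8...", "caption": "a cat", "sha256": "0123abcd"}
    print(image_key(row))                                  # stored digest
    print(image_key(row, sha256_column=None,
                    sha256_gen_apply_column="caption"))    # hash of caption
    print(image_key(row, sha256_column=None))              # hash of bytes
```

Hashing a text column gives identical file names to rows that share the same text, so a column with unique values, such as the URL, is the safer choice for case 2.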

  • 4 Push the final result to the Hub (or keep it for your own use)
ds.push_to_hub("svjack/cc3m_500_sample")

Contact

svjack - svjackbt@gmail.com - ehangzhou@outlook.com

Project Link: https://github.com/svjack/img2dataset-pq2hf-transform-toolkit

