rom1504 / laion-prepro

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them


How to set '--url_list' parameter in download_images.sh?

qiaogh97 opened this issue · comments

Hi @rom1504 !
If I'd like to download images from 'part-[00000-00031]-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet', how should I set the '--url_list' parameter? Should I make a directory named 'laion400m-meta' and put all the *.parquet files in it?

Another question: can I set '--url_list' to just one of the '*.parquet' files in order to download only part of the dataset? For example:

```shell
img2dataset --url_list part-00000-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet \
  --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" \
  --output_format webdataset \
  --output_folder your_output_folder \
  --processes_count 16 --thread_count 128 \
  --image_size 256 \
  --save_additional_columns '["NSFW","similarity","LICENSE"]' \
  --enable_wandb True
```

Hi, yes to both questions!
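For the first question, a minimal sketch of the directory variant: img2dataset's `--url_list` also accepts a folder, in which case it processes every shard inside it. The folder and output names below (`laion400m-meta`, `laion400m-data`) are placeholders, not names mandated by the tool.

```shell
# Collect all metadata shards into one folder (placeholder name).
mkdir -p laion400m-meta
mv part-*-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet laion400m-meta/

# Point --url_list at the folder instead of a single file;
# img2dataset will iterate over all parquet shards it finds there.
img2dataset --url_list laion400m-meta \
  --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" \
  --output_format webdataset \
  --output_folder laion400m-data \
  --processes_count 16 --thread_count 128 \
  --image_size 256 \
  --save_additional_columns '["NSFW","similarity","LICENSE"]'
```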

Thank you!