RunpeiDong / DreamLLM

[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation

Home Page: https://dreamllm.github.io/

How to train with laion400m data in Stage1?

pkulwj1994 opened this issue

Hi Runpei,

Thank you for your great work. I am trying to test the stage-1 training, but I find the Laion400m data setup a little confusing. How can I use the Laion400m data for training? Could you please give clear instructions? Thank you!

The dataset definition in the original code is shown below. I don't know where to get the "data/resources/laion400m_origin20m_shard_list.json" file.

source code:

    L(WebDatasetInfo)(
        name="laion400m_orig",
        description="The length and width of the image are the original size, but only 20M was downloaded.",
        dataset_type=DatasetType.ImageTextPair,
        cls=UnifiedITPairWebdataset,
        approx_size="20M",
        shard_list_path="data/resources/laion400m_origin20m_shard_list.json",
    ),

Best wishes.

Hi @pkulwj1994,

Please refer to this issue
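
For readers who have already downloaded LAION-400M as WebDataset-style tar shards, a shard list file could be generated locally along the following lines. This is only a sketch: the JSON layout that UnifiedITPairWebdataset expects is not shown in this thread, so the flat list of shard paths is an assumption, and the helper name `build_shard_list` and the shard directory in the usage comment are hypothetical. Check the format against the linked issue before relying on it.

```python
import json
from pathlib import Path


def build_shard_list(shard_dir, out_path):
    """Collect the *.tar shards directly under shard_dir and write their paths to a JSON file.

    Assumption: the shard list is a flat JSON array of shard paths.
    The real schema used by DreamLLM may differ.
    """
    # Sort so the list is stable across runs.
    shards = sorted(str(p) for p in Path(shard_dir).glob("*.tar"))

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(shards, indent=2))
    return shards


# Example call (the shard directory is a hypothetical location):
# build_shard_list(
#     "data/laion400m_shards",
#     "data/resources/laion400m_origin20m_shard_list.json",
# )
```

The output path in the example matches the `shard_list_path` value in the dataset definition above, so the config would not need to change if the assumed format turns out to be right.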