johncaged / OPT_Questioner

Official PyTorch implementation of the paper "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner"


Question about preparing CC3M

xk-huang opened this issue · comments

Thank you so much for releasing the code!

I am wondering how I can prepare the CC3M dataset. Are there any tutorials about this?
Besides, CC3M is huge, so I'm also curious about storage requirements and how to resume the training process after an interruption.

Thanks in advance!

Hi there! For the CC3M dataset, we strongly recommend following the official instructions at https://github.com/google-research-datasets/conceptual-captions. The file downloaded from the official CC3M release contains image-URL/caption pairs, and the image URLs can be converted to the 'image ids' in our CC3M-QA-DC dataset as described in our README.md.

You can then use a tool such as 'img2dataset' (https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) to download the images in the CC3M dataset. Data preparation may vary across downloading tools, but the most important thing is a correct mapping between the image URLs (in the CC3M dataset) and the image ids (in our CC3M-QA-DC dataset).

The full set of training images takes about 450GB of storage, though this may vary depending on when you collect the images, because some URLs become invalid over time. Once everything is in place, you can use the QA and dense captioning data in our CC3M-QA-DC dataset to train your own model.

As for resuming training, there is nothing different from other training pipelines: simply load the checkpoint, restore the other settings (lr_decay, etc.), and continue training on the remaining data.
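For concreteness, here is a minimal download sketch using the img2dataset Python API, following its CC3M example. The TSV path is a placeholder, and the 'url'/'caption' column names assume you have added a header row to the official TSV as the img2dataset CC3M guide suggests; adjust image_size and the output options to your own pipeline.

from img2dataset import download

# Download CC3M images from the official URL/caption TSV.
# Assumes the TSV has a header row with "caption" and "url" columns.
download(
    url_list="cc3m.tsv",
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",
    output_folder="cc3m",
    processes_count=16,
    thread_count=64,
    image_size=256,
)

And a generic PyTorch resume sketch; the checkpoint keys and the stand-in model here are hypothetical, so use whatever structure your own training script saves.

import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Restore model/optimizer/scheduler state from a saved checkpoint.
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])  # restores the lr decay state
start_epoch = ckpt["epoch"] + 1  # continue training on the remaining data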

@johncaged can you provide more detail about the CC3M-QA-DC dataset, such as the image shapes and padding information? I have downloaded the CC3M-QA-DC JSON files and the CC3M images, but the bboxes and the objects in the images mismatch when I visualize the bboxes on the images.

In order to facilitate pre-training, the bboxes of the objects are linearly scaled to integers in [0, 223] (i.e., onto a 224x224 grid). You can transform the bboxes back according to the actual size of each image, like the following:

bbox = [1, 2, 3, 4]  # [x1, y1, x2, y2] on the 0-223 grid
H, W = (1080, 1920)  # actual image height and width

# Scale x coordinates by W / 224 and y coordinates by H / 224.
real_bbox = [0, 0, 0, 0]
real_bbox[0] = round(bbox[0] * W / 224)
real_bbox[1] = round(bbox[1] * H / 224)
real_bbox[2] = round(bbox[2] * W / 224)
real_bbox[3] = round(bbox[3] * H / 224)
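For reference, here is a minimal sketch that applies this rescaling and draws the result with Pillow, which can help verify that the boxes line up; the image path is a placeholder, and the [x1, y1, x2, y2] box format is assumed.

from PIL import Image, ImageDraw

img = Image.open("example.jpg")  # hypothetical path to a downloaded CC3M image
W, H = img.size  # note: PIL returns (width, height)

bbox = [10, 20, 150, 200]  # [x1, y1, x2, y2] on the 0-223 grid
real_bbox = [round(v * s / 224) for v, s in zip(bbox, [W, H, W, H])]

draw = ImageDraw.Draw(img)
draw.rectangle(real_bbox, outline="red", width=3)
img.save("example_with_bbox.jpg")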

However, the predictions of the object detection model still exhibit some level of noise, so we recommend using this dataset in the pre-training process.