The training pipeline and training time of this work
linhuixiao opened this issue
Linhui Xiao commented
- If I understand correctly, the paper initializes the ResNet and text transformer from CLIP, pre-trains them on region-text pairs extracted from CC3M, and then attaches an RPN to the pre-trained model and trains the RPN on the base classes. Is that right?
- CC3M contains 3 million images. How many GPUs were used for pre-training, and how much training time did it take? This is not mentioned in the paper.

Thanks.
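The pipeline described in this question can be sketched as follows. This is a minimal illustrative outline only; the function names and stub bodies are invented for clarity and are not the actual API of the codebase.

```python
# Illustrative sketch of the two-stage training flow asked about above.
# All names here (load_clip_weights, pretrain_on_region_text,
# train_rpn_on_base_boxes) are hypothetical placeholders, not real repo code.

def load_clip_weights():
    """Stage 0: initialize the visual backbone and text encoder from CLIP."""
    return {"visual_backbone": "CLIP-ResNet", "text_encoder": "CLIP-transformer"}

def pretrain_on_region_text(model, region_text_pairs):
    """Stage 1: further pre-train the visual backbone on region-text
    pairs mined from CC3M."""
    model["visual_backbone"] += " + region-text pre-training"
    return model

def train_rpn_on_base_boxes(model, base_class_boxes):
    """Stage 2: train an RPN on base-class boxes only; the categorical
    labels of those boxes are not used (class-agnostic proposals)."""
    model["rpn"] = f"trained on {len(base_class_boxes)} base-class boxes"
    return model

model = load_clip_weights()
model = pretrain_on_region_text(model, region_text_pairs=[("region", "caption")])
model = train_rpn_on_base_boxes(model, base_class_boxes=[(0, 0, 10, 10)])
print(sorted(model))  # ['rpn', 'text_encoder', 'visual_backbone']
```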
Yiwu Zhong commented
- Yes. The visual backbone is initialized from CLIP and further pre-trained on region-text pairs. The RPN is trained on the boxes of the base classes, but without using their categorical labels.
- You can refer to the pre-training script here. For ResNet-50, pre-training takes roughly 6 days on 32 V100 GPUs.
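For a rough sense of the total compute this answer implies, the figures quoted above (6 days on 32 V100 GPUs) work out to:

```python
# Back-of-envelope compute estimate from the numbers in the reply above:
# roughly 6 days of pre-training on 32 V100 GPUs over CC3M (~3M images).
# The epoch count is not stated, so no per-epoch throughput is derived.

days = 6
gpus = 32
gpu_hours = days * 24 * gpus  # total GPU-hours for the run

print(gpu_hours)  # 4608
```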
Linhui Xiao commented
Thanks a lot for your kind reply.