The training pipeline and training time of this work
linhuixiao opened this issue
Linhui Xiao commented
- If I understand correctly, the paper initializes the ResNet and text transformer from CLIP, pre-trains them on region-text pairs extracted from CC3M, and then attaches an RPN to the pre-trained model and trains the RPN on the base classes. Is that right?
- CC3M contains 3 million images. How many GPUs were used for pre-training, and how much training time did it take? This is not mentioned in the paper.

Thanks.
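The pipeline described in this question can be sketched as follows. This is a minimal illustrative outline only; the function names and stub bodies are invented for clarity and are not the actual API of the codebase.

```python
# Illustrative sketch of the two-stage training flow asked about above.
# All names here (load_clip_weights, pretrain_on_region_text,
# train_rpn_on_base_boxes) are hypothetical placeholders, not real repo code.

def load_clip_weights():
    """Stage 0: initialize the visual backbone and text encoder from CLIP."""
    return {"visual_backbone": "CLIP-ResNet", "text_encoder": "CLIP-transformer"}

def pretrain_on_region_text(model, region_text_pairs):
    """Stage 1: further pre-train the visual backbone on region-text
    pairs mined from CC3M."""
    model["visual_backbone"] += " + region-text pre-training"
    return model

def train_rpn_on_base_boxes(model, base_class_boxes):
    """Stage 2: train an RPN on base-class boxes only; the categorical
    labels of those boxes are not used (class-agnostic proposals)."""
    model["rpn"] = f"trained on {len(base_class_boxes)} base-class boxes"
    return model

model = load_clip_weights()
model = pretrain_on_region_text(model, region_text_pairs=[("region", "caption")])
model = train_rpn_on_base_boxes(model, base_class_boxes=[(0, 0, 10, 10)])
print(sorted(model))  # ['rpn', 'text_encoder', 'visual_backbone']
```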
Yiwu Zhong commented
- Yes. The visual backbone is initialized from CLIP and further pre-trained on region-text pairs. The RPN is trained on the boxes of the base classes, but without using their categorical labels.
- You can refer to the pre-training script here. For ResNet-50, pre-training takes roughly 6 days on 32 V100 GPUs.
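For a rough sense of the total compute this answer implies, the figures quoted above (6 days on 32 V100 GPUs) work out to:

```python
# Back-of-envelope compute estimate from the numbers in the reply above:
# roughly 6 days of pre-training on 32 V100 GPUs over CC3M (~3M images).
# The epoch count is not stated, so no per-epoch throughput is derived.

days = 6
gpus = 32
gpu_hours = days * 24 * gpus  # total GPU-hours for the run

print(gpu_hours)  # 4608
```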
Linhui Xiao commented
Thanks a lot for your kind reply.