microsoft / RegionCLIP

[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"

The training pipeline and training time of this work

linhuixiao opened this issue · comments

  1. I understand that the paper uses the ResNet and text transformer of CLIP for initialization, then pre-trains on the region-text pairs extracted from CC3M, and finally combines the pre-trained model with an RPN that is trained on the base classes. Is that right?

  2. CC3M contains about 3 million images. How many GPUs are used for pre-training, and how much training time does it take? This is not mentioned in the paper.

Thanks.

  1. Yes. The visual backbone is initialized from CLIP and further pre-trained on region-text pairs (see the sketch after this list). The RPN is trained on the boxes of the base classes, but without using their category labels.
  2. You can refer to the pre-training script here. For ResNet-50, pre-training takes roughly 6 days on 32 V100 GPUs.
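
For reference, here is a minimal sketch of what step 1 boils down to: a CLIP-initialized backbone produces a feature map, features are pooled for each pseudo-labeled region box, and a contrastive loss aligns them with the CLIP text embeddings of the matched concepts. This is not the official RegionCLIP code; the function signature, the RoIAlign pooling, the projection layer, and the symmetric InfoNCE-style loss are illustrative assumptions.

```python
# Hypothetical sketch of region-text contrastive pre-training (not the repo's exact code).
import torch
import torch.nn.functional as F
import torchvision


def region_text_contrastive_loss(images, region_boxes, text_embeddings,
                                 visual_backbone, proj, temperature=0.01):
    """Align pooled region features with their paired concept text embeddings.

    images:          (B, 3, H, W) batch of images
    region_boxes:    list of B tensors, each (N_i, 4) in (x1, y1, x2, y2) image coords
    text_embeddings: (sum N_i, D) CLIP text embeddings of the matched concepts
    visual_backbone: CLIP-initialized CNN returning a (B, C, H', W') feature map
    proj:            linear layer mapping pooled region features (C) to D
    """
    feat_map = visual_backbone(images)                      # (B, C, H', W')
    stride = images.shape[-1] / feat_map.shape[-1]          # assume a uniform stride
    # Pool one feature per region box (RoIAlign over the backbone feature map).
    region_feats = torchvision.ops.roi_align(
        feat_map, region_boxes, output_size=(1, 1), spatial_scale=1.0 / stride
    ).flatten(1)                                            # (sum N_i, C)
    v = F.normalize(proj(region_feats), dim=-1)             # (sum N_i, D)
    t = F.normalize(text_embeddings, dim=-1)                # (sum N_i, D)
    logits = v @ t.t() / temperature                        # region-to-text similarities
    targets = torch.arange(v.shape[0], device=v.device)     # i-th region <-> i-th text
    # Symmetric InfoNCE-style loss over regions and texts.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```
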

Thanks a lot for your kind reply.