microsoft / RegionCLIP

[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"

Repository from Github https://github.com/microsoft/RegionCLIP

region features can't match text features while calculating similarity

epistimi22 opened this issue · comments

commented

Hi, thank you for this valuable work.
I'm trying some demos based on your pretrained checkpoints, and I followed the settings in modeling.meta_arch.PretrainFastRCNN, which I believe is the base model for pretraining. According to the code in self.get_region_features and self.region_concept_matching, we obtain region features and text features, respectively. However, the text features are fixed at 1024-d while the region features are 2048-d, due to the design of ModifiedResNet.
```python
def _shared_roi_transform(self, features, boxes, backbone_res5):
    x = self.pooler(features, boxes)
    return backbone_res5(x)
```
In the function above, features is 1024-d, so x is also 1024-d; backbone_res5 is backbone.layer4, whose output is 2048-d.
I couldn't find any transformation applied before the similarity between concepts and regions is computed in region_concept_matching.
So, could you please help me with this issue? Thanks again.
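To make the mismatch concrete, here is a minimal sketch of the shape bookkeeping. The dimensions (1024-d pooled features, 2048-d after res5, 1024-d text embeddings) follow CLIP's ModifiedResNet; the 2048→1024 projection at the end is a hypothetical stand-in for whatever learned map (e.g. CLIP's attention pooling) would be needed to align the two spaces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes assumed from CLIP's ModifiedResNet: pooled region features go
# through backbone.layer4 (res5) and come out 2048-d, while the text
# encoder embeds concepts at 1024-d.
num_regions, num_concepts = 4, 10
region_feats = rng.standard_normal((num_regions, 2048))  # after res5
text_feats = rng.standard_normal((num_concepts, 1024))   # concept embeddings

# Direct similarity fails: inner dimensions disagree (2048 vs 1024).
try:
    _ = region_feats @ text_feats.T
    raised = True and False or True  # never reached if matmul succeeds
except ValueError:
    raised = True
assert raised

# A hypothetical 2048->1024 projection (a placeholder for the missing
# transformation, e.g. attention pooling) would align the two spaces.
proj = rng.standard_normal((2048, 1024)) / np.sqrt(2048)
projected = region_feats @ proj          # (num_regions, 1024)
sim = projected @ text_feats.T           # (num_regions, num_concepts)
assert sim.shape == (num_regions, num_concepts)
```

This is only meant to illustrate where the shapes disagree, not to claim how RegionCLIP actually resolves it.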

commented

Thank you for the generous response; that would be a great help.