Object Centric Open Vocabulary Detection (NeurIPS 2022)

Official repository of the paper "Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection".

Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

🚀 News

(Sep 15, 2022)
- Paper accepted at NeurIPS 2022.
(July 7, 2022)
- Training and evaluation code with pretrained models are released.

Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include the pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects while the image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an absolute 11.9 gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall.

Main Contributions

Region-based Knowledge Distillation (RKD) adapts image-centric language representations to be object-centric.
Pseudo Image-level Supervision (PIS) uses weak image-level supervision from pretrained multi-modal ViTs (MAVL) to improve generalization of the detector to novel classes.
Weight Transfer function efficiently combines the above two proposed components.
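
The intuition behind RKD, bringing the detector's region embeddings closer to a pretrained teacher's embedding space, can be illustrated with a toy distillation objective. The sketch below assumes a mean L1 distance between L2-normalized, paired embeddings. The loss choice, the function names, and the toy vectors are all illustrative assumptions and should not be read as the repository's actual implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def l1_distillation_loss(region_embeddings, teacher_embeddings):
    """Mean L1 distance between L2-normalized student/teacher embedding pairs.

    Hypothetical helper for illustration only: each student (detector region)
    embedding is compared with the teacher embedding of the same region.
    """
    total, count = 0.0, 0
    for student, teacher in zip(region_embeddings, teacher_embeddings):
        s, t = l2_normalize(student), l2_normalize(teacher)
        total += sum(abs(a - b) for a, b in zip(s, t))
        count += len(s)
    return total / count if count else 0.0


# Toy data: the first region already matches its teacher, the second does not.
student = [[3.0, 4.0], [1.0, 0.0]]
teacher = [[3.0, 4.0], [0.0, 1.0]]
loss = l1_distillation_loss(student, teacher)
print(f"distillation loss: {loss:.3f}")
```

A loss of zero would mean every region embedding points in the same direction as its teacher embedding; larger values indicate regions that are still poorly aligned.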

Installation

The code is tested with PyTorch 1.10.0 and CUDA 11.3. After cloning the repository, follow the steps in INSTALL.md. All of our models are trained using 8 A100 GPUs.

Results

We present the performance of our Object-centric Open Vocabulary object detector, which demonstrates state-of-the-art results on the Open Vocabulary COCO and LVIS benchmark datasets. For COCO, base and novel categories are shown in purple and green, respectively.

Open-vocabulary COCO

Effect of individual components in our method. Our weight transfer method provides complementary gains from RKD and PIS, achieving superior results compared to naively adding both components.

Name	APnovel	APbase	AP	Train-time	Download
Base-OVD-RCNN-C4	1.7	53.2	39.6	8h	model
COCO_OVD_Base_RKD	21.6	54.4	45.8	8h	model
COCO_OVD_Base_PIS	34.2	52.0	47.4	8.5h	model
COCO_OVD_RKD_PIS	35.3	52.9	48.3	8.5h	model
COCO_OVD_RKD_PIS_WeightTransfer	40.3	54.1	50.5	8.5h	model
COCO_OVD_RKD_PIS_WeightTransfer_8x	40.5	56.7	52.5	2.5 days	model
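
To make the ablation easier to read, the novel-class gains can be recomputed directly from the table. The short sketch below copies the APnovel, APbase and AP values from the rows above (the 8x schedule is left out because its training time differs); the dictionary layout and the `novel_gain` helper are illustrative names, not part of the repository.

```python
# (APnovel, APbase, AP) copied from the open-vocabulary COCO table above.
results = {
    "Base-OVD-RCNN-C4": (1.7, 53.2, 39.6),
    "COCO_OVD_Base_RKD": (21.6, 54.4, 45.8),
    "COCO_OVD_Base_PIS": (34.2, 52.0, 47.4),
    "COCO_OVD_RKD_PIS": (35.3, 52.9, 48.3),
    "COCO_OVD_RKD_PIS_WeightTransfer": (40.3, 54.1, 50.5),
}


def novel_gain(name, reference):
    """Difference in APnovel between two table rows, rounded to one decimal."""
    return round(results[name][0] - results[reference][0], 1)


# Gain of every variant over the supervised base model.
for name in results:
    print(f"{name}: {novel_gain(name, 'Base-OVD-RCNN-C4'):+.1f} APnovel vs. base")

# Gain of weight transfer over naively combining RKD and PIS.
wt_vs_naive = novel_gain("COCO_OVD_RKD_PIS_WeightTransfer", "COCO_OVD_RKD_PIS")
print(f"Weight transfer vs. naive RKD+PIS: {wt_vs_naive:+.1f} APnovel")
```

Read this way, the weight transfer row adds 5.0 APnovel on top of the naive RKD+PIS combination at the same reported training time.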

New LVIS Baseline

Our Mask R-CNN based LVIS baseline (mask_rcnn_R50FPN_CLIP_sigmoid) achieves 12.2 rare-class AP and 20.9 overall AP, and trains in only 4.5 hours on 8 A100 GPUs. We believe it can serve as a good baseline for future research in the LVIS OVD setting.

Name	APr	APc	APf	AP	Epochs
PromptDet Baseline	7.4	17.2	26.1	19.0	12
ViLD-text	10.1	23.9	32.5	24.9	384
Ours Baseline	12.2	19.4	26.4	20.9	12

Open-vocabulary LVIS

Name	APr	APc	APf	AP	Train-time	Download
mask_rcnn_R50FPN_CLIP_sigmoid	12.2	19.4	26.4	20.9	4.5h	model
LVIS_OVD_Base_RKD	15.2	20.2	27.3	22.1	4.5h	model
LVIS_OVD_Base_PIS	17.0	21.2	26.1	22.4	5h	model
LVIS_OVD_RKD_PIS	17.3	20.9	25.5	22.1	5h	model
LVIS_OVD_RKD_PIS_WeightTransfer	17.2	21.5	26.6	22.8	5h	model
LVIS_OVD_RKD_PIS_WeightTransfer_8x	21.1	25.0	29.1	25.9	1.5 days	model

t-SNE plots

Training and Evaluation

To train or evaluate, first prepare the required datasets.

To train a model, run the following command with the corresponding config file.

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

Note: Some training runs are initialized from Supervised-base or RKD models. Download the corresponding pretrained models and place them under $object-centric-ovd/saved_models/.

To evaluate a pretrained model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
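
When several configs need to be trained or evaluated in sequence, it can help to assemble these commands programmatically. The following is a minimal sketch that only rebuilds the two command lines shown above; `build_command` is a hypothetical helper and not part of the repository, and the paths are the same placeholders used in this README.

```python
import shlex


def build_command(config_file, num_gpus=8, weights=None):
    """Assemble the train_net.py argument list used in this README.

    Hypothetical helper for illustration. When `weights` is given, the
    command switches to evaluation via --eval-only and MODEL.WEIGHTS.
    """
    cmd = [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
    ]
    if weights is not None:
        # Evaluation mode: flag first, then the config override and its value.
        cmd += ["--eval-only", "MODEL.WEIGHTS", weights]
    return cmd


train_cmd = build_command("/path/to/config/name.yaml")
eval_cmd = build_command("/path/to/config/name.yaml", weights="/path/to/weight.pth")

# Print the commands instead of running them, so they can be inspected first.
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

The returned lists can be passed to `subprocess.run(cmd, check=True)` from the repository root once the datasets and any required pretrained models are in place.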

Citation

If you use our work, please consider citing:

@inproceedings{Hanoona2022Bridging,
    title={Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection},
    author={Rasheed, Hanoona and Maaz, Muhammad and Khattak, Muhammad Uzair  and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={36th Conference on Neural Information Processing Systems (NeurIPS)},
    year={2022}
}
    
@inproceedings{Maaz2022Multimodal,
      title={Class-agnostic Object Detection with Multi-modal Transformer},
      author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
      booktitle={17th European Conference on Computer Vision (ECCV)},
      year={2022},
      organization={Springer}
}

Contact

If you have any questions, please create an issue on this repository or contact hanoona.bangalath@mbzuai.ac.ae or muhammad.maaz@mbzuai.ac.ae.

References

Our RKD and PIS methods utilize the MViT model Multiscale Attention ViT with Late fusion (MAVL), proposed in the work Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022). Our code is based on the Detic repository. We thank the authors for releasing their code.
