05/06/2024: The second version of our manuscript has been accepted by TPAMI.
Please stay tuned: this repo will be updated on a weekly basis.
- Add implementation details for each method, such as template prompts vs. learnable prompts, CLIP text encoder vs. BERT, initialization of the image encoder, etc.
In this survey, we cover two settings (zero-shot and open-vocabulary) and six tasks (object detection, semantic/instance/panoptic segmentation, 3D scene understanding, and video understanding). Our taxonomy pivots on two questions: whether weak supervision signals are permitted, and how they are used. This makes it universal across these diverse settings and tasks. The weak supervision signals can be image-text pairs or large vision-language models. Below is a general overview of each methodology.
In the current literature, zero-shot and open-vocabulary are used interchangeably; however, we note their subtle differences by tracing the evolution from the traditional zero-shot setting to the newly formulated open-vocabulary setting.
- Zero-Shot Object Detection
- Zero-Shot Segmentation
- Open-Vocabulary Object Detection
- Open-Vocabulary Segmentation
- Open-Vocabulary 3D Scene Understanding
- Open-Vocabulary Video Understanding
- Acknowledgement
Venue | Paper Abbr | Project |
---|---|---|
ECCV'18 | ZSDv1 | N/A |
ACCV'18 & IJCV'20 | ZSDv2 | N/A |
AAAI'20 | CA-ZSR | Code |
AAAI'19 | ZSD-TD | N/A |
ACCV'20 | BLC | Code |
ICCV'19 | TL-ZSD | N/A |
arXiv'23 | SSB | N/A |
WACV'20 | MS-Zero | N/A |
TCSVT'19 | ZS-YOLO | N/A |
AAAI'21 | DPIF | Code |
TPAMI'21 | ContrastZSD | N/A |
IJCAI'20 | ZSD-CNN | N/A |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'20 | DELO | N/A |
ACCV'20 | SU | Code |
AAAI'20 | GTNet | Code |
CVPR'22 | RRFS | Code |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'20 | SPNet | Code |
NeurIPS'20 | ULZSS | Code |
ICCV'21 | JoEm | Code |
ICCVW'19 | VM | N/A |
ICCV'21 | PMOSR | N/A |
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'19 | ZS3Net | Code |
NeurIPS'20 | CSRL | N/A |
MM'20 | CaGNet | Code |
ICCV'21 | SIGN | Code |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'21 | ZSIS | Code |
Venue | Paper Abbr | Project | Text Encoder | Prompt | Image Backbone (w/ init. method) | Detector |
---|---|---|---|---|---|---|
CVPR'21 | OVR-CNN | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
GCPR'22 | LocOv | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
arXiv'23 | MMC-Det | N/A | BERT | ❌ | R50 (N/A) | Faster R-CNN/CenterNetv2 |
NeurIPS'22 | DetCLIP | N/A | FILIP | T (cat+def) | Swin | ATSS |
CVPR'23 | DetCLIPv2 | N/A | FILIP | T (cat+def) | Swin | ATSS |
CVPR'24 | DetCLIPv3 | N/A | FILIP | T (cat+def) | Swin | DETR-like |
AAAI'24 | WSOVOD | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN |
CVPR'23 | RO-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
ICCV'23 | CFM-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
ICCV'23 | DITO | Code | CLIP | T (cat) | ViT (CLIP, ALIGN, DataComp-1B) | Faster R-CNN |
ICLR'23 | VLDet | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN/CenterNetv2 |
ICCV'23 | GOAT | N/A | CLIP | T (cat) | R50 (IN-1K/RegionCLIP) | Faster R-CNN/CenterNetv2 |
ECCV'22 | OV-DETR | Code | CLIP | T (cat) | R50 (N/A) | Def-DETR |
arXiv'23 | Prompt-OVD | N/A | CLIP | T (cat) | ViTDet (IN-1K) | Def-DETR |
CVPR'23 | CORA | N/A | CLIP | T (cat) | R50 (N/A) | SAM-DETR/CenterNetv2 |
ICCV'23 | EdaDet | Code | CLIP | T (cat) | ||
ICCV'21 | MDETR | Code | ||||
ECCV'22 | MAVL | Code | ||||
NeurIPS'23 | MQ-Det | Code | ||||
CVPR'24 | YOLO-World | Code | ||||
MM'23 | SGDN | N/A | RoBERTa | ❌ | ||
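
The Text Encoder and Prompt columns above record how each method turns category names into classifier weights. As a rough illustration only (not the implementation of any listed method), the sketch below fills a fixed template with category names, encodes each prompt, and labels a region embedding by cosine similarity. It assumes that `T (cat)` denotes a template prompt filled with the category name. The template string, `toy_text_encoder`, and all embedding values are hypothetical stand-ins; a real system would use a pretrained text encoder such as CLIP or BERT and a real image backbone.

```python
import math

# Hypothetical prompt template; each method defines its own.
TEMPLATE = "a photo of a {}"


def build_prompts(categories):
    """Fill the fixed template with each category name."""
    return [TEMPLATE.format(name) for name in categories]


def toy_text_encoder(prompt, dim=8):
    """Stand-in for a pretrained text encoder (e.g. CLIP or BERT).

    Maps a string to a unit vector from its character codes.
    It is deterministic but carries no real semantics.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, categories):
    """Score a region embedding against one text embedding per category."""
    text_embeddings = [toy_text_encoder(p) for p in build_prompts(categories)]
    scores = [cosine(region_embedding, t) for t in text_embeddings]
    best = max(range(len(categories)), key=scores.__getitem__)
    return categories[best], scores


categories = ["cat", "dog", "zebra"]
# Pretend the image encoder produced an embedding that coincides
# with the text embedding of the "zebra" prompt.
region = toy_text_encoder(TEMPLATE.format("zebra"))
label, scores = classify_region(region, categories)
print(label)  # zebra
```

Because the category list is just a Python list of strings, new categories can be added at inference time without retraining, which is the property that the fixed-template setup is meant to illustrate.
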
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
CVPR'22 | RegionCLIP | Code | CLIP | T (cat) |
ECCV'22 | VL-PLM | Code | ||
CVPR'22 | GLIP | Code | ||
NeurIPS'22 | GLIPv2 | Code | ||
arXiv'23 | Grounding-DINO | Code | ||
ECCV'22 | PromptDet | Code | CLIP | L (cat+desc) |
arXiv'23 | SAS-Det | Code | CLIP | T (cat) |
ECCV'22 | PB-OVD | Code | CLIP | T (cat) |
AAAI'24 | CLIM | Code | CLIP | T (cat) |
arXiv'22 | VTP-OVD | N/A | CLIP | T (cat) |
AAAI'24 | ProxyDet | Code | CLIP | T (cat) |
NeurIPS'23 | CoDet | Code | CLIP | T (cat) |
ECCV'22 | Detic | Code | CLIP | T (cat) |
ICML'23 | MMC | Code | CLIP | GPT-3 |
arXiv'23 | 3Ways | N/A | CLIP | T (cat) |
arXiv'23 | PLAC | N/A | CLIP | T (cat) |
arXiv'23 | PCL | N/A | ||
NeurIPS'23 | OWLv2 | Code | ||
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
ICLR'22 | ViLD | Code | CLIP | T (cat) |
ICDMW'22 | ZSD-YOLO | Code | CLIP | T (cat+desc) |
WACV'24 | LP-OVOD | Code | CLIP | T (cat) |
arXiv'23 | EZSD | Code | CLIP | T (cat) |
AAAI'24 | SIC-CADS | Code | CLIP | T (cat) |
CVPR'23 | BARON | Code | CLIP | T (cat) |
CVPR'23 | OADP | Code | CLIP | T (cat) |
arXiv'23 | GridCLIP | N/A | ||
NeurIPS'22 | RKDWTF | Code | CLIP | T (cat) |
ICCV'23 | DK-DETR | Code | CLIP | T (cat) |
CVPR'22 | HierKD | Code | CLIP | T (cat/desc) |
CVPR'22 | DetPro | Code | CLIP | L (cat) |
arXiv'23 | CLIPSelf | Code | CLIP | T (cat) |
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
ECCV'22 | OWL-ViT | Code | CLIP | T (cat) |
CVPR'23 | UniDetector | Code | ||
ICLR'23 | F-VLM | Code | CLIP | T (cat) |
CVPR'23 | ScaleDet | N/A | ||
ICCV'23 | OpenSeed | Code | ||
arXiv'23 | DRR | N/A | CLIP | T (cat) |
arXiv'23 | Sambor | Code | ||
Venue | Paper Abbr | Project |
---|---|---|
ECCV'22 | OpenSeg | N/A |
arXiv'23 | SLIC | N/A |
CVPR'22 | GroupViT | Code |
ECCV'22 | ViL-Seg | N/A |
ICML'23 | SegCLIP | Code |
CVPR'23 | OVSegmentor | Code |
CVPR'23 | PACL | N/A |
CVPR'23 | TCL | Code |
ECCV'22 | SimSeg | Code |
Venue | Paper Abbr | Project |
---|---|---|
ECCV'22 | TTD | N/A |
Venue | Paper Abbr | Project |
---|---|---|
arXiv'23 | GKC | N/A |
arXiv'23 | SAM-CLIP | N/A |
ICCV'23 | ZeroSeg | Code |
Venue | Paper Abbr | Project |
---|---|---|
ICLR'22 | LSeg | Code |
CVPR'23 | SAZS | Code |
MM'23 | CEL | N/A |
CVPR'22 | ZegFormer | Code |
NeurIPS'22 | ReCo | Project |
arXiv'23 | SCAN | N/A |
ECCV'22 | ZSSeg | Code |
ECCV'22 | MaskCLIP | Code |
arXiv'23 | CLIP-DINOiser | Code |
PRCV'23 | MVP-SEG | N/A |
arXiv'23 | OVDiff | Project |
WACV'24 | FOSSIL | N/A |
NeurIPS'23 | POMP | Code |
NeurIPS'23 | AttrSeg | N/A |
arXiv'23 | PnP-OVSS | Code |
arXiv'23 | TagAlign | Project |
arXiv'23 | SelfSeg | N/A |
CVPR'22 | DenseCLIP | Code |
CVPR'23 | OVSeg | Code |
arXiv'23 | CAT-Seg | Code |
arXiv'23 | SED | Code |
NeurIPS'23 | MAFT | Code |
arXiv'23 | TagCLIP | N/A |
CVPR'23 | ZegCLIP | Code |
CVPR'22 | CLIPSeg | Code |
CVPR'23 | SAN | Code |
arXiv'23 | CLIP Surgery | Code |
arXiv'23 | CaR | Project |
Venue | Paper Abbr | Project |
---|---|---|
ICCV'23 | CGG | Code |
CVPR'23 | D2Zero | Code |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'23 | XPM | Code |
CVPR'23 | Mask-free OVIS | Code |
arXiv'23 | MosaicFusion | Code |
Venue | Paper Abbr | Project |
---|---|---|
arXiv'24 | OV-SAM | Code |
Venue | Paper Abbr | Project |
---|---|---|
arXiv'24 | Uni-OVSeg | Code |
CVPR'23 | X-Decoder | Code |
CVPR'24 | APE | Code |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'23 | PADing | Code |
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'23 | FC-CLIP | Code |
CVPR'23 | FreeSeg | Project |
arXiv'24 | PosSAM | Project |
ICCV'23 | MasQCLIP | Project |
CVPR'24 | OMG-Seg | Code |
arXiv'23 | Semantic-SAM | Code |
CVPR'23 | ODISE | Code |
NeurIPS'23 | HIPIE | Code |
ICML'23 | MaskCLIP | Project |
ICCV'23 | OPSNet | N/A |
Venue | Paper Abbr | Project |
---|---|---|
CVPR'23 | OV-3DET | Code |
AAAI'24 | FM-OV3D | Code |
arXiv'23 | OpenSight | N/A |
NeurIPS'23 | CoDA | Code |
arXiv'23 | L3Det | N/A |
Venue | Paper Abbr | Project |
---|---|---|
arXiv'21 | SeCondPoint | N/A |
3DV'21 | 3DGenZ | Code |
CVPR'23 | OpenScene | Project |
CVPR'23 | PLA | Code |
arXiv'23 | RegionPLC | Project |
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'23 | OpenMask3D | Project |
CVPR'24 | MaskClustering | Project |
arXiv'23 | OpenIns3D | Project |
arXiv'23 | Open3DIS | Project |
Venue | Paper Abbr | Project |
---|---|---|
ICCV'23 | OV2Seg | Code |
arXiv'23 | OpenVIS | Code |
arXiv'24 | BriVIS | Code |
If you find our survey helpful, please consider citing our paper:
@article{survey-ovd-ovs,
title={A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future},
author={Chaoyang Zhu and Long Chen},
journal={arXiv preprint arXiv:2307.09220},
year={2023}
}