3d-scene-understanding instance-segmentation object-detection open-vocabulary panoptic-segmentation semantic-segmentation video-understanding zero-shot

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Chaoyang Zhu, Long Chen^*

News

05/06/2024: The second version of our manuscript has been accepted by TPAMI.

✨ PRs are welcome!

Stay tuned: this repo is maintained on a weekly basis.

Todo

Add implementation details for each method, such as template prompts vs. learnable prompts, CLIP text encoder vs. BERT, and initialization of the image encoder.

General Overview

In this survey, we cover two settings (zero-shot and open-vocabulary) and six tasks (object detection, semantic/instance/panoptic segmentation, 3D scene understanding, and video understanding). We build the taxonomy around two questions, whether weak supervision signals are permitted and how they are used, so that it applies uniformly across these diverse settings and tasks. The weak supervision signals can be image-text pairs or large vision-language models. Below is a general overview of each methodology.

In the current literature, zero-shot and open-vocabulary are often used interchangeably. However, we note their subtle differences by tracing the evolution from the traditional zero-shot setting to the newly formulated open-vocabulary setting.
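As a rough illustration of how the taxonomy is organized, the sketch below encodes the settings, tasks, and methodologies as a nested dictionary. The strings are taken from this README's section headings; the dictionary layout and the helper names `methodologies` and `tasks_using` are hypothetical and not part of the survey. 3D scene understanding and video understanding are left out because this README does not subdivide them by methodology.

```python
# Hypothetical encoding of the taxonomy. The keys and strings mirror this
# README's section headings; the data layout itself is an assumption.
TAXONOMY = {
    "zero-shot": {
        "object detection": [
            "Visual-Semantic Space Mapping",
            "Novel Visual Feature Synthesis",
        ],
        "semantic segmentation": [
            "Visual-Semantic Space Mapping",
            "Novel Visual Feature Synthesis",
        ],
        # This README lists no methodology subdivision for this task.
        "instance segmentation": [],
    },
    "open-vocabulary": {
        "object detection": [
            "Region-Aware Training",
            "Pseudo-Labeling",
            "Knowledge Distillation",
            "Transfer Learning",
        ],
        "semantic segmentation": [
            "Region-Aware Training",
            "Pseudo-Labeling",
            "Knowledge Distillation",
            "Transfer Learning",
        ],
        "instance segmentation": [
            "Region-Aware Training",
            "Pseudo-Labeling",
            "Knowledge Distillation",
        ],
        "panoptic segmentation": [
            "Region-Aware Training",
            "Knowledge Distillation",
            "Transfer Learning",
        ],
    },
}


def methodologies(setting, task):
    """Return the methodology names listed for one (setting, task) pair."""
    return TAXONOMY[setting][task]


def tasks_using(methodology):
    """Return sorted (setting, task) pairs whose list contains the methodology."""
    return sorted(
        (setting, task)
        for setting, tasks in TAXONOMY.items()
        for task, names in tasks.items()
        if methodology in names
    )
```

For example, `tasks_using("Pseudo-Labeling")` returns the open-vocabulary detection, semantic segmentation, and instance segmentation pairs, which matches the sections of this list that contain a Pseudo-Labeling table.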

Table of Contents

- Zero-Shot Object Detection
  - Visual-Semantic Space Mapping
  - Novel Visual Feature Synthesis
- Zero-Shot Segmentation
  - Zero-Shot Semantic Segmentation
    - Visual-Semantic Space Mapping
    - Novel Visual Feature Synthesis
  - Zero-Shot Instance Segmentation
- Open-Vocabulary Object Detection
- Open-Vocabulary Segmentation
- Open-Vocabulary 3D Scene Understanding
  - Open-Vocabulary 3D Detection
  - Open-Vocabulary 3D Segmentation
    - Open-Vocabulary 3D Semantic Segmentation
    - Open-Vocabulary 3D Instance Segmentation
- Open-Vocabulary Video Understanding
  - Open-Vocabulary Video Instance Segmentation
- Acknowledgement

Zero-Shot Object Detection

Visual-Semantic Space Mapping

Venue	Paper Abbr	Project
ECCV'18	ZSDv1	N/A
ACCV'18 & IJCV'20	ZSDv2	N/A
AAAI'20	CA-ZSR	Code
AAAI'19	ZSD-TD	N/A
ACCV'20	BLC	Code
ICCV'19	TL-ZSD	N/A
arXiv'23	SSB	N/A
WACV'20	MS-Zero	N/A
TCSVT'19	ZS-YOLO	N/A
AAAI'21	DPIF	Code
TPAMI'21	ContrastZSD	N/A
IJCAI'20	ZSD-CNN	N/A

Novel Visual Feature Synthesis

Venue	Paper Abbr	Project
CVPR'20	DELO	N/A
ACCV'20	SU	Code
AAAI'20	GTNet	Code
CVPR'22	RRFS	Code

Zero-Shot Segmentation

Zero-Shot Semantic Segmentation

Visual-Semantic Space Mapping

Venue	Paper Abbr	Project
CVPR'19	SPNet	Code
NeurIPS'20	ULZSS	Code
ICCV'21	JoEm	Code
ICCVW'19	VM	N/A
ICCV'21	PMOSR	N/A

Novel Visual Feature Synthesis

Venue	Paper Abbr	Project
NeurIPS'19	ZS3Net	Code
NeurIPS'20	CSRL	N/A
MM'20	CaGNet	Code
ICCV'21	SIGN	Code

Zero-Shot Instance Segmentation

Venue	Paper Abbr	Project
CVPR'21	ZSIS	Code

Open-Vocabulary Object Detection

Region-Aware Training

Venue	Paper Abbr	Project	Text Encoder	Prompt	Image Backbone (w/ init. method)	Detector
CVPR'21	OVR-CNN	Code	BERT	❌	R50 (IN-1K)	Faster R-CNN
GCPR'22	LocOv	Code	BERT	❌	R50 (IN-1K)	Faster R-CNN
arXiv'23	MMC-Det	N/A	BERT	❌	R50 (N/A)	Faster R-CNN/CenterNetv2
NeurIPS'22	DetCLIP	N/A	FILIP	T (cat+def)	Swin	ATSS
CVPR'23	DetCLIPv2	N/A	FILIP	T (cat+def)	Swin	ATSS
CVPR'24	DetCLIPv3	N/A	FILIP	T (cat+def)	Swin	DETR-like
AAAI'24	WSOVOD	Code	CLIP	T (cat)	R50 (IN-1K)	Faster R-CNN
CVPR'23	RO-ViT	N/A	CLIP	T (cat)	ViT (ALIGN)	Mask R-CNN
ICCV'23	CFM-ViT	N/A	CLIP	T (cat)	ViT (ALIGN)	Mask R-CNN
ICCV'23	DITO	Code	CLIP	T (cat)	ViT (CLIP, ALIGN, DataComp-1B)	Faster R-CNN
ICLR'23	VLDet	Code	CLIP	T (cat)	R50 (IN-1K)	Faster R-CNN/CenterNetv2
ICCV'23	GOAT	N/A	CLIP	T (cat)	R50 (IN-1K/RegionCLIP)	Faster R-CNN/CenterNetv2
ECCV'22	OV-DETR	Code	CLIP	T (cat)	R50 (N/A)	Def-DETR
arXiv'23	Prompt-OVD	N/A	CLIP	T (cat)	ViTDet (IN-1K)	Def-DETR
CVPR'23	CORA	N/A	CLIP	T (cat)	R50 (N/A)	SAM-DETR/CenterNetv2
ICCV'23	EdaDet	Code	CLIP	T (cat)
ICCV'21	MDETR	Code
ECCV'22	MAVL	Code
NeurIPS'23	MQ-Det	Code
CVPR'24	YOLO-World	Code
MM'23	SGDN	N/A	RoBERTa	❌

Pseudo-Labeling

Venue	Paper Abbr	Project	Text Encoder	Prompt
CVPR'22	RegionCLIP	Code	CLIP	T (cat)
ECCV'22	VL-PLM	Code
CVPR'22	GLIP	Code
NeurIPS'22	GLIPv2	Code
arXiv'23	Grounding-DINO	Code
ECCV'22	PromptDet	Code	CLIP	L (cat+desc)
arXiv'23	SAS-Det	Code	CLIP	T (cat)
ECCV'22	PB-OVD	Code	CLIP	T (cat)
AAAI'24	CLIM	Code	CLIP	T (cat)
arXiv'22	VTP-OVD	N/A	CLIP	T (cat)
AAAI'24	ProxyDet	Code	CLIP	T (cat)
NeurIPS'23	CoDet	Code	CLIP	T (cat)
ECCV'22	Detic	Code	CLIP	T (cat)
ICML'23	MMC	Code	CLIP	GPT-3
arXiv'23	3Ways	N/A	CLIP	T (cat)
arXiv'23	PLAC	N/A	CLIP	T (cat)
arXiv'23	PCL	N/A
NeurIPS'23	OWLv2	Code

Knowledge Distillation

Venue	Paper Abbr	Project	Text Encoder	Prompt
ICLR'22	ViLD	Code	CLIP	T (cat)
ICDMW'22	ZSD-YOLO	Code	CLIP	T (cat+desc)
WACV'24	LP-OVOD	Code	CLIP	T (cat)
arXiv'23	EZSD	Code	CLIP	T (cat)
AAAI'24	SIC-CADS	Code	CLIP	T (cat)
CVPR'23	BARON	Code	CLIP	T (cat)
CVPR'23	OADP	Code	CLIP	T (cat)
arXiv'23	GridCLIP	N/A
NeurIPS'22	RKDWTF	Code	CLIP	T (cat)
ICCV'23	DK-DETR	Code	CLIP	T (cat)
CVPR'22	HierKD	Code	CLIP	T (cat/desc)
CVPR'22	DetPro	Code	CLIP	L (cat)
arXiv'23	CLIPSelf	Code	CLIP	T (cat)

Transfer Learning

Venue	Paper Abbr	Project	Text Encoder	Prompt
ECCV'22	OWL-ViT	Code	CLIP	T (cat)
CVPR'23	UniDetector	Code
ICLR'23	F-VLM	Code	CLIP	T (cat)
CVPR'23	ScaleDet	N/A
ICCV'23	OpenSeed	Code
arXiv'23	DRR	N/A	CLIP	T (cat)
arXiv'23	Sambor	Code

Open-Vocabulary Segmentation

Open-Vocabulary Semantic Segmentation

Region-Aware Training

Venue	Paper Abbr	Project
ECCV'22	OpenSeg	N/A
arXiv'23	SLIC	N/A
CVPR'22	GroupViT	Code
ECCV'22	ViL-Seg	N/A
ICML'23	SegCLIP	Code
CVPR'23	OVSegmentor	Code
CVPR'23	PACL	N/A
CVPR'23	TCL	Code
ECCV'22	SimSeg	Code

Pseudo-Labeling

Venue	Paper Abbr	Project
ECCV'22	TTD	N/A

Knowledge Distillation

Venue	Paper Abbr	Project
arXiv'23	GKC	N/A
arXiv'23	SAM-CLIP	N/A
ICCV'23	ZeroSeg	Code

Transfer Learning

Venue	Paper Abbr	Project
ICLR'22	LSeg	Code
CVPR'23	SAZS	Code
MM'23	CEL	N/A
CVPR'22	ZegFormer	Code
NeurIPS'22	ReCo	Project
arXiv'23	SCAN	N/A
ECCV'22	ZSSeg	Code
ECCV'22	MaskCLIP	Code
arXiv'23	CLIP-DINOiser	Code
PRCV'23	MVP-SEG	N/A
arXiv'23	OVDiff	Project
WACV'24	FOSSIL	N/A
NeurIPS'23	POMP	Code
NeurIPS'23	AttrSeg	N/A
arXiv'23	PnP-OVSS	Code
arXiv'23	TagAlign	Project
arXiv'23	SelfSeg	N/A
CVPR'22	DenseCLIP	Code
CVPR'23	OVSeg	Code
arXiv'23	CAT-Seg	Code
arXiv'23	SED	Code
NeurIPS'23	MAFT	Code
arXiv'23	TagCLIP	N/A
CVPR'23	ZegCLIP	Code
CVPR'22	CLIPSeg	Code
CVPR'23	SAN	Code
arXiv'23	CLIP Surgery	Code
arXiv'23	CaR	Project

Open-Vocabulary Instance Segmentation

Region-Aware Training

Venue	Paper Abbr	Project
ICCV'23	CGG	Code
CVPR'23	D2Zero	Code

Pseudo-Labeling

Venue	Paper Abbr	Project
CVPR'23	XPM	Code
CVPR'23	Mask-free OVIS	Code
arXiv'23	MosaicFusion	Code

Knowledge Distillation

Venue	Paper Abbr	Project
arXiv'24	OV-SAM	Code

Open-Vocabulary Panoptic Segmentation

Region-Aware Training

Venue	Paper Abbr	Project
arXiv'24	Uni-OVSeg	Code
CVPR'23	X-Decoder	Code
CVPR'24	APE	Code

Knowledge Distillation

Venue	Paper Abbr	Project
CVPR'23	PADing	Code

Transfer Learning

Venue	Paper Abbr	Project
NeurIPS'23	FC-CLIP	Code
CVPR'23	FreeSeg	Project
arXiv'24	PosSAM	Project
ICCV'23	MasQCLIP	Project
CVPR'24	OMG-Seg	Code
arXiv'23	Semantic-SAM	Code
CVPR'23	ODISE	Code
NeurIPS'23	HIPIE	Code
ICML'23	MaskCLIP	Project
ICCV'23	OPSNet	N/A

Open-Vocabulary 3D Scene Understanding

Open-Vocabulary 3D Detection

Venue	Paper Abbr	Project
CVPR'23	OV-3DET	Code
AAAI'24	FM-OV3D	Code
arXiv'23	OpenSight	N/A
NeurIPS'23	CoDA	Code
arXiv'23	L3Det	N/A

Open-Vocabulary 3D Segmentation

Open-Vocabulary 3D Semantic Segmentation

Venue	Paper Abbr	Project
arXiv'21	SeCondPoint	N/A
3DV'21	3DGenZ	Code
CVPR'23	OpenScene	Project
CVPR'23	PLA	Code
arXiv'23	RegionPLC	Project

Open-Vocabulary 3D Instance Segmentation

Venue	Paper Abbr	Project
NeurIPS'23	OpenMask3D	Project
CVPR'24	MaskClustering	Project
arXiv'23	OpenIns3D	Project
arXiv'23	Open3DIS	Project

Open-Vocabulary Video Understanding

Open-Vocabulary Video Instance Segmentation

Venue	Paper Abbr	Project
ICCV'23	OV2Seg	Code
arXiv'23	OpenVIS	Code
arXiv'24	BriVIS	Code

Acknowledgement

If you find our survey helpful, please consider citing our paper:

@article{survey-ovd-ovs,
    title={A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future},
    author={Chaoyang Zhu and Long Chen},
    journal={arXiv preprint arXiv:2307.09220},
    year={2023}
}

About

Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

https://arxiv.org/abs/2307.09220
