If you find this project helpful, please consider giving it a star ⭐.
The model is trained on fully supervised semantic segmentation datasets with pixel-level annotations (e.g., the COCO-Stuff dataset).
- [LSeg] | ICLR'22 | Language-driven Semantic Segmentation |
[pdf]
|[code]
- [OpenSeg] | ECCV'22 | Scaling Open-vocabulary Image Segmentation with Image-level Labels |
[pdf]
|[code]
- [Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model |
[pdf]
|[code]
- [SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [MaskCLIP] | ICML'23 | Open-Vocabulary Universal Image Segmentation with MaskCLIP |
[pdf]
|[code]
- [OVSeg] | CVPR'23 | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP |
[pdf]
|[code]
- [X-Decoder] | CVPR'23 | Generalized Decoding for Pixel, Image, and Language |
[pdf]
|[code]
- [SAN] | CVPR'23(Highlight) | Side Adapter Network for Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [SAN] | TPAMI'23 | SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation |
[pdf]
|[code]
- [ODISE] | CVPR'23 | Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models |
[pdf]
|[code]
- [FreeSeg] | CVPR'23 | FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation |
[pdf]
|[code]
- [CAT-Seg] | Arxiv'23.03 | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [OpenSeeD] | ICCV'23 | A Simple Framework for Open-Vocabulary Segmentation and Detection |
[pdf]
|[code]
- [GKC] | ICCV'23 | Global Knowledge Calibration for Fast Open-Vocabulary Segmentation |
[pdf]
- [OPSNet] | ICCV'23 | Open-vocabulary Panoptic Segmentation with Embedding Modulation |
[pdf]
|[code]
- [MasQCLIP] | ICCV'23 | MasQCLIP for Open-Vocabulary Universal Image Segmentation |
[pdf]
- [DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network |
[pdf]
|[code]
- [Li et al.] | ICCV'23 | Open-vocabulary Object Segmentation with Diffusion Models |
[pdf]
|[code]
- [HIPIE] | NeurIPS'23 | Hierarchical Open-vocabulary Universal Image Segmentation |
[pdf]
|[code]
- [FC-CLIP] | NeurIPS'23 | Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP |
[pdf]
|[code]
- [MAFT] | NeurIPS'23 | Learning Mask-aware CLIP Representations for Zero-Shot Segmentation |
[pdf]
|[code]
- [Dao et al.] | TMM | Class Enhancement Losses with Pseudo Labels for Open-Vocabulary Semantic Segmentation |
[pdf]
- [ADA] | Arxiv'23.09 | Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation |
[pdf]
- [SED] | Arxiv'23.11 | SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation |
[pdf]
- [SELF-SEG] | Arxiv'23.12 | Self-Guided Open-Vocabulary Semantic Segmentation |
[pdf]
- [SCAN] | Arxiv'23.12 | Open-Vocabulary Segmentation with Semantic-Assisted Calibration |
[pdf]
|[code]
- [OpenSD] | Arxiv'23.12 | OpenSD: Unified Open-Vocabulary Segmentation and Detection |
[pdf]
|[code]
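Many of the fully supervised methods above (e.g., OpenSeg, OVSeg, FC-CLIP) share a two-stage mask-classification recipe: class-agnostic mask proposals are pooled into region embeddings and matched against text embeddings of the candidate class names. A minimal numpy sketch of that matching step (all names and shapes are illustrative assumptions, not any paper's actual API):

```python
import numpy as np

def classify_masks(pixel_feats, masks, text_embeds):
    """Assign an open-vocabulary label to each mask proposal.

    pixel_feats: (H, W, D) dense image features (hypothetical backbone output)
    masks:       (N, H, W) binary class-agnostic mask proposals
    text_embeds: (C, D) L2-normalized text embeddings, one per class name
    Returns:     (N,) index of the best-matching class per mask.
    """
    labels = []
    for m in masks:
        # Average-pool the dense features inside the mask, then L2-normalize.
        feat = pixel_feats[m.astype(bool)].mean(axis=0)
        feat /= np.linalg.norm(feat) + 1e-8
        # Cosine similarity against every class-name embedding; take the best.
        labels.append(int(np.argmax(text_embeds @ feat)))
    return np.array(labels)
```

Because the classifier is just a set of text embeddings, swapping the class vocabulary at test time requires no retraining, which is what makes the setting "open-vocabulary".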
[text-supervised/language-supervised] The model is trained on weakly supervised datasets with only image-level annotations/captions (e.g., the CC12M dataset).
- [GroupViT] | CVPR'22 | GroupViT: Semantic Segmentation Emerges from Text Supervision |
[pdf]
|[code]
- [ViL-Seg] | ECCV'22 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding |
[pdf]
- [MaskCLIP+] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP |
[pdf]
|[code]
- [ViewCo] | ICLR'23 | ViewCo: Discovering Text-supervised Segmentation Masks via Multi-view Semantic Consistency |
[pdf]
- [SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [CLIP-S4] | CVPR'23 | CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation |
[pdf]
- [PACL] | CVPR'23 | Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning |
[pdf]
- [OVSegmentor] | CVPR'23 | Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision |
[pdf]
|[code]
- [SimSeg] | CVPR'23 | A Simple Framework for Text-Supervised Semantic Segmentation |
[pdf]
|[code]
- [TCL] | CVPR'23 | Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs |
[pdf]
|[code]
- [SimCon] | Arxiv'23.02 | SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation |
[pdf]
- [Zhang et al.] | Arxiv'23.04 | Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation |
[pdf]
- [ZeroSeg] | ICCV'23 | Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only |
[pdf]
- [CLIPpy] | ICCV'23 | Perceptual Grouping in Contrastive Vision-Language Models |
[pdf]
- [MixReorg] | ICCV'23 | MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation |
[pdf]
- [CoCu] | NeurIPS'23 | Bridging Semantic Gaps for Language-Supervised Semantic Segmentation |
[pdf]
|[code]
- [PGSeg] | NeurIPS'23 | Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [SAM-CLIP] | Arxiv'23.10 | SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding |
[pdf]
- [CLIP-DINOiser] | Arxiv'23.12 | CLIP-DINOiser: Teaching CLIP a few DINO tricks |
[pdf]
|[code]
- [TagAlign] | Arxiv'23.12 | TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification |
[pdf]
|[code]
- [S-Seg] | Arxiv'24.01 | Exploring Simple Open-Vocabulary Semantic Segmentation |
[pdf]
|[code]
- [CLIPSelf] | ICLR'24(Spotlight) | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction |
[pdf]
|[code]
- [Uni-OVSeg] | Arxiv'24.02 | Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision |
[pdf]
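The common thread in the text-supervised setting is a CLIP-style contrastive objective that aligns each image with its own caption at the batch level; segmentation then emerges from the learned grouping rather than from pixel labels. A minimal numpy sketch of the symmetric InfoNCE loss behind that objective (the 0.07 temperature and the shapes are illustrative assumptions):

```python
import numpy as np

def contrastive_loss(img_embeds, txt_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    img_embeds, txt_embeds: (B, D), assumed L2-normalized; row i of each matrix
    is a matching image-caption pair, every other row is a negative.
    """
    logits = img_embeds @ txt_embeds.T / temperature  # (B, B) similarity matrix
    targets = np.arange(len(logits))                  # matching pairs on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the correct class per row.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero; if every embedding collapses to the same vector, the loss sits at log(B), which is why these methods need large, diverse caption batches.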
The model is adapted from off-the-shelf large models (e.g., CLIP, diffusion models) without an additional training phase.
- [MaskCLIP] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP |
[pdf]
|[code]
- [ReCo] | NeurIPS'22 | ReCo: Retrieve and Co-segment for Zero-shot Transfer |
[pdf]
|[code]
- [CLIP Surgery] | Arxiv'23.04 | CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks |
[pdf]
|[code]
- [OVDiff] | Arxiv'23.06 | Diffusion Models for Zero-Shot Open-Vocabulary Segmentation |
[pdf]
- [DiffSegmenter] | Arxiv'23.09 | Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter |
[pdf]
- [CLIP-DIY] | WACV'24 | CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free |
[pdf]
- [IPSeg] | Arxiv'23.10 | Towards Training-free Open-world Segmentation via Image Prompting Foundation Models |
[pdf]
- [PnP-OVSS] | Arxiv'23.11 | Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models |
[pdf]
- [SCLIP] | Arxiv'23.12 | SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference |
[pdf]
- [GEM] | Arxiv'23.12 | Grounding Everything: Emerging Localization Properties in Vision-Language Transformers |
[pdf]
|[code]
- [CaR] | Arxiv'23.12 | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor |
[pdf]
|[code]
- [FOSSIL] | WACV'24 | FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval |
[pdf]
- [TagCLIP] | AAAI'24 | TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training |
[pdf]
|[code]
- [Zhou et al.] | Arxiv'23.11 | Rethinking Evaluation Metrics of Open-Vocabulary Segmentation |
[pdf]
|[code]
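The training-free recipe shared by several entries above (notably MaskCLIP) treats the class-name text embeddings as a fixed 1x1 classifier applied densely to frozen vision-language features, so segmentation reduces to a per-pixel argmax over cosine similarities. A minimal numpy sketch under that reading (the feature source and shapes are illustrative assumptions):

```python
import numpy as np

def dense_zero_shot_segment(pixel_feats, text_embeds):
    """Per-pixel open-vocabulary labeling with no training at all.

    pixel_feats: (H, W, D) dense features from a frozen vision-language model
                 (hypothetical; real methods extract these from CLIP's last layer)
    text_embeds: (C, D) L2-normalized class-name text embeddings
    Returns:     (H, W) predicted class index per pixel.
    """
    # L2-normalize the dense features so the dot product is cosine similarity.
    feats = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    # Similarity of every pixel to every class name, then a per-pixel argmax.
    sims = np.einsum('hwd,cd->hwc', feats, text_embeds)
    return sims.argmax(axis=-1)
```

Much of the work in this line of research goes into making the frozen features spatially faithful (e.g., modifying CLIP's final self-attention), since the classification step itself is this simple.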
Different from open-vocabulary segmentation (cross-dataset), zero-shot methods split each dataset into seen and unseen classes.
- [ZegFormer] | CVPR'22 | ZegFormer: Decoupling Zero-Shot Semantic Segmentation |
[pdf]
|[code]
- [Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model |
[pdf]
|[code]
- [ZegCLIP] | CVPR'23 | ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation |
[pdf]
|[code]
- [PADing] | CVPR'23 | Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation |
[pdf]
|[code]
- [DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network |
[pdf]
|[code]
- [SPT] | AAAI'24 | Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation |
[pdf]
|[code]
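Zero-shot methods in this setting are typically reported with separate mIoU on the seen and unseen class splits plus their harmonic mean (hIoU), which penalizes models that sacrifice unseen-class accuracy for seen-class accuracy. A small numpy helper for the metric (the input format is an illustrative assumption):

```python
import numpy as np

def harmonic_miou(class_ious, seen_mask):
    """Harmonic mean of seen/unseen mIoU, the standard zero-shot metric (hIoU).

    class_ious: (C,) per-class IoU scores
    seen_mask:  (C,) boolean, True where the class was seen during training
    """
    class_ious = np.asarray(class_ious, dtype=float)
    seen_mask = np.asarray(seen_mask, dtype=bool)
    miou_seen = class_ious[seen_mask].mean()
    miou_unseen = class_ious[~seen_mask].mean()
    # Harmonic mean: dominated by the weaker of the two splits.
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)
```

For example, 0.8 seen mIoU with 0.4 unseen mIoU yields an hIoU of about 0.53, well below the 0.6 arithmetic mean.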
- [CARIS] | ACM MM'23 | CARIS: Context-Aware Referring Image Segmentation |
[pdf]
|[code]
- [BKINet] | TMM'23 | Bilateral Knowledge Interaction Network for Referring Image Segmentation |
[pdf]
|[code]
- [Group-RES] | ICCV'23 | Advancing Referring Expression Segmentation Beyond Single Image |
[pdf]
|[code]
- [RIS-DMMI] | ICCV'23 | Beyond One-to-One: Rethinking the Referring Image Segmentation |
[pdf]
|[code]
- [ETRIS] | ICCV'23 | Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation |
[pdf]
|[code]
- [SEEM] | Arxiv'23.04 | Segment Everything Everywhere All at Once |
[pdf]
|[code]
- [Kim et al.] | ICCV'23 | Shatter and Gather: Learning Referring Image Segmentation with Text Supervision |
[pdf]
|[code]
- [Liu et al.] | ICCV'23 | Referring Image Segmentation Using Text Supervision |
[pdf]
|[code]
- [RO-ViT] | CVPR'23(Highlight) | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers |
[pdf]
|[code]
- [CAT] | CVPR'23 | CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection |
[pdf]
|[code]
- [DetCLIPv2] | CVPR'23 | DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment |
[pdf]
- [CondHead] | CVPR'23 | Learning to Detect and Segment for Open Vocabulary Object Detection |
[pdf]
- [CORA] | CVPR'23 | CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching |
[pdf]
|[code]
- [ovdet] | CVPR'23 | Aligning Bag of Regions for Open-Vocabulary Object Detection |
[pdf]
|[code]
- [OADP] | CVPR'23 | Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection |
[pdf]
|[code]
- [F-VLM] | ICLR'23 | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models |
[pdf]
|[code]
- [mm-ovod] | ICML'23 | Multi-Modal Classifiers for Open-Vocabulary Object Detection |
[pdf]
|[code]
- [SGDN] | Arxiv'23.07 | Open-Vocabulary Object Detection via Scene Graph Discovery |
[pdf]
- [MMC-Det] | Arxiv'23.08 | Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection |
[pdf]
- [IPL] | Arxiv'23.08 | Improving Pseudo Labels for Open-Vocabulary Object Detection |
[pdf]
- [DITO] | Arxiv'23.09 | Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection |
[pdf]
|[code]
- [EdaDet] | ICCV'23 | EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment |
[pdf]
|[code]
- [LP-OVOD] | WACV'24 | LP-OVOD: Open-Vocabulary Object Detection by Linear Probing |
[pdf]
|[code]
- [DST-Det] | Arxiv'23.10 | DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection |
[pdf]
- [CoDet] | NeurIPS'23 | CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection |
[pdf]
|[code]
- [PLAC] | Arxiv'23.12 | Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection |
[pdf]
- [Sambor] | Arxiv'23.12 | Boosting Segment Anything Model Towards Open-Vocabulary Learning |
[pdf]
|[code]
- [DVDet] | ICLR'24 | LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors |
[pdf]
- [Semantic-SAM] | Arxiv'23.10 | Semantic-SAM: Segment and Recognize Anything at Any Granularity |
[pdf]
|[code]
- [Open-Vocabulary SAM] | Arxiv'24.01 | Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively |
[pdf]
|[code]
- [OMG-Seg] | Arxiv'24.01 | OMG-Seg: Is One Model Good Enough For All Segmentation? |
[pdf]
|[code]
- Towards Open Vocabulary Learning: A Survey |
[pdf]
- A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future |
[pdf]
If you have any suggestions or find missing papers, please don't hesitate to contact me via lydyc@mail.ustc.edu.cn.