This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Feel free to contact us or open a pull request if you find any related papers that are not included here.
Last update on 2023/8/21
- [ICCV 2023] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models [Paper][Code]
- [ICCV 2023] Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? [Paper][Code]
- [ICCV 2023] PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization [Paper][Code]
- [ICCV 2023] Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [Paper]
- [ICCV 2023] PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation [Paper]
- [ICCVW 2023] AD-CLIP: Adapting Domains in Prompt Space Using CLIP [Paper]
- [arXiv 2023] Improving Pseudo Labels for Open-Vocabulary Object Detection [Paper]
- [ICCV 2023] SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning [Paper][Code]
- [arXiv 2023] ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [Paper]
- [arXiv 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
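As a concrete illustration of the zero-shot prediction described above, the sketch below uses a pre-trained CLIP checkpoint via the Hugging Face `transformers` library to classify an image by matching it against free-form text prompts. The checkpoint name, class names, prompt template, and image path are illustrative assumptions, not part of the survey itself.

```python
# Minimal sketch of zero-shot image classification with a pre-trained VLM (CLIP).
# Assumes the `transformers` and `Pillow` packages are installed and "photo.jpg"
# is any local image; "openai/clip-vit-base-patch32" is only an example checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")

# Class names are wrapped into natural-language prompts, so no task-specific
# classifier head or fine-tuning is needed.
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities learned by
# contrastive pre-training; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

The same pattern underlies most VLM transfer methods surveyed below: prompt tuning methods learn the text prompts instead of hand-crafting them, while adapter-style methods add small trainable modules on top of the frozen image and text encoders.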
If you find our work useful in your research, please consider citing:
@article{zhang2023vision,
title={Vision-Language Models for Vision Tasks: A Survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={arXiv preprint arXiv:2304.00685},
year={2023}
}
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Dataset | Year | Num of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
Paper | Published in | Code/Project |
---|---|---|
FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
SAM: Segment Anything | arXiv 2023 | Code |
SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
Paper | Published in | Code/Project |
---|---|---|
Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
Fine-Grained Visual Prompting | arXiv 2023 | - |
Paper | Published in | Code/Project |
---|---|---|
UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |