There are 30 repositories under the vision-and-language topic.
LAVIS - A One-stop Library for Language-Vision Intelligence (see the captioning sketch after this list)
A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!
Code for ALBEF: a new vision-language pre-training method
Multimodal-GPT
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Overview of Japanese LLMs (日本語LLMまとめ)
My Reading Lists of Deep Learning and Natural Language Processing
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
This repository is a curated collection of the most exciting and influential CVPR 2023 papers. 🔥 [Paper + Code]
This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo]
Software for automatic monitoring in online proctoring
[ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation (see the reparameterization sketch after this list)
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
[ECCV 2024] PointLLM: Empowering Large Language Models to Understand Point Clouds
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
A curated list of awesome vision and language resources (still under construction... stay tuned!)
A paper list covering large multi-modality models, parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, intended as a preliminary overview.
A Gradio demo of MGIE
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training (see the loading sketch after this list).
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
Recent Advances in Vision and Language Pre-training (VLP)
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
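
For the LAVIS library listed above, a minimal image-captioning sketch in the style of its README; the `load_model_and_preprocess` entry point and the `blip_caption` / `base_coco` names follow the LAVIS examples and may differ in newer releases.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # pip install salesforce-lavis

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a captioning model together with its matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))
```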
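The DoRA entry describes decomposing a pretrained weight into a magnitude and a direction, with the direction updated by a LoRA-style low-rank term: W' = m · (W0 + BA) / ||W0 + BA|| with a column-wise norm. A minimal PyTorch sketch of that reparameterization is below; the class name, rank, and initialization details are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Sketch of DoRA applied to a frozen linear weight W0 (bias omitted)."""

    def __init__(self, w0: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = w0.shape
        self.w0 = nn.Parameter(w0, requires_grad=False)              # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # low-rank factors B A
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))
        # Learnable magnitude vector, initialized to the column-wise norm of W0.
        self.magnitude = nn.Parameter(w0.norm(p=2, dim=0, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Directional component: W0 + B A, normalized column by column.
        w = self.w0 + self.lora_b @ self.lora_a
        w = w / w.norm(p=2, dim=0, keepdim=True)
        # Rescale each column by its learned magnitude, then apply as a linear map.
        return x @ (self.magnitude * w).t()

# Usage sketch: wrap an existing linear layer's weight.
# layer = nn.Linear(768, 768)
# dora = DoRALinear(layer.weight.data.clone(), rank=8)
```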
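The Conceptual 12M entry notes that the data is distributed as (image-URL, caption) pairs. A small reading sketch, assuming a tab-separated file with one URL and one caption per line; the exact file name and column order in the released dump may differ.

```python
import csv

def iter_cc12m_pairs(tsv_path: str):
    """Yield (image_url, caption) pairs from a tab-separated dump.

    Assumes each row is `<image URL>\t<caption>`; adjust if the release
    you downloaded orders the columns differently.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]

# Example usage (hypothetical file name):
# for url, caption in iter_cc12m_pairs("cc12m.tsv"):
#     print(url, caption)
#     break
```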