
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.


Awesome-LLMs-for-Video-Understanding

🔥Video Understanding with Large Language Models: A Survey

Paper | Project Page


Table of Contents

😎 Vid-LLMs: Models


🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Title | Date | Code | Data | Venue |
|-------|------|------|------|-------|

Captioning and Description

Grounding and Retrieval

Question Answering

Contributing

We welcome contributions to this repository. You can submit pull requests to add new papers, projects, and other helpful materials, or to correct any errors you find. Please make sure new entries follow the "Title | Model | Date | Code | Venue" table format used above. Thank you for your valuable contributions!
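For reference, a new entry added to one of the tables above might look like the following sketch. The paper title, model name, date, and link here are placeholders for illustration only, not a real entry:

```markdown
| Title                                      | Model      | Date    | Code                 | Venue |
|--------------------------------------------|------------|---------|----------------------|-------|
| Example Paper: A Video LLM for Some Task   | ExampleVid | 01/2024 | [code](link-to-repo) | arXiv |
```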

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
      url={http://arxiv.org/abs/2312.17432}
}
