
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.


Awesome-LLMs-for-Video-Understanding

🔥Video Understanding with Large Language Models: A Survey

Paper | Project Page


Table of Contents

😎 Vid-LLMs: Models


🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Title | Date | Code | Data | Venue |
|-------|------|------|------|-------|

Captioning and Description

Grounding and Retrieval

Question Answering

Contributing

We welcome contributions to this repository. You can submit pull requests to add new papers, projects, and other helpful materials, or to correct any errors you find. Please make sure new entries follow the "Title | Model | Date | Code | Venue" table format used above. Thank you for your valuable contributions!
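For reference, a new entry added to one of the tables above might look like the following sketch. The paper title, model name, date, and link here are placeholders for illustration only, not a real entry:

```markdown
| Title                                      | Model      | Date    | Code                 | Venue |
|--------------------------------------------|------------|---------|----------------------|-------|
| Example Paper: A Video LLM for Some Task   | ExampleVid | 01/2024 | [code](link-to-repo) | arXiv |
```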

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
      url={http://arxiv.org/abs/2312.17432}
}
