Awesome-LLMs-for-Video-Understanding

Table of Contents

Video Understanding
Video Generation
Dataset
Evaluation

Video Understanding

Title	Date	Code	Data	Venue
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding	06/2023	code	-
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning	06/2023	code	-
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	04/2023	code	-
Garbage in, garbage out: Zero-shot detection of crime using Large Language Models	07/2023	code	-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	07/2023	code	-	-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	07/2023	code	-	-
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	06/2023	code	-
VALLEY: Video Assistant with Large Language model Enhanced abilitY	06/2023	code	-
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	06/2023	code	-
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models	06/2023	-	-
MIMIC-IT: Multi-Modal In-Context Instruction Tuning	06/2023	code	-
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	06/2023	code	-	-
FunQA: Towards Surprising Video Comprehension	06/2023	code	-	-
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	05/2023	code	-	-
VideoChat: Chat-Centric Video Understanding	05/2023	code	demo
VideoLLM: Modeling Video Sequence with Large Language Models	05/2023	code	-
Self-Chained Image-Language Model for Video Localization and Question Answering	05/2023	code	-
A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot	05/2023	-	-
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction	05/2023	-	-
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering	04/2023	-	-
VLog: Video as a Long Document	04/2023	demo	-
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions	04/2023	code	-
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	04/2023	project page	-
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos	03/2023	code	-
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering	03/2023	code	-
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	02/2023	code	-
Learning Video Representations from Large Language Models	12/2022	code	-
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners	05/2022	code	-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	04/2022	project page	-

Video Generation

Title	Date	Code	Data	Venue
NExT-GPT: Any-to-Any Multimodal LLM	09/2023	code	-
Generative Pretraining in Multimodality	07/2023	code	-

Dataset

Title	Date	Code	Data
VidChapters-7M: Video Chapters at Scale	09/2023	code	-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	07/2023	code	-
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	04/2023	code	-
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	06/2023	code	-
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	05/2023	code	-

Evaluation

Title	Date	Code	Data
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction	05/2023	code	-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	07/2023	code	-
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	06/2023	code	-
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation	11/2023	code	-
VLM-Eval: A General Evaluation on Video Large Language Models	11/2023	-	-
