# Awesome-LLMs-for-Video-Understanding
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
VLog: Video as a Long Document | VLog | 04/2023 | code | - |
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |
Title | Model | Date | Code | Venue |
---|---|---|---|---|
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |
Title | Model | Date | Code | Venue |
---|---|---|---|---|
VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |
Title | Model | Date | Code | Venue |
---|---|---|---|---|
VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code demo | arXiv |
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | arXiv |
Title | Date | Code | Data | Venue |
---|---|---|---|---|
We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you find. Please make sure your pull requests follow the `Title | Model | Date | Code | Venue` table format. Thank you for your valuable contributions!
If you find our survey useful for your research, please cite the following paper:
@article{vidllmsurvey,
  title={Video Understanding with Large Language Models: A Survey},
  author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
  journal={arXiv preprint arXiv:2312.17432},
  year={2023},
  url={http://arxiv.org/abs/2312.17432}
}