VTimeLLM [Paper]

Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

πŸ“’ Latest Updates

  • Dec-14: Released the training code and data. All the resources including models, datasets and extracted features are available here. πŸ”₯πŸ”₯
  • Dec-4: VTimeLLM demo released.

VTimeLLM Overview πŸ’‘

VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.

VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents.

*(Framework figure)*


Contributions πŸ†

  • We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.
  • We propose the boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporal-related single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialog dataset for better temporal reasoning ability.
  • We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs in various fine-grained temporal-related video tasks, showing its superior ability for video understanding and reasoning.

Installation πŸ”§

We recommend setting up a conda environment for the project:

conda create --name=vtimellm python=3.10
conda activate vtimellm

git clone https://github.com/huangb23/VTimeLLM.git
cd VTimeLLM
pip install -r requirements.txt

Additionally, install the following packages if you plan to train the model:

pip install ninja
pip install flash-attn --no-build-isolation
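After installation, a quick sanity check can confirm that the key packages are importable before running the demo or training. This is a minimal sketch, not part of the official repo; the import names `torch` and `flash_attn` are assumptions based on the dependencies installed above (`flash_attn` is only needed for training).

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names to verify after installation; `flash_attn` is only required
# for the optional training setup described above.
required = ["torch", "flash_attn"]
print(missing_packages(required))  # an empty list means the environment is ready
```

If the printed list is non-empty, re-run the corresponding `pip install` step inside the activated `vtimellm` environment.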

Running Demo Offline πŸ’Ώ

To run the demo offline, please refer to the instructions in offline_demo.md.

Training πŸš‹

For training instructions, check out train.md.

Qualitative Analysis πŸ”

A comprehensive evaluation of VTimeLLM's performance across multiple tasks.

Video Understanding and Conversational Tasks πŸ’¬



Creative Tasks πŸ–ŒοΈ



Fine-grained Understanding Tasks 🌐



Video Reasoning Tasks ❓



Acknowledgements πŸ™

We are grateful for the following awesome projects, from which VTimeLLM arises:

  • LLaVA: Large Language and Vision Assistant
  • FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  • LLaMA: Open and Efficient Foundation Language Models
  • Vid2seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
  • InternVid: A Large-scale Video-Text dataset

If you're using VTimeLLM in your research or applications, please cite using this BibTeX:

@article{huang2023vtimellm,
  title={VTimeLLM: Empower LLM to Grasp Video Moments},
  author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
  journal={arXiv preprint arXiv:2311.18445},
  year={2023}
}

License πŸ“œ


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟
