Oryx Video-ChatGPT 🎥 💬

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz*, Hanoona Rasheed*, Salman Khan and Fahad Khan

* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence

Demo	Paper	Demo Clips	Offline Demo	Training	Video Instruction Data	Quantitative Evaluation	Qualitative Analysis

📢 Latest Updates

Jun-08 : Released the training code, offline demo, instructional data and technical report. All the resources including models, datasets and extracted features are available here. 🔥🔥
May-21 : Video-ChatGPT: demo released.

Online Demo 💻

🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥

🔥🔥 Or click the image to try the demo! 🔥🔥 All the videos shown in the demo are available here.

Video-ChatGPT Overview 💡

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.

Contributions 🏆

We introduce 100K high-quality video-instruction pairs together with a novel, scalable annotation framework that generates a diverse range of video-specific instruction sets.
We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models.
Unique multimodal (vision-language) capability combining video understanding and language generation that is comprehensively evaluated using quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
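
After installation, a quick check can confirm that the interpreter version and the key packages are visible from the activated environment. The sketch below is a hypothetical helper, not part of the repository. It assumes that `torch` is installed through requirements.txt and relies on `flash_attn` being the import name of the FlashAttention package.

```python
import importlib.util
import sys

# Hypothetical helper for illustration; not shipped with Video-ChatGPT.
# Assumption: Python 3.10 or newer is acceptable (the conda env above uses 3.10).
REQUIRED_PYTHON = (3, 10)


def check_environment(modules=("torch", "flash_attn")):
    """Report whether the Python version and the given modules are available.

    Modules are located with find_spec, so nothing is actually imported.
    """
    report = {"python_ok": sys.version_info[:2] >= REQUIRED_PYTHON}
    for name in modules:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'found' if ok else 'MISSING'}")
```

A `MISSING` entry for `flash_attn` only matters for training, since FlashAttention is listed above as a training dependency.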

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.

Training 🚋

For training instructions, check out train_video_chatgpt.md.

Video Instruction Dataset 📂

We are releasing the dataset of 100,000 high-quality video-instruction pairs that was used to train our Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available at VideoInstructionDataset.md.

Quantitative Evaluation 📊

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

Zero-Shot Question-Answer Evaluation

Model	MSVD Accuracy	MSVD Score	MSRVTT Accuracy	MSRVTT Score	TGIF Accuracy	TGIF Score	ActivityNet Accuracy	ActivityNet Score
FrozenBiLM	32.2	--	16.8	--	41.0	--	24.7	--
Video Chat	56.3	2.8	45.0	2.5	34.4	2.3	26.5	2.2
Video-ChatGPT	64.9	3.3	49.3	2.8	51.4	3.0	35.2	2.7
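
To make the comparison in the table above easier to read, the following sketch computes per-dataset accuracy gains and the mean accuracy of a model. The numbers are copied from the table. The dictionary layout and function names are illustrative assumptions and are not part of the repository's evaluation code.

```python
# Accuracy values copied from the zero-shot question-answer table above.
ACCURACY = {
    "FrozenBiLM": {"MSVD": 32.2, "MSRVTT": 16.8, "TGIF": 41.0, "ActivityNet": 24.7},
    "Video Chat": {"MSVD": 56.3, "MSRVTT": 45.0, "TGIF": 34.4, "ActivityNet": 26.5},
    "Video-ChatGPT": {"MSVD": 64.9, "MSRVTT": 49.3, "TGIF": 51.4, "ActivityNet": 35.2},
}


def accuracy_gain(model, baseline, table=ACCURACY):
    """Per-dataset accuracy difference in percentage points (model minus baseline)."""
    return {
        dataset: round(table[model][dataset] - table[baseline][dataset], 1)
        for dataset in table[model]
    }


def mean_accuracy(model, table=ACCURACY):
    """Unweighted mean accuracy of a model across the four datasets."""
    values = list(table[model].values())
    return round(sum(values) / len(values), 2)


if __name__ == "__main__":
    for baseline in ("FrozenBiLM", "Video Chat"):
        print(baseline, accuracy_gain("Video-ChatGPT", baseline))
    print("mean:", mean_accuracy("Video-ChatGPT"))
```

The mean is unweighted, so it treats the four benchmarks equally regardless of their size; it is a reading aid rather than a reported metric.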

Video-based Generative Performance Benchmarking

Evaluation Aspect	Video Chat	Video-ChatGPT
Correctness of Information	2.25	2.50
Detail Orientation	2.50	2.57
Contextual Understanding	2.54	2.69
Temporal Understanding	1.98	2.16
Consistency	1.84	2.20

Qualitative Analysis 🔍

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks 🎥

Creative and Generative Tasks 🖌️

Spatial Understanding 🌐

Video Understanding and Conversational Tasks 💬

Action Recognition 🏃

Question Answering Tasks ❓

Temporal Understanding ⏳

Acknowledgements 🙏

LLaMA: A great attempt towards open and efficient LLMs!
Vicuna: Has amazing language capabilities!
LLaVA: Our architecture is inspired by LLaVA.
Thanks to our colleagues at MBZUAI for their essential contribution to the video annotation task, including Salman Khan, Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, Bokang Jia, without whom this project would not have been possible.

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

    @article{Maaz2023VideoChatGPT,
        title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
        author={Muhammad Maaz and Hanoona Rasheed and Salman Khan and Fahad Khan},
        journal={ArXiv 2306.05424},
        year={2023}
    }

License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.
