chakrabortyrajatsubhra / Video-ChatGPT

"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Video-ChatGPT 🎥 💬

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
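
Once the steps above are done, a quick sanity check can confirm that the interpreter and import path look as expected. The following is a minimal sketch, not part of the repository: the helper `check_environment` is hypothetical, and the package name `video_chatgpt` is an assumption about the top-level package in the cloned repo (adjust it if the layout differs).

```python
import importlib.util
import sys


def check_environment(package="video_chatgpt", expected=(3, 10)):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `package` is an assumed top-level package name, `expected` the (major, minor)
    Python version requested when the conda environment was created.
    """
    problems = []

    # The conda environment above was created with python=3.10.
    if sys.version_info[:2] != expected:
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in expected)
        problems.append(f"Python {wanted} expected, found {found}")

    # find_spec returns None when the package cannot be located on sys.path,
    # which usually means PYTHONPATH does not include the repository root.
    if importlib.util.find_spec(package) is None:
        problems.append(f"package '{package}' is not importable; check PYTHONPATH")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it from the repository root after the `export PYTHONPATH` line; no output means both checks passed.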

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
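
To confirm that the build above landed in the active environment, the installed version can be compared against the checked-out tag. This is a hedged sketch: the helper names are hypothetical, and the distribution name `flash-attn` is an assumption about how the package registers itself, so pass a different `dist_name` if the lookup fails on a working install.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def flash_attention_status(expected="1.0.7", dist_name="flash-attn"):
    """Summarize whether the assumed distribution is installed at the expected version.

    `expected` mirrors the v1.0.7 tag checked out above; `dist_name` is an assumption.
    """
    found = installed_version(dist_name)
    if found is None:
        return "missing"
    if found == expected:
        return "ok"
    return f"version mismatch: found {found}, expected {expected}"


if __name__ == "__main__":
    print("FlashAttention:", flash_attention_status())
```

A result of `missing` right after a seemingly successful build often means the install ran under a different Python than the `video_chatgpt` environment.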

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.


Training 🚋

For training instructions, check out train_video_chatgpt.md.


Video Instruction Dataset for ADL:

If you would like the dataset and features, let me know.

Qualitative Analysis 🔍

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks 🎥

[Image: sample1]


Creative and Generative Tasks 🖌️

[Image: sample5]


Spatial Understanding 🌐

[Image: sample8]


Video Understanding and Conversational Tasks 💬

[Image: sample10]


Action Recognition 🏃

[Image: sample22]


Question Answering Tasks ❓

[Image: sample14]


Temporal Understanding ⏳

[Image: sample18]


License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

About

License: Creative Commons Attribution 4.0 International


Languages

Python 99.4%, Shell 0.6%