
ChatVID

💬 Chat about anything on any video! 🎥

Authors:
Yibin Yan🤝, BUPT    Yiqin Wang🤝, Tsinghua University
Yansong Tang, Tsinghua-Berkeley Shenzhen Institute
(🤝 = equal contribution, names listed alphabetically)
This work was done during Yibin's and Yiqin's internship with Prof. Tang.

Try our demo on 🤗 Hugging Face Spaces

The demo is currently paused. If you would like to try it out, please reach out to Yiqin at wyq1217@outlook.com

Intro to ChatVID

⭐ ChatVID combines the knowledge of Large Language Models with the sensing ability of Vision Models and Audio Models.

⭐ ChatVID demonstrates a powerful capability to talk about anything in a video.

⭐ Please give us a Star! For any questions or suggestions, feel free to drop Yiqin an email at wyq1217@outlook.com or open an issue.

Highlights 🔥

  • πŸ” Leverage the power of Large Language Models, Vision Models, and Audio Models to enable conversations about videos.
  • πŸ€– Utilize Vicuna as the Large Language Model for understanding user queries and responses.
  • πŸ“· Incorporate state-of-the-art Vision Models like BLIP2, GRiT, and Vid2Seq for visual understanding and analysis.
  • 🎀 Employ Whisper as an Audio Model to process audio content within videos.
  • πŸ’¬ Enable users to have conversations and discussions about any aspect of a video.
  • πŸš€ Enhance the overall video-watching experience by providing an interactive and engaging platform.
  • πŸš— ChatVID with Vicuna-7B (8bit) is able to run with a Nvidia GPU with 24G RAM, and 8G CPU RAM.
  • πŸŽ₯ ChatVID needs an extra 10G CPU RAM when using Vid2Seq.
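
As a rough illustration of the 8-bit footprint mentioned above, here is one way to load Vicuna-7B in 8-bit with Hugging Face transformers and bitsandbytes. This is only a sketch, not ChatVID's actual loading code (which lives in the model/ package); the path and flags are assumptions.

# Sketch only: load Vicuna-7B with 8-bit quantization so the weights
# fit in roughly 24 GB of VRAM (requires the bitsandbytes package).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./vicuna-7b"  # folder prepared in "Vicuna Weights" below
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # 8-bit quantization
    device_map="auto",   # place layers on the available GPU
)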

Gradio Example ✨

[Screenshots of two example conversations: "The Temple Of Heaven" and "Cook"]

Install Instructions 💻

pip install -r pre-requirements.txt
pip install -r requirements.txt
pip install -r extra-requirements.txt # optional, only for vid2seq

You will also need to install ffmpeg for Whisper. If Whisper encounters permission errors, you may need to set the environment variable DATA_GYM_CACHE_DIR='/YourRootDir/ChatVID/.cache' (or any other writable cache directory).
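
For reference, here is a minimal sketch of the Whisper transcription step, assuming the openai-whisper package; the "base" model size and the file name are illustrative choices, not ChatVID's actual settings.

# Sketch of the audio step, assuming the openai-whisper package.
import os
os.environ["DATA_GYM_CACHE_DIR"] = "/YourRootDir/ChatVID/.cache"  # writable cache dir

import whisper

model = whisper.load_model("base")      # downloads weights on first use
result = model.transcribe("video.mp4")  # requires ffmpeg on PATH
print(result["text"])                   # transcript fed into the LLM prompt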

Setting Up Checkpoints 📦💼

GRiT Checkpoints 🚀

Download the GRiT checkpoint (grit_b_densecap_objectdet.pth) and put it into the pretrained_models folder.

Vicuna Weights 🦙

ChatVID uses frozen Vicuna 7B and 13B models. Please first follow the instructions to prepare the Vicuna v1.1 weights. Then set vicuna.model_path in the Infer Config to the folder that contains the Vicuna weights, as sketched below.
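
A hypothetical excerpt of the Infer Config (config/infer.yaml): the key vicuna.model_path is documented above, but the exact file layout and path are assumptions for illustration.

# Hypothetical excerpt of config/infer.yaml; layout and path are illustrative.
vicuna:
  model_path: /YourRootDir/ChatVID/vicuna-7b  # folder containing the Vicuna v1.1 weights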

Vid2Seq Checkpoints (Optional) 🎥📊

  1. Prepare the CLIP ViT-L/14 checkpoint used for feature extraction in Vid2Seq. Download the CLIP ViT-L/14 checkpoint and set vid2seq.clip_path in the Infer Config to the checkpoint path. vid2seq.output_path is where the generated TFRecords are stored and can be any writable directory; vid2seq.work_dir is Flax's working directory and can likewise be any writable directory.

  2. Prepare the Vid2Seq ActivityNet checkpoint. Download the Vid2Seq ActivityNet checkpoint, rename it to checkpoint_200001, and set vid2seq.checkpoint_path in the Infer Config to the folder that contains the checkpoint. A combined sketch of these settings follows.
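
Likewise, a hypothetical vid2seq block of config/infer.yaml: the key names come from the two steps above, while the layout and paths (mirroring the file structure below) are illustrative assumptions.

# Hypothetical excerpt of config/infer.yaml; paths are illustrative.
vid2seq:
  clip_path: /YourRootDir/ChatVID/clip_ckpt/ViT-L-14.pt  # CLIP ViT-L/14 checkpoint
  output_path: /YourRootDir/ChatVID/vid2seq_output       # any writable dir for TFRecords
  work_dir: /YourRootDir/ChatVID/vid2seq_workdir         # Flax working directory
  checkpoint_path: /YourRootDir/ChatVID/vid2seq_ckpt     # folder containing checkpoint_200001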

File Structure

ChatVID/
|__config/
    |__...
|__model/
    |__...
|__scenic/
    |__...
|__simclr/
    |__...
|__pretrained_models/
    |__grit_b_densecap_objectdet.pth
|__vicuna-7b/
    |__pytorch_model-00001-of-00002.bin
    |__pytorch_model-00002-of-00002.bin
    |__...
|__vid2seq_ckpt/
    |__checkpoint_200001
|__clip_ckpt/
    |__ViT-L-14.pt
|__app.py
|__README.md
|__pre-requirements.txt
|__requirements.txt
|__extra-requirements.txt
|__LICENSE

Gradio WebUI Usage 🌐

# first change all the absolute paths in config/infer.yaml
python app.py
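
For orientation, here is a minimal hypothetical sketch of the kind of interface app.py builds; the function and component names are placeholders, and the real pipeline (BLIP-2/GRiT/Vid2Seq/Whisper captioning plus Vicuna) is wired up in app.py and model/.

# Hypothetical minimal video-chat UI; NOT the actual app.py.
import gradio as gr

def answer(video_path, question):
    # Stub: the real pipeline turns the video into a textual description,
    # then asks Vicuna the user's question against that description.
    return f"(stub) question about {video_path}: {question}"

with gr.Blocks() as demo:
    video = gr.Video(label="Upload a video")
    question = gr.Textbox(label="Ask anything about the video")
    reply = gr.Textbox(label="ChatVID's answer")
    gr.Button("Chat").click(answer, inputs=[video, question], outputs=reply)

demo.launch()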

Acknowledgment

This work is built on Vicuna, BLIP-2, GRiT, Vid2Seq, and Whisper. Thanks for their great work!
