Understanding Complex Videos Relying on Large Language and Vision Models
[Project Page] [Paper] [Demo]
The online demo is no longer available; we have released the code for deploying the demo offline instead.
Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo*, Ziwang Zhao*, Min Yang* (*Equal Contribution)
Generated by stablecog with the prompt "A cute llama with valley"
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
- [7/5] 🫧 Released the training code for Valley and uploaded our pretraining data.
- [6/21] 🫧 Uploaded the offline demo code.
- [6/14] 🫧 Built a share-link [demo].
- [6/13] 🫧 Uploaded the model weights of Valley-13b-v1-delta.
- [6/12] 🫧 Released Valley: Video Assistant with Large Language model Enhanced abilitY. Check out the paper.
- Release inference code
- Upload weight of Valley-v1 and build a share link demo
- Upload offline demo code
- Release 703k pretraining data and 40k instruction tuning data
- Upload pretrain and tuning code
- Upload weight of Valley-GLM-6B and Valley-v3
- Clone this repository and navigate to the Valley folder
git clone https://github.com/RupertLuo/Valley.git
cd Valley
- Install the package
conda create -n valley python=3.10 -y
conda activate valley
pip install --upgrade pip
pip install -e .
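After installation, you can quickly verify that the environment works. The snippet below is only a minimal sanity check, assuming PyTorch is installed as a dependency and that the editable install exposes the `valley` package:

```python
# Minimal post-install sanity check (assumes PyTorch was installed as a
# dependency and that `pip install -e .` makes the `valley` package importable).
import torch
import valley  # noqa: F401  # fails here if the editable install did not work

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```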
We release Valley delta weights to comply with the LLaMA model license. You can apply these delta weights to the original LLaMA weights by following the instructions below:
- Get the original LLaMA weights in the Hugging Face format by following the instructions here.
- Use the following script to get the Valley weights by applying our delta (13b-v1).
python3 valley/model/apply_delta.py \
--base /path/to/llama-13b \
--target /output/path/to/Valley-13B-v1 \
--delta /path/to/valley-13b-v1-delta
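Conceptually, the delta checkpoint stores the difference between the fine-tuned Valley weights and the base LLaMA weights, so merging simply adds the base parameters back. The sketch below only illustrates the idea; it is not the released `apply_delta.py`, and it assumes Hugging Face-format checkpoints with matching parameter names (the real script also handles Valley's custom model classes):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch of delta-weight merging, not the released script.
base = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama-13b", torch_dtype=torch.float16)
target = AutoModelForCausalLM.from_pretrained(
    "/path/to/valley-13b-v1-delta", torch_dtype=torch.float16)

# Add each base tensor to the matching delta tensor; parameters that exist
# only in the delta (e.g. new multimodal modules) are kept as-is.
base_state = base.state_dict()
with torch.no_grad():
    for name, param in target.named_parameters():
        if name in base_state and param.shape == base_state[name].shape:
            param += base_state[name]

target.save_pretrained("/output/path/to/Valley-13B-v1")
AutoTokenizer.from_pretrained(
    "/path/to/valley-13b-v1-delta").save_pretrained("/output/path/to/Valley-13B-v1")
```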
The framework of this web UI comes from LLaVA and FastChat; we modified part of the code so that the demo supports video as well as image input.
Launch a controller:
python valley/serve/controller.py
Launch a model worker:
python valley/serve/model_worker.py --model-path /path/to/valley-13b-v1
PS: At present, only single-GPU mode is supported for loading the model, and at least 30 GB of GPU memory is required, so you need at least a Tesla V100-class graphics card.
Launch the Gradio web server:
python valley/serve/gradio_web_server_video.py --share
Inference CLI
python3 inference/run_valley.py --model-name [PATH TO VALLEY WEIGHT] --video_file [PATH TO VIDEO] --quary [YOUR QUERY ON THE VIDEO]
Inspired by LLaVA, we adopt a two-stage training method. The pre-training stage uses Valley-webvid2M-Pretrain-703K and LLaVA-CC3M-Pretrain-595K. The fine-tuning stage uses LLaVA-instruct-150K, VideoChat-instruct-11K, and Valley-instruct-40K (still being generated and cleaned; Valley-13b-v1 was trained on the first two of these fine-tuning datasets).
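The `--data_path` and `--video_data_path` arguments below point at conversation-style JSON files. The record below is only a plausible example in the LLaVA-style format; the exact field names for video samples (the "video" key and the `<video>` placeholder) are assumptions for illustration, not a specification from this repository:

```python
import json

# Hypothetical record in the LLaVA-style conversation format. The "video" key
# and the "<video>" placeholder token are assumptions for illustration only.
example_record = {
    "id": "example_000001",
    "video": "000001.mp4",  # resolved relative to --video_folder
    "conversations": [
        {"from": "human", "value": "<video>\nDescribe the video concisely."},
        {"from": "gpt", "value": "A llama walks slowly through a green valley."},
    ],
}

# chat.json is expected to be a list of such records.
with open("chat.json", "w") as f:
    json.dump([example_record], f, indent=2)
```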
Pre-training stage:
accelerate launch --main_process_port 6777 \
--config_file ./configs/config.yaml \
valley/train/train_accelerate.py \
--model_name_or_path path/to/vicuna-13b \
--data_path path/to/LLaVA-CC3M-Pretrain-595K/chat.json \
--video_data_path path/to/webvid_703K/chat.json \
--image_folder path/to/LLaVA-CC3M-Pretrain-595K/image/folder \
--video_folder path/to/webvid/video/folder \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--tune_llm_layer none \
--mm_use_im_start_end True \
--bf16 False \
--fp16 True \
--output_dir path/to/save/zero/model \
--num_train_epochs 6 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy no \
--save_strategy steps \
--save_steps 2400 \
--save_total_limit 3 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb \
--fast_epoch False
Fine-tuning stage:
accelerate launch --main_process_port 6777 \
--config_file ./configs/config.yaml \
valley/train/train_accelerate.py \
--model_name_or_path path/to/pretrain/valley/model \
--data_path path/to/llava_instruct_150k.json \
--video_data_path path/to/videochat-11k/chat.json \
--image_folder path/to/llava_instruct_150k/image/folder \
--video_folder path/to/webvid/video/folder \
--tune_mm_mlp_adapter True \
--vision_tower openai/clip-vit-large-patch14 \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--bf16 False \
--fp16 True \
--output_dir path/to/save/model/folder \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy no \
--save_strategy steps \
--save_steps 3000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb \
--fast_epoch False
- LLaVA & MOSS: Thanks to these two repositories for providing high-quality code; our code is based on them.
If this project is helpful to your research, please consider citing our paper as follows:
@misc{luo2023valley,
title={Valley: Video Assistant with Large Language model Enhanced abilitY},
author={Ruipu Luo and Ziwang Zhao and Min Yang and Junwei Dong and Minghui Qiu and Pengcheng Lu and Tao Wang and Zhongyu Wei},
year={2023},
eprint={2306.07207},
archivePrefix={arXiv},
primaryClass={cs.CV}
}