yunzhongfei / Valley

The official repository of "Video assistant towards large language model makes everything easy"

⛰️Valley: Video Assistant with Large Language model Enhanced abilitY

Understanding Complex Videos Relying on Large Language and Vision Models

[Project Page] [Paper] [Demo]

The online demo is no longer available because we have released the code for offline demo deployment.

Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo*, Ziwang Zhao*, Min Yang* (*Equal Contribution)


Generated by stablecog via "A cute llama with valley"

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes.

Release

  • [7/5] 🫧 Released the training code for Valley and uploaded our pretraining data.
  • [6/21] 🫧 Uploaded the offline demo code.
  • [6/14] 🫧 Built a share link [demo].
  • [6/13] 🫧 Uploaded the model weights of Valley-13b-v1-delta.
  • [6/12] 🫧 Released Valley: Video Assistant with Large Language model Enhanced abilitY. Check out the paper.

Todo

  • Release inference code
  • Upload weight of Valley-v1 and build a share link demo
  • Upload offline demo code
  • Release 703k pretraining data and 40k instruction tuning data
  • Upload pretrain and tuning code
  • Upload weight of Valley-GLM-6B and Valley-v3

Install

  1. Clone this repository and navigate to the Valley folder
git clone https://github.com/RupertLuo/Valley.git
cd Valley
  2. Install the package
conda create -n valley python=3.10 -y
conda activate valley
pip install --upgrade pip 
pip install -e .

Valley Weight

We release Valley as delta weights to comply with the LLaMA model license. You can apply the delta to the original LLaMA weights by following the instructions below:

  1. Get the original LLaMA weights in the Hugging Face format by following the instructions here.
  2. Use the following scripts to get Valley weights by applying our delta (13b-v1).

Valley 13b v1

python3 valley/model/apply_delta.py \
    --base /path/to/llama-13b \
    --target /output/path/to/Valley-13B-v1 \
    --delta /path/to/valley-13b-v1-delta
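
To make the idea of a delta release concrete, here is a minimal sketch of what applying a delta usually means: each target parameter is the base parameter plus the corresponding delta. It is not the code of `valley/model/apply_delta.py`. Two things are assumptions for illustration only: tensors are modelled as plain lists of floats, and parameters that exist only in the delta (for example, newly added modules) are copied unchanged.

```python
# Illustrative sketch only; the real script works on full model checkpoints.
# Assumption: a "tensor" is a flat list of floats keyed by parameter name.

def apply_delta(base, delta):
    """Return target weights: base + delta where both exist, delta otherwise."""
    target = {}
    for name, delta_values in delta.items():
        if name not in base:
            # Assumption: parameters absent from the base are stored in full.
            target[name] = list(delta_values)
            continue
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise addition recovers the fine-tuned weight.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Hypothetical parameter names and values, chosen only for the example.
base = {"layer.weight": [0.5, -1.0], "layer.bias": [0.25]}
delta = {
    "layer.weight": [0.25, 0.5],
    "layer.bias": [-0.25],
    "new_module.weight": [1.5],
}
target = apply_delta(base, delta)
print(target)
```

Because only the difference is published, the delta is useless without the original LLaMA weights, which is how this release style respects the LLaMA license.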

Web UI


The framework of this web UI comes from LLaVA and FastChat; we modified part of the code so that the demo supports video and image input.

Launch a controller

python valley/serve/controller.py

Launch a model worker

python valley/serve/model_worker.py --model-path /path/to/valley-13b-v1

Note: At present, the model can only be loaded on a single GPU, and at least 30 GB of GPU memory is required, so you need a Tesla V100 (32 GB) or a larger card.
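
The memory requirement above can be sanity-checked with a back-of-the-envelope estimate. The sketch below only counts the language model weights and assumes a nominal 13 billion parameters stored in half precision (2 bytes each); both are assumptions, and the vision encoder, activations, and generation cache come on top of this figure.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed just to hold the weights, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Assumption: nominal 13B parameters, fp16 storage (2 bytes per parameter).
llm_fp16_gib = weight_memory_gib(13e9, 2)
print(f"LLM weights in fp16: {llm_fp16_gib:.1f} GiB")
```

The weights alone come to roughly 24 GiB, which is why a 30 GB budget on one card is a reasonable floor.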

Launch a Gradio demo

python valley/serve/gradio_web_server_video.py --share

Run Valley Inference from the Command Line

Inference CLI

python3 inference/run_valley.py --model-name [PATH TO VALLEY WEIGHT] --video_file [PATH TO VIDEO] --quary [YOUR QUERY ON THE VIDEO]
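
Queries usually contain spaces, so they must be quoted when typed in a shell. As a hedged sketch, the command can instead be assembled as an argument list in Python, which avoids quoting mistakes. The helper name `build_inference_command` and the example paths are hypothetical; the flag names, including the spelling of `--quary`, are copied verbatim from the command above.

```python
import shlex


def build_inference_command(model_path, video_path, query):
    """Build the argument list for inference/run_valley.py (hypothetical helper)."""
    return [
        "python3", "inference/run_valley.py",
        "--model-name", model_path,
        "--video_file", video_path,
        "--quary", query,  # spelling taken from the README command
    ]


command = build_inference_command(
    "/path/to/Valley-13B-v1",
    "/path/to/video.mp4",
    "What is happening in this video?",
)
# shlex.join shows the shell-safe form of the same command.
print(shlex.join(command))
# To execute it from the repository root:
#   import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` hands each argument to the script unchanged, so no manual escaping of the query is needed.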

Train Valley Step By Step

Inspired by LLaVA, we adopt a two-stage training method. The pre-training stage uses Valley-webvid2M-Pretrain-703K and LLaVA-CC3M-Pretrain-595K. The fine-tuning stage uses LLaVA-instruct-150K, VideoChat-instruct-11K, and Valley-instruct-40K. Valley-instruct-40K is still being generated and cleaned, so Valley-13b-v1 was trained on the first two fine-tuning datasets only.

Pretrain

accelerate launch --main_process_port 6777 \
    --config_file ./configs/config.yaml \
    valley/train/train_accelerate.py \
    --model_name_or_path path/to/vicuna-13b \
    --data_path path/to/LLaVA-CC3M-Pretrain-595K/chat.json \
    --video_data_path path/to/webvid_703K/chat.json \
    --image_folder path/to/LLaVA-CC3M-Pretrain-595K/image/folder \
    --video_folder path/to/webvid/video/folder \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --tune_llm_layer none \
    --mm_use_im_start_end True \
    --bf16 False \
    --fp16 True \
    --output_dir path/to/save/zero/model \
    --num_train_epochs 6 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2400 \
    --save_total_limit 3 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb \
    --fast_epoch False
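
To get a feel for what these pretraining flags imply, the following sketch estimates the effective batch size, the number of optimizer steps, the warmup length, and how many checkpoints get written. It is a rough planning aid, not output from the training script. The GPU count is an assumption (it is not stated in this README), the sample counts are approximations read off the dataset names, and the sketch assumes both datasets are seen once per epoch.

```python
import math


def training_plan(num_samples, per_device_batch, num_gpus, grad_accum,
                  epochs, warmup_ratio, save_steps):
    """Derive step counts from batch-size and schedule settings."""
    effective_batch = per_device_batch * num_gpus * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "checkpoints_written": total_steps // save_steps,
    }


plan = training_plan(
    num_samples=703_000 + 595_000,  # approximate, from the dataset names
    per_device_batch=16,            # --per_device_train_batch_size
    num_gpus=8,                     # assumption: not stated in this README
    grad_accum=1,                   # --gradient_accumulation_steps
    epochs=6,                       # --num_train_epochs
    warmup_ratio=0.03,              # --warmup_ratio
    save_steps=2400,                # --save_steps
)
for key, value in plan.items():
    print(f"{key}: {value}")
```

With `--save_total_limit 3`, only the three most recent of those checkpoints would remain on disk at any time.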

Finetune

accelerate launch --main_process_port 6777 \
    --config_file ./configs/config.yaml \
    valley/train/train_accelerate.py \
    --model_name_or_path path/to/pretrain/valley/model \
    --data_path path/to/llava_instruct_150k.json \
    --video_data_path path/to/videochat-11k/chat.json \
    --image_folder path/to/llava_instruct_150k/image/folder \
    --video_folder path/to/webvid/video/folder \
    --tune_mm_mlp_adapter True \
    --vision_tower openai/clip-vit-large-patch14 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 False \
    --fp16 True \
    --output_dir path/to/save/model/folder \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 3000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb \
    --fast_epoch False 
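
Both stages combine `--warmup_ratio 0.03` with `--lr_scheduler_type cosine`. The sketch below shows the usual shape of such a schedule: a linear ramp to the peak learning rate, followed by a half-cosine decay to zero. It is written from scratch for illustration and is not taken from the training code; the total step count of 1000 is a hypothetical value.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical, for illustration only
PEAK_LR = 2e-5      # --learning_rate of the fine-tuning stage
for step in (0, 15, 30, 500, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS, PEAK_LR, 0.03):.2e}")
```

The pretraining stage uses the same shape with a much larger peak (`2e-3`), which is plausible because only the projection adapter is tuned there.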

Acknowledgement

  • LLaVA & MOSS: Thanks to these two repositories for providing high-quality code; our code is based on them.

Citation

If this project is helpful to your research, please consider citing our paper as follows:

@misc{luo2023valley,
      title={Valley: Video Assistant with Large Language model Enhanced abilitY}, 
      author={Ruipu Luo and Ziwang Zhao and Min Yang and Junwei Dong and Minghui Qiu and Pengcheng Lu and Tao Wang and Zhongyu Wei},
      year={2023},
      eprint={2306.07207},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
