
VISA: Reasoning Video Object Segmentation via Large Language Model

[ECCV 2024] Paper: http://arxiv.org/abs/2407.11325

🚀 Performance

VISA demonstrates remarkable proficiency in handling complex segmentation tasks that require: (a) reasoning based on world knowledge; (b) inference of future events; and (c) a comprehensive understanding of video content.

πŸ› οΈ Installation

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
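
As an optional sanity check (not part of the official setup), the following minimal snippet confirms that PyTorch sees a GPU and that flash-attn imports correctly:

# Optional sanity check: confirm PyTorch, CUDA, and flash-attn are usable.
import torch
import flash_attn  # raises ImportError if flash-attn failed to build

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)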

🦄 Training and Validation

1. Training Data Preparation

Before training, please download the datasets below, then configure their paths in dataset_config.py (see the sketch after the directory layouts).

LISA's Dataset

Follow LISA to prepare its datasets. The dataset folders should be placed under $LISA_ROOT as follows.

LISA_ROOT
├── ade20k
├── coco
├── cocostuff
├── llava_dataset
├── mapillary
├── reason_seg
├── refer_seg
└── vlpart
Chat-UniVi's Dataset

Follow Chat-UniVi/Chat-UniVi-Instruct to prepare the Chat-UniVi-Instruct datasets. The dataset folders should be placed under $ChatUniVi_ROOT as follows.

ChatUniVi_ROOT
├── Fine-tuning
│   ├── MIMIC_imageonly
│   └── VIDEO
└── ScienceQA_tuning
RVOS's Dataset
  1. Reasoning Video Segmentation Datasets: ReVOS.
  2. Referring Video Segmentation Datasets: Ref-Youtube-VOS, Ref-DAVIS17, MeViS.
  3. Open-Vocabulary Video Instance Segmentation Dataset: LV-VIS. Download mask_dict.json and meta_expressions.json from OneDrive or BaiduPan, then put the annotation files in the $RVOS_ROOT/lvvis/train directory as follows.
RVOS_ROOT
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
├── lvvis
│   └── train
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Ref-Youtube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       └── JPEGImages
├── davis17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       ├── JPEGImages
│       └── mask_dict.pkl
└── mevis
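
The exact variable names expected by dataset_config.py are defined by the repository; the following is only a minimal, hypothetical sketch, assuming the file simply exposes the dataset root paths (check the actual file before editing):

# Hypothetical sketch of dataset_config.py; the actual variable names in the
# repository may differ -- check the file before editing.
import os

LISA_ROOT = "/data/LISA_ROOT"            # LISA's image datasets
CHATUNIVI_ROOT = "/data/ChatUniVi_ROOT"  # Chat-UniVi-Instruct datasets
RVOS_ROOT = "/data/RVOS_ROOT"            # ReVOS, LV-VIS, Ref-Youtube-VOS, DAVIS17, MeViS

# Example of a derived path a dataloader might use (illustrative only).
REVOS_TRAIN_EXPRESSIONS = os.path.join(
    RVOS_ROOT, "ReVOS", "meta_expressions_train_.json"
)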

2. Pre-trained Weights

Chat-UniVi

To train VISA-7B or VISA-13B, you need to download the Chat-UniVi weights from Chat-UniVi-7B or Chat-UniVi-13B, respectively.

SAM

Download the SAM ViT-H pre-trained weights (sam_vit_h_4b8939.pth) from the official Segment Anything release.
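
If you prefer to fetch the checkpoint from a script, here is a minimal sketch using the standard public URL for sam_vit_h_4b8939.pth (the local path is a placeholder):

# Download the SAM ViT-H checkpoint if it is not already present.
import os
import urllib.request

SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
SAM_PATH = "/PATH/TO/sam_vit_h_4b8939.pth"  # same path later passed to --vision_pretrained

if not os.path.exists(SAM_PATH):
    urllib.request.urlretrieve(SAM_URL, SAM_PATH)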

3. Training VISA

# Training VISA-7B
bash scripts/train_7b.sh 

# Extract fp32 consolidated weights from ZeRO stage 1, 2, or 3 DeepSpeed checkpoints.
cd /PATH/TO/VISA-7B/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

# Merge the LoRA weights and save the Hugging Face model
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version Chat-UniVi/Chat-UniVi \
  --weight /PATH/TO/VISA-7B/pytorch_model.bin \
  --save_path /PATH/TO/VISA-7B/hf_model

4. Validation

1. Using `VISA` to generate the predicted masks for each video [demo]
deepspeed --master_port=24999 train_ds.py \
  --version="/PATH/TO/VISA-7B/hf_model" \
  --vision_pretrained="/PATH/TO/sam_vit_h_4b8939.pth" \
  --log_base_dir="/PATH/TO/LOG_BASE_DIR" \
  --exp_name="val_7b" \
  --balance_sample \
  --dataset="reason_seg" \
  --sample_rates="13" \
  --val_dataset "revos_valid" \
  --eval_only 
2. Using LLaMA-VID to generate the target frame for each video

You can directly download the results of our run from OneDrive or BaiduPan.

  • Run http_server_mp.py to build the API server for LLaMA-VID [demo]

    python utils_llamavid/llamavid_server.py \
        --vision_tower /PATH/TO/eva_vit_g.pth \
        --image_processor /PATH/TO/openai/clip-vit-large-patch14 \
        --model-path /PATH/TO/YanweiLi/llama-vid-13b-full-224-video-fps-1
  • Using the API for inference [demo]

    python utils_llamavid/llamavid_client.py \
        --video_root /PATH/TO/ReVOS/JPEGImages \
        --data_json_file /PATH/TO/ReVOS/meta_expressions_valid_.json
3. Using XMem for mask propagation [demo]
4. Evaluate ReVOS's performance [demo]
cd tools
python eval_revos.py /PATH/TO/FINAL_ANNOTATION [ARGS]

📑 Todo list

  • Release code with Text-guided Frame Sampler's Local Sampling

  • Release VISA model weights

  • Release code with Text-guided Frame Sampler's Global-Local Sampling

⭐ Cite

If you find this project useful in your research, please consider citing:

@article{yan2024visa,
  title={VISA: Reasoning Video Object Segmentation via Large Language Models},
  author={Yan, Cilin and Wang, Haochen and Yan, Shilin and Jiang, Xiaolong and Hu, Yao and Kang, Guoliang and Xie, Weidi and Gavves, Efstratios},
  journal={arXiv preprint arXiv:2407.11325},
  year={2024}
}

πŸŽ–οΈ Acknowledgement

This work is built upon LLaVA, SAM, LISA, Chat-UniVi, MeViS, LLaMA-VID, and XMem.
