
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Zuyan Liu*,1, Yuhao Dong*,1, Yongming Rao2,✉, Jie Zhou1, Jiwen Lu1,✉

1Tsinghua University   2Tencent  * Equal Contribution  ✉ Corresponding Author

Project Page | arXiv Paper | Hugging Face Model

Chain-of-Spot

Chain-of-Spot (CoS) encourages Large Vision-Language Models to identify the key region of interest (ROI) in the image conditioned on the posed question or instruction, and to reason about it in an interactive manner.

This technique gives VLMs access to more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features and improving visual understanding.
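
To make the interactive procedure concrete, here is a minimal sketch in the spirit of Chain-of-Spot (this is not the repository's inference code: `locate_fn`, `answer_fn`, and the normalized box format are placeholder assumptions standing in for the actual LLaVA-1.5 generation calls):

    from typing import Callable, Tuple
    from PIL import Image

    Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

    def chain_of_spot(locate_fn: Callable[[Image.Image, str], Box],
                      answer_fn: Callable[[Image.Image, Image.Image, str], str],
                      image: Image.Image,
                      question: str) -> str:
        """Two-step interactive query in the spirit of Chain-of-Spot.

        locate_fn: asks the LVLM which region of the image the question is
            about and returns a normalized bounding box (hypothetical wrapper,
            not part of this repository).
        answer_fn: answers the question given the full image plus the cropped
            region of interest (also a hypothetical wrapper).
        """
        # Step 1: identify the region of interest conditioned on the question.
        x0, y0, x1, y1 = locate_fn(image, question)

        # Step 2: crop the ROI from the full-resolution image so the model
        # sees finer-grained detail without resizing the original image.
        w, h = image.size
        roi = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

        # Answer the original question with both the global view and the
        # zoomed-in ROI available to the model.
        return answer_fn(image, roi, question)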

Updates

[2024-03]

  1. 🤗 Introducing our project homepage: https://sites.google.com/view/chain-of-spot
  2. 🤗 Check our paper introducing Chain-of-Spot in detail.
  3. 🤗 Check our model on Hugging Face.

Get Started

  1. Environmental Setup: We choose LLaVA-1.5 as our base model. You can run the following commands to set up your environment for Chain-of-Spot evaluation:

    git clone https://github.com/dongyh20/Chain-of-Spot.git
    cd Chain-of-Spot
    conda create -n cos python=3.10 -y
    conda activate cos
    pip install -e .
    

    For Chain-of-Spot fine-tuning from LLaVA-1.5, additionally run:

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    
  2. Initial Weights: We use LLaVA-1.5-7B and LLaVA-1.5-13B for fine-tuning; download these models and place them in the ./checkpoint folder (a download sketch is given after this list).

  3. Download Data: The dataset structure is the same as in LLaVA, and we provide JSON files (see step 4) that convert the original LLaVA training dataset into ours. To download the data correctly, please check the instructions.

    After downloading all of them, organize the data in ./playground/data as follows:

    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
    
  4. Training Data Preparation: We adapt the excellent LRP++ work to detect the ROI corresponding to each question or instruction. You can directly download our generated dataset from Google Drive to reproduce our results, or follow the Notebook to prepare your own data (an illustrative conversion sketch follows this list).

  5. Evaluations on Various Benchmarks: We follow the Evaluation Docs in LLaVA to conduct our experiments. If you find them laborious, check LMMs-Eval for faster evaluation.

  6. Start Training! Fine-tuning takes around 20 hours on 8×A100 (80G) GPUs for LLaVA-1.5-13B. We fine-tune LLaVA-1.5 with DeepSpeed ZeRO-3; launch training directly with: bash ./scripts/v1_5/finetune_CoS_13b.sh
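
For step 2 above, one convenient way to fetch the initial weights is with `huggingface_hub` (a minimal sketch; the repo ids below point to the public LLaVA-1.5 checkpoints and the ./checkpoint layout follows the step above, adjust both to your setup):

    # Sketch: download the LLaVA-1.5 base weights into ./checkpoint.
    from huggingface_hub import snapshot_download

    for repo_id in ("liuhaotian/llava-v1.5-7b", "liuhaotian/llava-v1.5-13b"):
        snapshot_download(
            repo_id=repo_id,
            local_dir=f"./checkpoint/{repo_id.split('/')[-1]}",
        )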
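
For step 4 above, if you build the data yourself rather than downloading the provided files, the preparation conceptually pairs each LLaVA training sample with an ROI box and rewrites it as a two-turn interaction. The sketch below is purely illustrative: the exact prompt wording, JSON fields, and ROI encoding of the released dataset may differ, and the ROI is assumed to come from an LRP++-style relevancy step.

    def to_cos_sample(sample: dict, roi: tuple) -> dict:
        """Illustrative rewrite of one LLaVA-format training sample into a
        two-turn, Chain-of-Spot-style conversation. `roi` is a normalized
        (x0, y0, x1, y1) box from a relevancy method such as LRP++; the field
        names and prompt text are assumptions, not the repository's exact format.
        """
        question = sample["conversations"][0]["value"]
        answer = sample["conversations"][1]["value"]
        x0, y0, x1, y1 = roi
        return {
            "id": sample["id"],
            "image": sample["image"],
            "conversations": [
                # Turn 1: ask the model to locate the region of interest.
                {"from": "human",
                 "value": f"{question}\nTo answer the question, first find the region of interest in the image."},
                {"from": "gpt",
                 "value": f"The region of interest is [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]."},
                # Turn 2: answer the question given the zoomed-in region.
                {"from": "human",
                 "value": "Now answer the question based on the region of interest."},
                {"from": "gpt", "value": answer},
            ],
        }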

Contact: Open an issue or contact liuzuyan19@gmail.com and dongyh20@mails.tsinghua.edu.cn. We will respond as soon as possible.

Quantitative and Qualitative Results

Comparisons with State-of-the-Art Models

Our Chain-of-Spot (CoS) consistently improves vanilla LLaVA-1.5 on all benchmarks across different language model sizes. The best results are highlighted in bold.

| Method | Language | VQA-v2 | GQA | VizWiz | SQA | Text-VQA | OKVQA |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | Vicuna-7B | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 57.9 |
| LLaVA-1.5-7B + CoS | Vicuna-7B | 80.7 | 63.7 | 50.8 | 68.2 | 60.9 | 58.4 |
| LLaVA-1.5-13B | Vicuna-13B | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 60.9 |
| LLaVA-1.5-13B + CoS | Vicuna-13B | **81.8** | **64.8** | **58.0** | **71.9** | **62.4** | **62.9** |

LLaVA-1.5 with Chain-of-Spot (CoS) achieves state-of-the-art performance on all the multimodal benchmarks, surpassing other LVLMs by a large margin. The best results are highlighted in bold.

| Method | Language | SEED | SEED_Img | MME | MMB | POPE | MM-Vet |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | Vicuna-7B | 58.6 | 66.1 | 1510.7 | 64.3 | 85.9 | 30.5 |
| LLaVA-1.5-7B + CoS | Vicuna-7B | 59.7 | 67.1 | 1501.1 | 64.4 | **86.4** | 30.8 |
| LLaVA-1.5-13B | Vicuna-13B | 61.6 | 68.2 | 1531.3 | 67.7 | 85.9 | 35.4 |
| LLaVA-1.5-13B + CoS | Vicuna-13B | **62.3** | **69.6** | **1546.1** | **68.2** | 86.1 | **37.6** |

Visualizations

Visualizations of Chain-of-Spot. Chain-of-Spot identifies reasonable regions of interest conditioned on the given questions.

Generation comparisons after applying Chain-of-Spot. Chain-of-Spot corrects the focus and the answers of the LLaVA model on complex visual question cases.

Citation

If you find this repository useful, please consider citing:

    @article{liu2024chain,
      title={Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models},
      author={Liu, Zuyan and Dong, Yuhao and Rao, Yongming and Zhou, Jie and Lu, Jiwen},
      journal={arXiv preprint arXiv:2403.12966},
      year={2024}
    }

Acknowledgements

We thank the LLaVA team for their great contribution to the open-source VLM community.


License: Apache License 2.0

