
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Zuyan Liu*,1, Yuhao Dong*,1, Yongming Rao2,✉, Jie Zhou1, Jiwen Lu1,✉

1Tsinghua University   2Tencent  * Equal Contribution  ✉ Corresponding Author

Project Page | arXiv Paper | Hugging Face Model

Chain-of-Spot

Chain-of-Spot (CoS) encourages Large Vision-Language Models to identify the key region of interest (ROI) in the image conditioned on the posed question or instruction, and to reason about it in an interactive manner.

This technique gives VLMs access to more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features and improving visual understanding.
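
To make the interactive procedure concrete, here is a minimal sketch in the spirit of Chain-of-Spot (this is not the repository's inference code: `locate_fn`, `answer_fn`, and the normalized box format are placeholder assumptions standing in for the actual LLaVA-1.5 generation calls):

    from typing import Callable, Tuple
    from PIL import Image

    Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

    def chain_of_spot(locate_fn: Callable[[Image.Image, str], Box],
                      answer_fn: Callable[[Image.Image, Image.Image, str], str],
                      image: Image.Image,
                      question: str) -> str:
        """Two-step interactive query in the spirit of Chain-of-Spot.

        locate_fn: asks the LVLM which region of the image the question is
            about and returns a normalized bounding box (hypothetical wrapper,
            not part of this repository).
        answer_fn: answers the question given the full image plus the cropped
            region of interest (also a hypothetical wrapper).
        """
        # Step 1: identify the region of interest conditioned on the question.
        x0, y0, x1, y1 = locate_fn(image, question)

        # Step 2: crop the ROI from the full-resolution image so the model
        # sees finer-grained detail without resizing the original image.
        w, h = image.size
        roi = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

        # Answer the original question with both the global view and the
        # zoomed-in ROI available to the model.
        return answer_fn(image, roi, question)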

Updates

[2024-03]

  1. 🤗 Introducing our project homepage: https://sites.google.com/view/chain-of-spot
  2. 🤗 Check our paper introducing Chain-of-Spot in detail.
  3. 🤗 Check our model on Hugging Face.

Get Started

  1. Environmental Setup: We choose LLaVA-1.5 as our base model. You can run the following commands to set up your environment for Chain-of-Spot evaluation:

    git clone https://github.com/dongyh20/Chain-of-Spot.git
    cd Chain-of-Spot
    conda create -n cos python=3.10 -y
    conda activate cos
    pip install -e .
    

    For Chain-of-Spot fine-tuning from LLaVA-1.5, additionally run:

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    
  2. Initial Weights: We use LLaVA-1.5-7B and LLaVA-1.5-13B for fine-tuning; download these models and place them in the ./checkpoint folder (a download sketch is given after this list).

  3. Download Data: The dataset structure is the same as in LLaVA, and we provide JSON files (see step 4) that convert the original LLaVA training dataset into ours. To download the data correctly, please check the instructions.

    After downloading all of them, organize the data in ./playground/data as follows:

    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
    
  4. Training Data Preparation: We adapt the excellent LRP++ work to detect the ROI corresponding to each question or instruction. You can directly download our generated dataset from Google Drive to reproduce our results, or follow the Notebook to prepare your own data (an illustrative conversion sketch follows this list).

  5. Evaluations on Various Benchmarks: We follow the Evaluation Docs in LLaVA to conduct our experiments. If you find them laborious, check LMMs-Eval for faster evaluation.

  6. Start Training! Fine-tuning takes around 20 hours on 8×A100 (80G) GPUs for LLaVA-1.5-13B. We fine-tune LLaVA-1.5 with DeepSpeed ZeRO-3; launch training directly with: bash ./scripts/v1_5/finetune_CoS_13b.sh
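
For step 2 above, one convenient way to fetch the initial weights is with `huggingface_hub` (a minimal sketch; the repo ids below point to the public LLaVA-1.5 checkpoints and the ./checkpoint layout follows the step above, adjust both to your setup):

    # Sketch: download the LLaVA-1.5 base weights into ./checkpoint.
    from huggingface_hub import snapshot_download

    for repo_id in ("liuhaotian/llava-v1.5-7b", "liuhaotian/llava-v1.5-13b"):
        snapshot_download(
            repo_id=repo_id,
            local_dir=f"./checkpoint/{repo_id.split('/')[-1]}",
        )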
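
For step 4 above, if you build the data yourself rather than downloading the provided files, the preparation conceptually pairs each LLaVA training sample with an ROI box and rewrites it as a two-turn interaction. The sketch below is purely illustrative: the exact prompt wording, JSON fields, and ROI encoding of the released dataset may differ, and the ROI is assumed to come from an LRP++-style relevancy step.

    def to_cos_sample(sample: dict, roi: tuple) -> dict:
        """Illustrative rewrite of one LLaVA-format training sample into a
        two-turn, Chain-of-Spot-style conversation. `roi` is a normalized
        (x0, y0, x1, y1) box from a relevancy method such as LRP++; the field
        names and prompt text are assumptions, not the repository's exact format.
        """
        question = sample["conversations"][0]["value"]
        answer = sample["conversations"][1]["value"]
        x0, y0, x1, y1 = roi
        return {
            "id": sample["id"],
            "image": sample["image"],
            "conversations": [
                # Turn 1: ask the model to locate the region of interest.
                {"from": "human",
                 "value": f"{question}\nTo answer the question, first find the region of interest in the image."},
                {"from": "gpt",
                 "value": f"The region of interest is [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]."},
                # Turn 2: answer the question given the zoomed-in region.
                {"from": "human",
                 "value": "Now answer the question based on the region of interest."},
                {"from": "gpt", "value": answer},
            ],
        }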

Contact: Open an issue or contact liuzuyan19@gmail.com and dongyh20@mails.tsinghua.edu.cn. We will respond as soon as possible.

Quantitative and Qualitative Results

Comparisons with State-of-the-Art Models

Our Chain-of-Spot (CoS) consistently improves vanilla LLaVA-1.5 on all benchmarks across different language model sizes. The best results are highlighted in bold.

| Method | Language | VQA-v2 | GQA | VizWiz | SQA | Text-VQA | OKVQA |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | Vicuna-7B | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 57.9 |
| LLaVA-1.5-7B + CoS | Vicuna-7B | 80.7 | 63.7 | 50.8 | 68.2 | 60.9 | 58.4 |
| LLaVA-1.5-13B | Vicuna-13B | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 60.9 |
| LLaVA-1.5-13B + CoS | Vicuna-13B | **81.8** | **64.8** | **58.0** | **71.9** | **62.4** | **62.9** |

LLaVA-1.5 with Chain-of-Spot (CoS) achieves state-of-the-art performance on all the multimodal benchmarks, surpassing other LVLMs by a large margin. The best results are highlighted in bold.

| Method | Language | SEED | SEED_Img | MME | MMB | POPE | MM-Vet |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | Vicuna-7B | 58.6 | 66.1 | 1510.7 | 64.3 | 85.9 | 30.5 |
| LLaVA-1.5-7B + CoS | Vicuna-7B | 59.7 | 67.1 | 1501.1 | 64.4 | **86.4** | 30.8 |
| LLaVA-1.5-13B | Vicuna-13B | 61.6 | 68.2 | 1531.3 | 67.7 | 85.9 | 35.4 |
| LLaVA-1.5-13B + CoS | Vicuna-13B | **62.3** | **69.6** | **1546.1** | **68.2** | 86.1 | **37.6** |

Visualizations

Visualizations of Chain-of-Spot. Chain-of-Spot identifies reasonable regions of interest conditioned on the given questions.

Generation comparisons after applying Chain-of-Spot. Chain-of-Spot corrects the focus and the answers of the LLaVA model on complex visual question cases.

Citation

If you find this repository useful, please consider citing:

    @article{liu2024chain,
      title={Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models},
      author={Liu, Zuyan and Dong, Yuhao and Rao, Yongming and Zhou, Jie and Lu, Jiwen},
      journal={arXiv preprint arXiv:2403.12966},
      year={2024}
    }

Acknowledgements

We thank the LLaVA team for their great contribution to the open-source VLM community.


License: Apache License 2.0

