PVIT-official / PVIT

Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models


Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Extending the functionality of MLLMs by integrating an additional region-level vision encoder.

[Paper] [Demo]

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are additionally restricted to uses that comply with the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset must not be used outside of research purposes.

Contents

  • Install
  • PVIT Weights
  • Data Generation
  • Demo
  • Data
  • Train
  • Evaluation
  • Citation

Install

  1. Clone this repository and navigate to the PVIT folder
git clone https://github.com/THUNLP-MT/PVIT.git
cd PVIT
  2. Install Package
conda create -n pvit python=3.9.6
conda activate pvit
pip install -r requirements.txt
  3. Install RegionCLIP
git clone https://github.com/microsoft/RegionCLIP.git
pip install -e RegionCLIP

Click here for more details.

PVIT Weights

To obtain the PVIT weights, please first download the weights of LLaMA and RegionCLIP. For RegionCLIP, please download regionclip_pretrained-cc_rn50x4.pth.

Click here for PVIT checkpoints. Put all the weights in the folder model_weights and merge the PVIT delta with the LLaMA weights using the following command.

BASE_MODEL=model_weights/llama-7b TARGET_MODEL=model_weights/pvit DELTA=model_weights/pvit-delta ./scripts/delta_apply.sh
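Conceptually, delta_apply.sh reconstructs the full model as base + delta, parameter by parameter. A minimal sketch of that idea (plain dicts stand in for checkpoint state dicts; this is not the actual script):

```python
# Sketch of delta-weight merging: target = base + delta for every parameter.
# The dicts below are hypothetical stand-ins for real checkpoint tensors.

def apply_delta(base, delta):
    """Add delta weights to base weights; parameter names must match."""
    assert base.keys() == delta.keys(), "checkpoints must share parameter names"
    return {name: [b + d for b, d in zip(base[name], delta[name])]
            for name in base}

base = {"proj.weight": [0.1, 0.2], "proj.bias": [0.0, 0.0]}
delta = {"proj.weight": [0.05, -0.1], "proj.bias": [0.01, 0.02]}
merged = apply_delta(base, delta)
print(merged["proj.weight"])  # merged = base + delta, element-wise
```

The real script applies the same addition over full model tensors and then saves the merged checkpoint to TARGET_MODEL.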

Data Generation

We provide prompts and few-shot examples used when querying ChatGPT in both task-specific instruction data generation and general instruction data generation (Figure 3 (b) and Figure 3 (c) in our paper).

The data_generation/task-specific folder includes the seeds, prompts, and examples used in single-turn and multi-turn conversation generation. Single-turn conversation generation covers five task types: small object recognition, object relationship-based reasoning, optical character recognition (OCR), object attribute-based reasoning, and same-category object discrimination.

The data_generation/general folder includes seeds, prompts and examples used in general instruction data generation.
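Each generation pipeline assembles a ChatGPT query from a prompt template, a few in-context examples, and a seed. A minimal sketch of that assembly (the strings and field layout are illustrative assumptions, not the shipped prompt format):

```python
# Sketch: build a few-shot query from a prompt, example pairs, and a seed.
# The actual prompts/seeds live under data_generation/; contents here are made up.

def build_query(prompt, examples, seed):
    """Concatenate the task prompt, few-shot input/output pairs, and the seed."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{prompt}\n\n{shots}\n\nInput: {seed}\nOutput:"

query = build_query(
    "Generate a region-grounded QA pair for the objects below.",
    [("person <region> dog <region>", "Q: What is the person holding? A: A leash.")],
    "car <region> sign <region>",
)
print(query)  # ends with a bare "Output:" for the model to complete
```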

Demo

To run our demo, you need to prepare PVIT checkpoints locally. Please follow the instructions here to download and merge the checkpoints.

Web Server

To run the demo, please first launch a web server with the following command.

MODEL_PATH=model_weights/pvit CONTROLLER_PORT=39996 WORKER_PORT=40004 ./scripts/model_up.sh

Streamlit Web UI

Run the following command to launch a Streamlit demo locally. The port in MODEL_ADDR should be consistent with WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_demo.sh

CLI Inference

Run the following command to run CLI inference locally. The port in MODEL_ADDR should be consistent with WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_cli.sh

Data

You can download the stage 1 and stage 2 training data from Hugging Face. You also need to download the images of the COCO2017 Train, SBU Captioned Photo, Visual Genome, GQA, and Visual Commonsense Reasoning datasets. Place the stage 1 and stage 2 data, together with the downloaded images, in the folder data as follows. You can modify image_paths in data/stage1/mapping.yaml and data/stage2/mapping.yaml to change the path of the downloaded images.
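One possible layout (the image folder names below are illustrative, not prescribed; whatever you choose must match the image_paths entries in the two mapping.yaml files):

```
data
├── stage1/        # stage 1 training data + mapping.yaml
├── stage2/        # stage 2 training data + mapping.yaml
├── coco2017/      # COCO2017 Train images
├── sbu/           # SBU Captioned Photo images
├── vg/            # Visual Genome images
├── gqa/           # GQA images
└── vcr/           # Visual Commonsense Reasoning images
```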

Train

Our model is trained in two stages. In stage 1, we initialize the model with the pre-trained LLaVA, and only train the linear projection layer that is responsible for transforming the region features. In stage 2, we only keep the parameters of the image encoder and the region encoder frozen, and fine-tune the rest of the model.
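The two-stage schedule amounts to choosing which parameter groups receive gradients in each stage. A stdlib-only sketch of that selection logic (the module-name prefixes are hypothetical, not the real PVIT parameter names):

```python
# Sketch: decide which parameters train in each stage by name prefix.
# Stage 1: only the region-feature projection layer trains.
# Stage 2: everything trains except the image and region encoders.

STAGE1_TRAINABLE = ("region_projector.",)
STAGE2_FROZEN = ("image_encoder.", "region_encoder.")

def trainable(name, stage):
    """Return True if the named parameter should be updated in this stage."""
    if stage == 1:
        return name.startswith(STAGE1_TRAINABLE)
    return not name.startswith(STAGE2_FROZEN)

params = ["image_encoder.layer1.weight", "region_encoder.conv.weight",
          "region_projector.weight", "llm.layers.0.attn.weight"]
print([p for p in params if trainable(p, 1)])  # only the projector
print([p for p in params if trainable(p, 2)])  # everything but the encoders
```

In a real training loop the same predicate would set requires_grad on each parameter before building the optimizer.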

To train PVIT, please download the pretrained LLaVA checkpoints and place them in the folder model_weights.

The following commands are for stage 1 training.

export MODEL_PATH="model_weights/llava-lightning-7b-v1"
export REGION_CLIP_PATH="model_weights/regionclip_pretrained-cc_rn50x4.pth"
export DATA_PATH="data/stage1"
export OUTPUT_DIR="checkpoints/stage1_ckpt"
export PORT=25001
./scripts/train_stage1.sh

The following commands are for stage 2 training.

export MODEL_PATH="checkpoints/stage1_ckpt"
export REGION_CLIP_PATH="model_weights/regionclip_pretrained-cc_rn50x4.pth"
export DATA_PATH="data/stage2"
export OUTPUT_DIR="checkpoints/stage2_ckpt"
export PORT=25001
./scripts/train_stage2.sh

Evaluation

We propose the FineEval dataset for human evaluation. See the folder fine_eval for the dataset and model outputs. The files in the folder are as follows.

  • images: Image files of FineEval dataset.
  • instructions.jsonl: Questions of FineEval dataset.
  • pvit.jsonl: The results of PVIT (ours) model.
  • llava.jsonl: The results of LLaVA model.
  • shikra.jsonl: The results of Shikra model.
  • gpt4roi.jsonl: The results of GPT4RoI model.
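Each .jsonl file holds one JSON object per line, so the stdlib json module is enough to read them. A minimal reader (the "question" field name is an assumption about the file contents):

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage (assuming the repo's fine_eval folder is present):
# for row in read_jsonl("fine_eval/instructions.jsonl"):
#     print(row["question"])
```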

To run PVIT on the FineEval dataset, you can launch a web server and run the following command. The port in MODEL_ADDR should be consistent with WORKER_PORT.

MODEL_ADDR=http://0.0.0.0:40004 ./scripts/run_fine_eval.sh

Citation

If you find PVIT useful for your research and applications, please cite using this BibTeX:

@misc{chen2023positionenhanced,
      title={Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models}, 
      author={Chi Chen and Ruoyu Qin and Fuwen Luo and Xiaoyue Mi and Peng Li and Maosong Sun and Yang Liu},
      year={2023},
      eprint={2308.13437},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

  • LLaVA: the codebase we built upon, which has the amazing multi-modal capabilities!
  • Vicuna: the codebase LLaVA built upon, and the base model Vicuna-13B that has the amazing language capabilities!
  • RegionCLIP: our region-level vision encoder.
