
TextBind: Multi-turn Interleaved Multimodal Instruction-following


🌐 Project Page • 🤗 Online Demo • 📃 Paper • ⏬ Data • 🤖 Model


Content:

1. Introduction
2. Build Our Demo Locally
3. Train Your Own Models Using Our TextBind Recipe


1. Introduction:

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability, tackling a variety of real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated for multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.


2. Build Our Demo Locally:

2.1. Install Environment:

Install the PyTorch package with the correct CUDA version, for example:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Then install the remaining required packages:

pip install -r requirements.txt
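
To quickly confirm that PyTorch was installed with working CUDA support (an optional check, not required by the setup scripts), you can run:

import torch

# Print the installed PyTorch version and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())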

2.2. Prepare Vision Model:

Following BLIP-2, we use EVA-CLIP as the vision model. Run the following commands to prepare it:

from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model, extract its EVA-CLIP vision encoder, and save it.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
vision_model = model.vision_model
vision_model.save_pretrained("checkpoint/blip2_vision_model")
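
As an optional sanity check, the saved directory can be loaded back with the standalone vision model class from transformers:

from transformers import Blip2VisionModel

# Reload the saved vision encoder and print its hidden size.
vision_model = Blip2VisionModel.from_pretrained("checkpoint/blip2_vision_model")
print(vision_model.config.hidden_size)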

2.3. Prepare TextBind Weights:

| Base Language Model | Huggingface Weights Address | Maximum Sequence Length |
|---|---|---|
| Llama-2-7b-chat-hf | SihengLi/TextBind | 768 |

Then put the downloaded checkpoints under the ./checkpoint/ directory.
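
If you prefer downloading from the command line instead of the web page, one possible way (assuming the huggingface_hub package is installed; adjust local_dir to match your layout) is:

from huggingface_hub import snapshot_download

# Fetch the TextBind checkpoints into ./checkpoint/ so the demo and training
# scripts can find them; the target directory here is an assumption.
snapshot_download(repo_id="SihengLi/TextBind", local_dir="./checkpoint/")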

2.4. Running Demo:

Please set the checkpoint paths in scripts/run_demo.sh as

CHECKPOINT=./checkpoint/second_stage_model.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl
SD_BASE=stabilityai/stable-diffusion-xl-base-1.0
SD_REFINER=stabilityai/stable-diffusion-xl-refiner-1.0

Then you can run the demo locally as

bash scripts/run_demo.sh

3. Train Your Own Models Using Our TextBind Recipe:

Prerequisites: Before training the model, make sure the environment is properly installed and the vision model has been prepared. You can refer to Sections 2.1 and 2.2 for more information.

3.1. Data Preparation:

Disclaimer: To ensure the reproducibility of our results, we have released our training dataset. The dataset must be used for research purposes only.

| Training Stage | Dataset Address |
|---|---|
| Multimodal Alignment | CC3M+CC12M+SBU |
| Multimodal Instruction Following | TextBind |

After downloading, put the files under the ./data/ directory.

For our TextBind data, you need to download the images manually using the url_list provided in the downloaded file and rename the downloaded images according to the image_list.
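
A minimal download-and-rename sketch is given below; the field names url_list and image_list and the exact file layout are assumptions, so adapt them to the actual keys in the downloaded file:

import json
import os
import requests

# Sketch only: "url_list" and "image_list" are assumed to be parallel lists
# stored in the downloaded TextBind file; adjust the paths and keys as needed.
with open("data/textbind/train.json") as f:
    data = json.load(f)

os.makedirs("data/textbind/images", exist_ok=True)
for url, name in zip(data["url_list"], data["image_list"]):
    path = os.path.join("data/textbind/images", name)
    if os.path.exists(path):
        continue
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(path, "wb") as f_img:
            f_img.write(resp.content)
    except requests.RequestException as err:
        print(f"Failed to download {url}: {err}")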

The data directory should look like:

.
└── ./data/
    ├── /cc_sbu/
    │   └── /cc_sbu_dataset/
    │       └── {00000..01254}.tar
    └── /textbind/
        ├── train.json
        └── /images/
            ├── 490272.png
            ├── 862235.png
            └── ...
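
To verify the layout before training (optional; the paths follow the tree above), a quick check is:

import os

# Count the downloaded TextBind images and the CC3M+CC12M+SBU webdataset shards.
assert os.path.isfile("data/textbind/train.json")
print(len(os.listdir("data/textbind/images")), "TextBind images")
shards = [f for f in os.listdir("data/cc_sbu/cc_sbu_dataset") if f.endswith(".tar")]
print(len(shards), "webdataset shards")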

3.2. Prepare BLIP-2 Q-Former:

The BLIP-2 Q-Former is used to initialize our Q-Former. To extract its weights, run:

import torch
from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model, keep only the Q-Former and query-token weights,
# and save them as the initialization checkpoint.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

state_dict = model.state_dict()
state_dict = {key: value for key, value in state_dict.items() if key.split(".")[0] in ["query_tokens", "qformer"]}
torch.save(state_dict, "checkpoint/blip2_qformer.pt")
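
To double-check what was saved (optional), reload the file and inspect the top-level key prefixes:

import torch

# Expect exactly the Q-Former and query-token parameters: ['qformer', 'query_tokens']
state_dict = torch.load("checkpoint/blip2_qformer.pt", map_location="cpu")
print(sorted({key.split(".")[0] for key in state_dict}))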

3.3. Training Configurations:

The table below shows the training hyperparameters used in our experiments. The hyperparameters were selected under the constraints of our computational resources, i.e., 8 × A100 (40GB) GPUs.

| Training Stage | Language Model | Epochs | Batch Size | Learning Rate | Training Modules |
|---|---|---|---|---|---|
| Multimodal Alignment | Llama-2-7b-chat-hf | 2 | 256 | 1e-4 | Q-Former, Linear |
| Multimodal Instruction Following | Llama-2-7b-chat-hf | 3 | 64 | 1e-5 | Q-Former, Linear, LLM |
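
On 8 GPUs, the global batch sizes above are reached through a combination of per-device batch size and gradient accumulation. The helper below only illustrates the arithmetic; the per-device batch sizes are assumptions, not values taken from our scripts:

# Illustrative arithmetic only; the per-device batch sizes are assumed values.
def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    assert global_batch % (num_gpus * per_device_batch) == 0
    return global_batch // (num_gpus * per_device_batch)

print(grad_accum_steps(256, 8, 4))  # multimodal alignment: 8 accumulation steps
print(grad_accum_steps(64, 8, 2))   # instruction following: 4 accumulation steps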

3.4. Training TextBind:

For the multimodal alignment stage, please set the paths in scripts/run_first_stage.sh as

TRAIN_DATA_PATH=${your_first_stage_data_path}
CHECKPOINT=./checkpoint/blip2_qformer.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl

then run the following command:

bash scripts/run_first_stage.sh

For the multimodal instruction tuning stage, please set the paths in scripts/run_second_stage.sh as

TRAIN_DATA_PATH=${your_second_stage_data_path}
CHECKPOINT=${your_first_stage_model_path}
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl

then run the following command:

bash scripts/run_second_stage.sh

Usage and License Notices:

TextBind is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The delta weights are also CC BY NC 4.0 (allowing only non-commercial use).


Citation:

If you find TextBind useful in your research or applications, please cite it using the following BibTeX:

@article{li2023textbind,
  title={TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild},
  author={Li, Huayang and Li, Siheng and Cai, Deng and Wang, Longyue and Liu, Lemao and Watanabe, Taro and Yang, Yujiu and Shi, Shuming},
  year={2023}
}
