
TextBind: Multi-turn Interleaved Multimodal Instruction-following


🌐 Project Page • 🤗 Online Demo • 📃 Paper • ⏬ Data • 🤖 Model


Content:

1. Introduction
2. Build Our Demo Locally
3. Train Your Own Models Using Our TextBind Recipe


1. Introduction:

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability, tackling a variety of real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated for multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.


2. Build Our Demo Locally:

2.1. Install Environment:

Install the PyTorch package with the correct CUDA version, for example:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Then install the remaining required packages:

pip install -r requirements.txt
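
To quickly confirm that PyTorch was installed with working CUDA support (an optional check, not required by the setup scripts), you can run:

import torch

# Print the installed PyTorch version and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())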

2.2. Prepare Vision Model:

Following BLIP-2, we use EVA-CLIP as the vision model. Run the following commands to prepare it:

from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model, extract its EVA-CLIP vision encoder, and save it.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
vision_model = model.vision_model
vision_model.save_pretrained("checkpoint/blip2_vision_model")
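
As an optional sanity check, the saved directory can be loaded back with the standalone vision model class from transformers:

from transformers import Blip2VisionModel

# Reload the saved vision encoder and print its hidden size.
vision_model = Blip2VisionModel.from_pretrained("checkpoint/blip2_vision_model")
print(vision_model.config.hidden_size)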

2.3. Prepare TextBind Weights:

| Base Language Model | Huggingface Weights Address | Maximum Sequence Length |
|---|---|---|
| Llama-2-7b-chat-hf | SihengLi/TextBind | 768 |

Then put the downloaded checkpoints under the ./checkpoint/ directory.
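
If you prefer downloading from the command line instead of the web page, one possible way (assuming the huggingface_hub package is installed; adjust local_dir to match your layout) is:

from huggingface_hub import snapshot_download

# Fetch the TextBind checkpoints into ./checkpoint/ so the demo and training
# scripts can find them; the target directory here is an assumption.
snapshot_download(repo_id="SihengLi/TextBind", local_dir="./checkpoint/")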

2.4. Running Demo:

Please set the checkpoint paths in scripts/run_demo.sh as

CHECKPOINT=./checkpoint/second_stage_model.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl
SD_BASE=stabilityai/stable-diffusion-xl-base-1.0
SD_REFINER=stabilityai/stable-diffusion-xl-refiner-1.0

Then you can run the demo locally as

bash scripts/run_demo.sh

3. Train Your Own Models Using Our TextBind Recipe:

Prerequisites: Before training the model, make sure the environment is properly installed and the vision model has been prepared. You can refer to Sections 2.1 and 2.2 for more information.

3.1. Data Preparation:

Disclaimer: To ensure the reproducibility of our results, we have released our training dataset. The dataset must be used for research purposes only.

| Training Stage | Dataset Address |
|---|---|
| Multimodal Alignment | CC3M+CC12M+SBU |
| Multimodal Instruction Following | TextBind |

After downloading, put the files under the ./data/ directory.

For our TextBind data, you need to download the images manually using the url_list provided in the downloaded file and rename the downloaded images according to the image_list.
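
A minimal download-and-rename sketch is given below; the field names url_list and image_list and the exact file layout are assumptions, so adapt them to the actual keys in the downloaded file:

import json
import os
import requests

# Sketch only: "url_list" and "image_list" are assumed to be parallel lists
# stored in the downloaded TextBind file; adjust the paths and keys as needed.
with open("data/textbind/train.json") as f:
    data = json.load(f)

os.makedirs("data/textbind/images", exist_ok=True)
for url, name in zip(data["url_list"], data["image_list"]):
    path = os.path.join("data/textbind/images", name)
    if os.path.exists(path):
        continue
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(path, "wb") as f_img:
            f_img.write(resp.content)
    except requests.RequestException as err:
        print(f"Failed to download {url}: {err}")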

The data directory should look like:

.
└── ./data/
    ├── /cc_sbu/
    │   └── /cc_sbu_dataset/
    │       └── {00000..01254}.tar
    └── /textbind/
        ├── train.json
        └── /images/
            ├── 490272.png
            ├── 862235.png
            └── ...
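
To verify the layout before training (optional; the paths follow the tree above), a quick check is:

import os

# Count the downloaded TextBind images and the CC3M+CC12M+SBU webdataset shards.
assert os.path.isfile("data/textbind/train.json")
print(len(os.listdir("data/textbind/images")), "TextBind images")
shards = [f for f in os.listdir("data/cc_sbu/cc_sbu_dataset") if f.endswith(".tar")]
print(len(shards), "webdataset shards")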

3.2. Prepare BLIP-2 Q-Former:

The BLIP-2 Q-Former is used to initialize our Q-Former. To extract its weights, run:

import torch
from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model, keep only the Q-Former and query-token weights,
# and save them as the initialization checkpoint.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

state_dict = model.state_dict()
state_dict = {key: value for key, value in state_dict.items() if key.split(".")[0] in ["query_tokens", "qformer"]}
torch.save(state_dict, "checkpoint/blip2_qformer.pt")
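
To double-check what was saved (optional), reload the file and inspect the top-level key prefixes:

import torch

# Expect exactly the Q-Former and query-token parameters: ['qformer', 'query_tokens']
state_dict = torch.load("checkpoint/blip2_qformer.pt", map_location="cpu")
print(sorted({key.split(".")[0] for key in state_dict}))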

3.3. Training Configurations:

The table below shows the training hyperparameters used in our experiments. The hyperparameters were selected under the constraints of our computational resources, i.e., 8 × A100 (40GB) GPUs.

| Training Stage | Language Model | Epochs | Batch Size | Learning Rate | Training Modules |
|---|---|---|---|---|---|
| Multimodal Alignment | Llama-2-7b-chat-hf | 2 | 256 | 1e-4 | Q-Former, Linear |
| Multimodal Instruction Following | Llama-2-7b-chat-hf | 3 | 64 | 1e-5 | Q-Former, Linear, LLM |
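
On 8 GPUs, the global batch sizes above are reached through a combination of per-device batch size and gradient accumulation. The helper below only illustrates the arithmetic; the per-device batch sizes are assumptions, not values taken from our scripts:

# Illustrative arithmetic only; the per-device batch sizes are assumed values.
def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    assert global_batch % (num_gpus * per_device_batch) == 0
    return global_batch // (num_gpus * per_device_batch)

print(grad_accum_steps(256, 8, 4))  # multimodal alignment: 8 accumulation steps
print(grad_accum_steps(64, 8, 2))   # instruction following: 4 accumulation steps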

3.4. Training TextBind:

For the multimodal alignment stage, please set the paths in scripts/run_first_stage.sh as

TRAIN_DATA_PATH=${your_first_stage_data_path}
CHECKPOINT=./checkpoint/blip2_qformer.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl

then run the following command:

bash scripts/run_first_stage.sh

For the multimodal instruction tuning stage, please set the paths in scripts/run_second_stage.sh as

TRAIN_DATA_PATH=${your_second_stage_data_path}
CHECKPOINT=${your_first_stage_model_path}
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl

then run the following command:

bash scripts/run_second_stage.sh

Usage and License Notices:

TextBind is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The delta weights are also CC BY NC 4.0 (allowing only non-commercial use).


Citation:

If you find TextBind useful in your research or applications, please cite it using the following BibTeX:

@article{li2023textbind,
  title={TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild},
  author={Li, Huayang and Li, Siheng and Cai, Deng and Wang, Longyue and Liu, Lemao and Watanabe, Taro and Yang, Yujiu and Shi, Shuming},
  year={2023}
}
