🐝 Honeybee: Locality-enhanced Projector for Multimodal LLM

This is an official PyTorch Implementation of Honeybee: Locality-enhanced Projector for Multimodal LLM, Junbum Cha^*, Wooyoung Kang^*, Jonghwan Mun^*, Byungseok Roh. [paper]

News and Updates

2024.04 🔥🔥🔥 Honeybee is accepted by CVPR 2024 as a Highlight.
- Of the 2719 accepted papers, 324 (11.9%) were selected as highlights.

Selected Examples

Environment

PyTorch 2.0.1

pip install -r requirements.txt

# additional requirements for demo
pip install -r requirements_demo.txt

Model Zoo

We provide checkpoints from both the pre-training (PT) and finetuning (FT) stages.

Comparison with other SoTA methods (Table 6)

Model	Checkpoints	MMB	MME	SEED-I	LLaVA-w	MM-Vet	MMMU	POPE
Honeybee-C-7B-M144	PT / FT	70.1	1891.3	64.5	67.1	34.9	35.3	83.2
Honeybee-D-7B-M144	PT / FT	70.8	1835.5	63.8	66.3	-	-	-
Honeybee-C-13B-M256	PT / FT	73.2	1944.0	68.2	75.7	35.6	36.4	84.3
Honeybee-D-13B-M256	PT / FT	73.5	1950.0	66.6	72.9	-	-	-

Pushing the limits of Honeybee (Table 7)

Model	Checkpoints	MMB	MME	SEED-I	LLaVA-w	ScienceQA	MM-Vet	MMMU	POPE
Honeybee-C-7B-M256	PT / FT	71.0	1951.3	65.5	70.6	93.2	38.1	37.3	85.5
Honeybee-C-13B-M576	PT / FT	73.6	1976.5	68.6	77.5	94.4	42.2	36.2	85.6

Data Preparation

After Downloading all of data below, organize the data in ./data.
Then, modify the data-specific argument files, such as annotation and image root paths, in configs/data_configs/train_dataset and configs/tasks, correspondingly.

Pretraining

For the pretraining stage, we use the BlipCapFilt and COYO datasets. Given their large size, we recommend downloading them according to the guidelines provided by here and storing them in the webdataset format.

Please note that we employ a filtered subset of the original COYO-700M dataset, specifically the COYO100M subset. This subset excludes image-text pairs with a CLIP similarity score below 0.3, as determined using the CLIP ViT-B/32.

Finetuning

Please download the datasets for finetuning from their official sources:

VQA (open-ended): VQAv2, GQA, OCRVQA, VSR
VQA (multiple choices): ScienceQA, A-OKVQA
Referring expression comprehension: RefCOCO, RefCOCO+, RefCOCOg, VisualGenome
Instruction: LLaVA150K, ShareGPT

Evaluation

Please follow the official guidelines to prepare benchmark datasets: MMB, MME, SEED-Bench, ScienceQA, LLaVABench, MMVet, MMMU, POPE, and OwlEval.

For GPT-based evaluation, including LLaVABench, MMVet, and MMB (gpt matcher), OpenAI API information should be filled in tasks/llavabench/gpt_eval.py, tasks/mm_vet/mmbet_eval.py, and tasks/mmb/eval_mmb_gpt.py, respectively.

Example Commands

### Pretraining
bash scripts/pt.sh {exp_name} ${args1} ${args2} ...

### Finetuning
bash scripts/ft.sh -p {pretrained_ckpt} {exp_name} ${args1} ${args2} ...

### Evaluation
bash scripts/eval_all.sh {ckpt path}

Examples of easy run for instruction tuning with various dataset combination

Please carefully follow the quotation mark usage in the example below.
e.g., When defining data_config/train_dataset, SHOULD wrap it with single quotation marks (').

# Examples
pretrained_ckpt=<path_to_pretrained_ckpt>
ft_output_dir="output/ft/<path_to_output>"
mkdir -p ${ft_output_dir}

# 1st example: sampling_weights with single quotation marks
deepspeed ./train.py \
	--config-name=finetune output_dir=${ft_output_dir} pretrained_ckpt=${pretrained_ckpt} \
	'data_config/train_dataset=[llava150k,sqa,vicuna40k]' \
	data_config.train_cfg.sampling_weights='[0.5, 0.2, 0.3]' \
	2>&1 | tee ${ft_output_dir}/train.log

# 2nd example: sampling_weights without single quotation marks; there should be no spaces between values.
deepspeed ./train.py \
	--config-name=finetune output_dir=${ft_output_dir} pretrained_ckpt=${pretrained_ckpt} \
	'data_config/train_dataset=[llava150k,sqa,vicuna40k]' \
	data_config.train_cfg.sampling_weights=[0.5,0.2,0.3] \
	2>&1 | tee ${ft_output_dir}/train.log

Strict Reproduction of Official Results

We utilized batch inference in our evaluation to accelerate experiments. The batch inference does not significantly change average scores, but individual scores may vary slightly (about ±0.1~0.2). To strictly reproduce the official results, the use of 8 devices (GPUs) is required; the number of devices influences batch construction, affecting the final scores. We used the default batch size specified in each task config, except for the largest model (Honeybee-C-13B-M576) where we used B=8 due to memory constraints.

Inference and Demo

Example code for the inference is provided in inference_example.ipynb. The example images in ./examples are adopted from mPLUG-Owl.

We also provide gradio demo:

python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last

Citation

@inproceedings{cha2023honeybee,
  title={Honeybee: Locality-enhanced Projector for Multimodal LLM},
  author={Junbum Cha and Wooyoung Kang and Jonghwan Mun and Byungseok Roh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

License

The source code is licensed under Apache 2.0 License.
The pretrained weights are licensed under CC-BY-NC 4.0 License.

Acknowledgement: this project is developed based on mPLUG-Owl, which is also under the Apache 2.0 License.

Disclaimer

Kakao Brain "Honeybee" is the name of the Multimodal Large Language Model (MLLM) open source project, not the customer service brand.

kakaobrain / honeybee