Empowering Multimodal Large Language Model as a Powerful Data Generator

An Automatic Visual Instruction Tuning Data Generation Pipeline.

Overview

💡Key Contributions:

Pipeline - We introduce an innovative multimodal data generation pipeline, Genixer, that includes four steps: Instruction Data Collection, Instruction Template Design, Training MLLMs, and Data Generation & Filtering.
Two Data Generators - $\text{Genixer}_L$ and $\text{Genixer}_S$.
Two Synthetic Datasets - 915K VQA-like data and 350K REC-like data.

Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA and Vicuna. The dataset is licensed under CC BY NC 4.0 (non-commercial use only), and models trained using the dataset should not be used outside of research purposes.

Genixer Pipeline

Genixer consists of four key steps: 1) instruction data collection, 2) instruction template design, 3) empowering current MLLMs, and 4) data generation and filtering.

Instruction Data Collection

Based on the prevalence and practical relevance of real-world multimodal tasks, we select 9 representative multimodal tasks, listed in the following table, for data generation. We categorize these vision-language (VL) tasks into two types: 4 generic tasks and 5 grounding tasks.

Overview of collected data for training two data generators.

Data Filtering

Illustration of the proposed Fuyu-driven data filtering framework. The framework outputs a probability and a direct answer.
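
As a rough illustration of how a probability and a direct answer could be combined into a keep/drop decision, the sketch below keeps a generated sample only when the filter's answer is affirmative and its probability clears a threshold. The field names (`answer`, `prob`), the helper names, and the 0.5 default threshold are assumptions made for illustration; they are not taken from this repository.

```python
from typing import Dict, Iterable, List


def keep_sample(result: Dict, threshold: float = 0.5) -> bool:
    """Return True when the filter answered 'yes' with enough confidence.

    `result` is a hypothetical record with an 'answer' string and a
    'prob' float; both keys are assumptions for this sketch.
    """
    answer = str(result.get("answer", "")).strip().lower()
    prob = float(result.get("prob", 0.0))
    return answer.startswith("yes") and prob >= threshold


def filter_samples(results: Iterable[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep only the records that pass `keep_sample`, preserving order."""
    return [r for r in results if keep_sample(r, threshold)]


if __name__ == "__main__":
    demo = [
        {"answer": "Yes", "prob": 0.9},   # affirmative and confident
        {"answer": "No", "prob": 0.95},   # confident, but negative
        {"answer": "yes", "prob": 0.3},   # affirmative, but low probability
    ]
    print(filter_samples(demo))
```

Raising the threshold trades dataset size for precision, so it is worth checking how many samples survive at a few different values before settling on one.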

Results

Comparison with SoTA methods on 12 benchmarks.

🧸 Samples of Generated Data

Selected examples generated from $\text{Genixer}_L$ and $\text{Genixer}_S$. The examples include Common VQA, Adv VQA, MC VQA, MD, and five grounding tasks.

58 Handwritten Generic Instructions

For the generic instructions used in training Genixer, please refer to the path Genixer_Shikra/config/_base_/dataset/template/GenQA_general_instructions.json for the details.

Genixer with LLaVA

Install

cd Genixer_LLaVA
conda create -n genixerL python=3.10 -y
conda activate genixerL
pip install --upgrade pip
pip install -e .

Model Weights

Model Name	Checkpoints
Genixer-llava-v1.5-7b	Model weights
llava-Genixer-915K-FT-8K-v1.5-7b	Model weights

Image Datasets

Please download the images of the constituent datasets:

COCO: train2014
GQA: images
OCR-VQA: download script, we save all files as .jpg
AOKVQA: download script
TextVQA: train_val_images
VisualGenome: part1, part2
LLaVA-CC3M-Pretrain-595K: huggingface
LLaVA-Pretrain: huggingface
LLaVA-Instruct: huggingface
Flickr30K: Kaggle

Training data for $\text{Genixer}_L$

TrainDataforGenixerLLaVA.jsonl: 1M instruction tuning samples for training $\text{Genixer}_L$ to generate diverse data types.

Synthetic Data

Genixer_915K.jsonl: This is the synthetic instruction tuning data generated by our trained $\text{Genixer}_L$.

We also provide the two additional synthetic pretraining datasets mentioned in the ablation study:

Genixer_300K.jsonl

Genixer_610K.jsonl
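
Before training on any of these files, it can help to check the record count and look at one sample. The sketch below assumes each file holds one JSON object per line, as the .jsonl extension suggests; the helper names are illustrative and no particular record schema is assumed.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def iter_jsonl(path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of `path`."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path, preview: int = 1) -> Tuple[int, List[Dict]]:
    """Return the number of records and the first `preview` records."""
    count = 0
    head: List[Dict] = []
    for record in iter_jsonl(path):
        if count < preview:
            head.append(record)
        count += 1
    return count, head


if __name__ == "__main__":
    target = Path("data/Genixer_915K.jsonl")  # adjust to your local path
    if target.exists():
        total, head = summarize(target)
        print(f"{total} records; first record keys: {sorted(head[0]) if head else []}")
```

Reading line by line keeps memory use flat, which matters for files with hundreds of thousands of records.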

Evaluation for $\text{Genixer}_L$

Download the model weights Genixer-llava-v1.5-7b under the folder checkpoints.
To run generic-type data generation on unannotated Flickr30K images, use the script scripts/eval_genixer/generic_generation.sh.

CHUNKS=8
CKPT=Genixer-llava-v1.5-7b

qfile=data/flickr30k_imagequery.jsonl
imgdir=/yourpath/flickr30k/flickr30k_images/flickr30k_images
datatype=flickr30k_tem0.2
tasktype=generic

for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=$IDX python -m model_genixer_eval \
        --model-path checkpoints/$CKPT \
        --question-file $qfile \
        --image-folder $imgdir \
        --answers-file ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl \
        --task-type $tasktype \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0.2 \
        --conv-mode vicuna_v1 &
done

wait

output_file=./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/merge.jsonl
> "$output_file"

for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

More evaluation scripts can be found in scripts/eval_genixer.
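
The merge step at the end of the script above can also be done in Python, which is convenient when post-processing on a machine without a shell. This is a minimal sketch that mirrors the `${CHUNKS}_${IDX}.jsonl` naming used by the script; the `merge_chunks` helper is hypothetical and not part of the repository.

```python
from pathlib import Path


def merge_chunks(answers_dir, num_chunks: int, merged_name: str = "merge.jsonl") -> Path:
    """Concatenate per-GPU chunk files into one JSONL file, in chunk order.

    Expects files named '<num_chunks>_<idx>.jsonl' inside `answers_dir`,
    matching the naming used by the shell script.
    """
    answers_dir = Path(answers_dir)
    merged = answers_dir / merged_name
    with merged.open("w", encoding="utf-8") as out:
        for idx in range(num_chunks):
            chunk = answers_dir / f"{num_chunks}_{idx}.jsonl"
            with chunk.open("r", encoding="utf-8") as f:
                for line in f:
                    # Guard against a missing trailing newline so records
                    # from adjacent chunks never run together on one line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return merged
```

A missing chunk file raises `FileNotFoundError`, which is usually the desired behavior because it signals that one of the background jobs did not finish.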

Training for $\text{Genixer}_L$

Download the model weight clip-vit-large-patch14-336 under the folder checkpoints.
Download the model weight llava-v1.5-7b under the folder checkpoints.
Prepare TrainDataforGenixerLLaVA.jsonl under the folder data.
Run the training script bash scripts/train_genixer.sh

#!/bin/bash
outputdir=exp/llava-v1.5-7b-Genixer

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path ./data/TrainDataforGenixerLLaVA.jsonl \
    --image_folder ./data \
    --vision_tower checkpoints/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir $outputdir \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
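
The flags above fix the per-device batch size and the gradient accumulation steps, so the global batch size depends on how many GPUs the job runs on. The sketch below works through that arithmetic. The GPU count of 8 and the round figure of 1,000,000 training samples are assumptions for illustration, and the step counts are approximations rather than the exact values the trainer will report.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Global batch size = per-device batch x accumulation steps x GPUs."""
    return per_device * grad_accum * num_gpus


def approx_schedule(num_samples: int, global_batch: int, epochs: int, warmup_ratio: float):
    """Approximate total optimizer steps and warmup steps for a run."""
    total_steps = math.ceil(num_samples / global_batch) * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


if __name__ == "__main__":
    # 8 and 2 come from the script; 8 GPUs is an assumed setup.
    batch = effective_batch_size(per_device=8, grad_accum=2, num_gpus=8)
    # 1_000_000 is a round-number assumption for the "1M" training set.
    steps, warmup = approx_schedule(1_000_000, batch, epochs=1, warmup_ratio=0.03)
    print(f"global batch: {batch}, ~{steps} steps, ~{warmup} warmup steps")
```

When running on fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps the global batch size unchanged.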

Training LLaVA1.5 with 915K synthetic data

Download the model weight clip-vit-large-patch14-336 under the folder checkpoints.
Download the model weight vicuna-7b-v1.5 under the folder checkpoints.
Download the synthetic pretraining data Genixer_915K.jsonl under the folder data.
Download the mixture finetuning data llava_mix665k_synthetic_8k.jsonl under the folder data.
Run the pretraining script.

bash scripts/pretrain.sh

Run the finetuning script.

bash scripts/finetune.sh

Evaluation on 12 Multimodal Benchmarks

Download llava-Genixer-915K-FT-8K-v1.5-7b under the folder checkpoints.
Follow the data preparation steps from here.

Taking VizWiz as an example, you only need to set modelname to the name of the downloaded model and make sure that the model path and the image folder path are correct.

modelname=llava-Genixer-915K-FT-8K-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path exp/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /dataset/lavis/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json

Genixer with Shikra

Install

cd Genixer_Shikra
conda create -n GenixerS python=3.10
conda activate GenixerS
pip install -r requirements.txt

Model Weights

Model Name	Checkpoints
Genixer-shikra-7b	Coming soon
shikra-Genixer-350K-7b	Model weights

Image Datasets

COCO: train2014
GQA: images
VisualGenome: part1, part2
LLaVA-CC3M-Pretrain-595K: huggingface
LLaVA-Pretrain: huggingface
LLaVA-Instruct: huggingface
Flickr30K: Kaggle
SBU: images

Training Data

Download the original annotation data from here and put it under data.

Please refer to the file Genixer_Shikra/config/_base_/dataset/DEFAULT_TRAIN_DATASET.py to replace yourpath with the exact folder path on your machine.

genrecdata=dict(
        type='GenRECDataset',
        filename=r'{{fileDirname}}/../../../data/REC_ref3_train.jsonl',
        image_folder=r'/yourpath/coco2014/train2014',
        template_file=r"{{fileDirname}}/template/GenQA_general_instructions.json",
    ),
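
Since the config ships with `/yourpath` placeholders, a quick scan can confirm that none are left before launching training. The sketch below lists the lines that still contain the placeholder and can optionally rewrite them; the helper names and the example data root are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

PLACEHOLDER = "/yourpath"


def find_placeholders(config_path, placeholder: str = PLACEHOLDER) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line with the placeholder."""
    text = Path(config_path).read_text(encoding="utf-8")
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if placeholder in line
    ]


def replace_placeholder(config_path, real_root: str, placeholder: str = PLACEHOLDER) -> int:
    """Rewrite the file in place and return how many occurrences were replaced."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(placeholder)
    path.write_text(text.replace(placeholder, real_root), encoding="utf-8")
    return count


if __name__ == "__main__":
    cfg = Path("config/_base_/dataset/DEFAULT_TRAIN_DATASET.py")
    if cfg.exists():
        for lineno, line in find_placeholders(cfg):
            print(f"{cfg}:{lineno}: {line}")
        # replace_placeholder(cfg, "/data/datasets")  # hypothetical data root
```

Keeping the replacement commented out by default makes the scan safe to run repeatedly; uncomment it only after confirming the target root.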

Synthetic Data

We use $\text{Genixer}_S$ to generate two REC-like datasets, syn_lcs_filtered60.jsonl and syn_sbu_filtered60.jsonl, with a total of 350K samples.

Evaluation for $\text{Genixer}_S$

Download the model weight of Genixer-shikra-7b under the folder checkpoints.
Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.
Run the script run_eval_genixer.sh.

accelerate launch --num_processes 8 \
    --main_process_port 23782 \
    mllm/pipeline/finetune.py \
    config/genixer_eval_GenQA.py \
    --cfg-options model_args.model_name_or_path=checkpoints/Genixer-shikra-7b \
    training_args.output_dir=results/Genixer-shikra-7b

Training for $\text{Genixer}_S$

Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.
Download the LLM model weight vicuna-7b-v1.1 under the folder checkpoints.
Download the delta model shikra-7b-delta-v1 of Shikra.
Transform the delta model to shikra-7b-v1.1 with the command bash model_transform.sh.

python mllm/models/models/apply_delta.py \
    --base /yourpath/vicuna-7b-v1.1 \
    --target checkpoints/shikra-7b-v1.1 \
    --delta checkpoints/shikra-7b-delta-v1
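
Delta checkpoints of this kind are typically released as the element-wise difference between the fine-tuned weights and the base weights, so recovering the target model amounts to adding the two together. The toy sketch below illustrates only that idea on plain Python lists. It is not the repository's apply_delta.py, which works on full model checkpoints, and the parameter name in the example is made up.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return target weights where target[name] = base[name] + delta[name]."""
    missing = set(delta) - set(base)
    if missing:
        raise KeyError(f"parameters missing from base: {sorted(missing)}")
    target: Weights = {}
    for name, delta_values in delta.items():
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name!r}")
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}      # toy base weights
    delta = {"layer.weight": [0.5, -1.0]}    # toy delta weights
    print(apply_delta(base, delta))
```

The shape check reflects a practical point: the base model must be the exact version the delta was computed against, otherwise the sum is meaningless even when it runs.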

Run the stage-1 training script.

bash run_genixer_stage1.sh

Run the stage-2 training script.

bash run_genixer_stage2.sh

Training Shikra with 350K Synthetic Data

Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.
Download the LLM model weight vicuna-7b-v1.1 under the folder checkpoints.
Run the script for the stage-0 pretraining.

bash run_genixer_shikra_stage0.sh

Run the script for the stage-1 pretraining.

bash run_genixer_shikra_stage1.sh

Run the script for the stage-2 pretraining.

bash run_genixer_shikra_stage2.sh

Evaluation on REC Tasks

Download the model shikra-Genixer-350K-7b under the folder checkpoints.
Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.
Run the script bash run_eval_rec.sh.

accelerate launch --num_processes 8 \
    --main_process_port 23782 \
    mllm/pipeline/finetune.py \
    config/eval_multi_rec.py \
    --cfg-options model_args.model_name_or_path=checkpoints/shikra-Genixer-350K-7b \
    training_args.output_dir=results/shikra-Genixer-350K-7b

Fuyu-Driven Data Filtering

The code for using Fuyu-8B as the data filter is in the file Genixer_LLaVA/fuyudatafiltering/GenQA_filtering_mp.py.

Run the following command for multi-GPU data filtering.

bash scripts/fuyudatafilter.sh

CLIP-Driven REC Data Filtering

We run the CLIP-driven REC data filtering with the script multiprocess_evalclipscore.py.

python Genixer_Shikra/multiprocess_evalclipscore.py
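
The idea behind a CLIP-score filter can be sketched without loading a model: given one embedding for the image region and one for the referring expression, compute their cosine similarity and keep the sample when it clears a threshold. The embeddings, the `keep_rec_sample` helper, and the threshold value used below are illustrative assumptions; multiprocess_evalclipscore.py remains the reference for how scores are actually computed and thresholded.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_rec_sample(region_emb: Sequence[float], text_emb: Sequence[float], threshold: float) -> bool:
    """Keep a sample when region and expression embeddings are similar enough."""
    return cosine_similarity(region_emb, text_emb) >= threshold


if __name__ == "__main__":
    region = [1.0, 0.0]   # toy stand-in for an image-region embedding
    text = [1.0, 1.0]     # toy stand-in for a text embedding
    print(keep_rec_sample(region, text, threshold=0.6))  # threshold is an assumption
```

In a real pipeline the two vectors would come from a CLIP image encoder and text encoder; only the comparison step is shown here.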

Acknowledgement

LLaVA: the codebase we built upon.
Shikra: the codebase we built upon.

