
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Multi-modal, multi-task LLM
Documentation | 中文文档

Paper · Report Bug · Request Feature

🎉 News

Table of Contents
  1. About The Project
  2. Results
  3. Getting Started
  4. License
  5. Citation
  6. Acknowledgments

About The Project

Structure:

Examples

(back to top)

Demo is coming soon.

Features

Code

  • Epoch Quantitative Evaluation

    • Compute metrics
  • Mixed Datasets

    • Dataset scale specification (portion)
    • Text, Image-Text, Video-Text
  • DeepSpeed

  • LoRA

Tasks

  • Visual Understanding
    • Image Captioning
    • Video Captioning
    • Visual Question Answering (VQA)
  • Visual Segmentation
    • Referring Expression Segmentation (RES)
    • Salient Object Segmentation
    • Semantic Segmentation
  • Visual Grounding
    • Referring Expression Comprehension (REC)

(back to top)

Model Release

| Models | Images/Videos |
| --- | --- |
| u-LLaVA | uLLaVA Stage 2 |

Results

RES

REC

SALIENT

General MLLM

| Fine-tune | ScienceQA | MM-Bench | Seed-Bench |
| --- | --- | --- | --- |
| u-LLaVA-7B | 87.74 | soon | soon |

Video QA

| Zero-shot | Accuracy (Type 3) |
| --- | --- |
| Activity-QA | 51.70% |

Getting Started

Requirements

Run the following commands in terminal:

pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..

Why these steps?

  1. Install the Python requirements: pip install -r requirements.txt
  2. Build the CUDA ops for GroundingDINO: cd ./models/GroundingDINO && ./install.sh && cd ../.. Otherwise you may see UserWarning: Failed to load custom C++ ops. Running on CPU mode Only! (a quick check is sketched below).
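
A quick way to confirm the build succeeded is to import the compiled extension (a minimal sketch; the module path follows the upstream GroundingDINO package layout, and the file name is illustrative):

# check_groundingdino.py -- sanity check for the custom ops (illustrative)
import torch

try:
    # Compiled C++/CUDA extension; if this import fails, GroundingDINO
    # falls back to CPU-only ops and prints the warning quoted above.
    from groundingdino import _C  # noqa: F401
    print("GroundingDINO custom C++/CUDA ops loaded.")
except ImportError as err:
    print(f"Custom ops missing, re-run ./models/GroundingDINO/install.sh: {err}")

print("CUDA available:", torch.cuda.is_available())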

Datasets

Annotation download links: ullava modified annotations, LLaVA pretrain annotations, and LLaVA fine-tuning annotations

Image storage (download link can be found in the table):

image_root
├─ade20k
│  ├─annotations
│  └─images
├─coco2014
│  ├─test2014
│  ├─train2014
│  └─val2014
├─coco2017
│  ├─annotations
│  ├─train2017
│  └─val2017
├─cocostuff
│  ├─train2017
│  └─val2017
├─LLaVA-CC3M-Pretrain-595K
│  └─images
├─saiapr_tc-12
│  ├─00
│  └─01
└─vlpart
    ├─paco
    │  └─annotations
    └─pascal-part
        ├─Annotations_Part
        ├─examples
        └─VOCdevkit

where ade20k is extracted from ADEChallengeData2016.zip and cocostuff from stuffthingmaps_trainval2017.zip.
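
To catch path mistakes early, the layout can be checked with a small script (a sketch only; folder names are taken from the tree above and image_root is a placeholder):

# verify_image_root.py -- sketch; point image_root at your own storage
from pathlib import Path

image_root = Path("/path_to_image_root")
expected = [
    "ade20k/images", "ade20k/annotations",
    "coco2014/train2014", "coco2017/train2017",
    "cocostuff/train2017", "LLaVA-CC3M-Pretrain-595K/images",
    "saiapr_tc-12", "vlpart/paco/annotations", "vlpart/pascal-part",
]
missing = [d for d in expected if not (image_root / d).is_dir()]
print("All expected folders found." if not missing else f"Missing folders: {missing}")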

Stage I: Pre-training

| Dataset | Images/Videos | Annotations |
| --- | --- | --- |
| LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
| TGIF | TGIF - Quark Drive | tgif.json |

Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.

Stage II: Fine-tuning

| Dataset | Images | Annotations |
| --- | --- | --- |
| LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
| RefCOCO | coco2014 | refcoco_train.json |
| RefCOCOg | coco2014 | refcocog_train.json |
| RefCOCO+ | coco2014 | refcoco+_train.json |
| RefCLEF | saiapr_tc-12 | refclef_train.json |
| ADE20K | ade20k | ade20k.json |
| COCO Stuff | cocostuff | cocostuff.json |
| VOC2010 | voc2010 | pascal_part.json |
| PACO LVIS | paco | paco_lvis.json |
| Salient 15K | msra | ullava_salinet_15k.json |

Note: Please download the MSRA-10K and MSRA-B images from their official sites; we thank the authors for sharing them.

Dataset config example

dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'

  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
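
The portion field controls how much of each annotation file is used. A hypothetical helper illustrating the idea (not the repo's actual dataset builder):

# portion_sketch.py -- hypothetical illustration of the `portion` option
import json
import random

def load_portion(anno_path: str, portion: float = 1.0, seed: int = 0) -> list:
    """Load a JSON annotation list and keep roughly `portion` of its samples."""
    with open(anno_path) as f:
        annos = json.load(f)
    if portion < 1.0:
        random.Random(seed).shuffle(annos)
        annos = annos[: int(len(annos) * portion)]
    return annos

# Example (placeholder path): keep half of the LLaVA instruction data.
# subset = load_portion('/path_to_annotations/llava_instruct_150k.json', portion=0.5)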

Note:

  1. We have re-organized most of the dataset annotations for easier training, but please follow the licenses and usage rules of the original datasets.

Training

Stage I: Pre-training

  1. Prepare the open-source base models

| Foundation model | Version | Path |
| --- | --- | --- |
| Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
| LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
| SAM | ViT-H | sam_vit_h_4b8939.pth |
| GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |

Note:

- LLaMA2 is trained with bf16; convergence errors may occur when Stage I is trained with fp16.

- The default tokenizer.legacy of Llama-2 is False, which may raise tokenization mismatch errors with some conversation templates (see the sketch after these notes).

- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. Sorry about the mistake.
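
If you hit the tokenization mismatch above, the legacy flag can be set explicitly when loading the tokenizer. A generic Hugging Face sketch (whether it is needed depends on your transformers version and conversation template; the checkpoint path is taken from the table above):

# tokenizer_legacy_sketch.py -- illustrative only
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    use_fast=False,
    legacy=True,  # or False, to match how the training templates were built
)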

  2. Prepare the datasets
  3. Set the config in configs/train/ullava_core_stage1.yaml

Note: set all dataset and output paths according to your experiments.

  4. Train Stage I with multiple GPUs:

./shells/pretrain.sh

or, for a single GPU: python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml'

Stage I takes about 6 hours per epoch on 4 A100 80G GPUs with bf16. The trained model is then saved to output_dir, for example './exp/ullava_core_7b'.

Stage II: Fine-tuning

After Stage I training has finished, proceed to fine-tuning.

  1. Prepare the datasets
  2. Set the config in
configs/train/ullava_stage2_lora.yaml (for LoRA)
configs/train/ullava_stage2.yaml (for non-LoRA)
  3. Train Stage II with multiple GPUs:
./shells/finetune.sh

or, for a single GPU: python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml'
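
If you fine-tune with the LoRA config, the adapters are typically merged into the base weights before deployment. A generic Hugging Face PEFT sketch (this assumes the LoRA checkpoint is PEFT-compatible; all paths are placeholders, not the repo's actual layout):

# merge_lora_sketch.py -- generic PEFT merge, illustrative only
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("/path_to_base_llm", torch_dtype=torch.bfloat16)
lora = PeftModel.from_pretrained(base, "/path_to_lora_checkpoint")
merged = lora.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("/path_to_merged_model")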

Common Questions

Q1: Which conv_type is used in training?

A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'

Q2: When is LoRA used?

A2: Stage I: LoRA is not used. Stage II: use it or not depending on your devices.

(back to top)

Evaluation

Batch evaluation

  1. Set the config
configs/eval/eval_res.yaml (for the RES task)
configs/eval/eval_rec.yaml (for the REC task)
configs/eval/eval_salient.yaml (for the salient segmentation task)
  2. Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for salient segmentation)
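
For reference, RES results are usually reported as gIoU/cIoU over predicted and ground-truth masks. A minimal sketch of the underlying metric (not the repo's exact evaluation code):

# mask_iou_sketch.py -- illustrative RES metric, not the repo's implementation
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray):
    """Return (iou, intersection, union) for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = float(np.logical_and(pred, gt).sum())
    union = float(np.logical_or(pred, gt).sum())
    return (inter / union if union else 0.0), inter, union

# gIoU: mean of per-sample IoUs; cIoU: sum of intersections / sum of unions.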

(back to top)

Qualitative inference

Modify the parser arguments in evaluation/inference_ullava_core.py and evaluation/inference_ullava.py for Stage I and Stage II, respectively, then run:

python evaluation/inference_ullava_core.py
python evaluation/inference_ullava.py

(back to top)

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

(back to top)

Citation

@inproceedings{xu2024ullava,
  title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
  author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
  booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
  year={2024}
}

(back to top)

TODO

  • Visual Segmentation
    • Instance Segmentation

(back to top)

Acknowledgments

We sincerely thank the open-source community for their contributions. This work is sponsored by the Shanghai Pujiang Program (23PJ1421800).

(back to top)

See the open issues for a full list of proposed features (and known issues).

(back to top)
