mmaaz60 / LLaVA-pp

πŸ”₯πŸ”₯ LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

LLaVA++: Extending Visual Capabilities with LLaMA-3 and Phi-3

Oryx Models


Mohamed bin Zayed University of AI


πŸ“’ Latest Updates

  • Apr-26-24: Phi-3-V and LLaMA-3-V released. Excited to release the new integration of LLaVA with the Phi-3 Mini Instruct and LLaMA-3 Instruct models! πŸ”₯πŸ”₯πŸ”₯

πŸ’¬ Introduction

This repository extends the capabilities of the LLaVA 1.5 model by incorporating the latest LLMs released this weekπŸ”₯: Phi-3 Mini Instruct 3.8B and LLaMA-3 Instruct 8B.

πŸ† Results: Phi-3-V and LLaVA-3-V

Comparison on benchmarks for instruction-following LMMs and academic-task-oriented datasets:

| Model | MMMU | POPE | MME | MMBench-en | MMBench-cn | SEED-all | SEED-img | SEED-vid | LLaVA-Wild | GQA | Science-QA | Average* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5-7B | 35.4 | 85.8 | 1510.7 | 64.3 | 58.3 | 58.6 | 66.1 | 37.3 | 65.4 | 62.0 | 66.8 | 58.9 |
| LLaVA-v1.5-13B | 36.4 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 68.2 | 42.7 | 72.5 | 63.3 | 71.6 | 62.3 |
| Phi-3-V-mini-3.8B | 37.8 | 85.6 | 1470.1 | 68.2 | 68.1 | 62.8 | 67.7 | 44.5 | 70.9 | 61.7 | 80.7 | 63.2 |

🌟 LLaMA-3-V-8B results and models - coming soon!

*Average computed excluding MME

πŸ€– Model-Zoo

The following table provides an overview of the models available in our zoo, with a link to each model's Hugging Face page; a usage sketch follows the table.

| Model Name | Hugging Face Link | Summary |
| --- | --- | --- |
| LLaVA-Phi-3-mini-4k-instruct-pretrain | HF | Pretrained on LCS-558K. |
| LLaVA-Phi-3-mini-4k-instruct-lora | HF | LoRA weights fine-tuned on LLaVA-Instruct-665K. |
| LLaVA-Phi-3-mini-4k-instruct | HF | Merged weights in Hugging Face format. |
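
As a usage sketch for the merged checkpoint (assumptions: the Hugging Face repo id and image URL below are placeholders, and the patched codebase resolves the right conversation template automatically), inference can follow the upstream LLaVA eval_model pattern:

# Inference sketch following the upstream LLaVA `eval_model` pattern.
# Assumptions: the repo id and image URL are placeholders; run this inside
# the patched LLaVA codebase from this repository.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "MBZUAI/LLaVA-Phi-3-mini-4k-instruct"   # placeholder HF repo id
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image.",
    "conv_mode": None,              # let the codebase pick the template
    "image_file": "https://llava-vl.github.io/static/images/view.jpg",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)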

Installation

git clone https://github.com/mbzuai-oryx/LLaVA-pp.git
cd LLaVA-pp
git submodule update --init --recursive

In addition to the base LLaVA environment, update the transformers package to the pinned commit below:

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
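
A quick sanity check after the install (a sketch only; it assumes the pinned commit already ships the Phi-3 and LLaMA-3 model classes):

# Sanity-check sketch: confirm the pinned transformers build exposes the
# architectures used by this repo (assumption: the commit includes Phi-3).
import transformers

print("transformers version:", transformers.__version__)
for cls_name in ("LlamaForCausalLM", "Phi3ForCausalLM"):
    status = "available" if hasattr(transformers, cls_name) else "MISSING"
    print(f"{cls_name}: {status}")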

πŸš€ Phi-3-V

To integrate Phi-3-V with LLaVA, follow these steps to update the codebase:

# Copy necessary files
cp Phi-3-V/train.py LLaVA/llava/train/train.py
cp Phi-3-V/llava_phi3.py LLaVA/llava/model/language_model/llava_phi3.py
cp Phi-3-V/builder.py LLaVA/llava/model/builder.py
cp Phi-3-V/model__init__.py LLaVA/llava/model/__init__.py
cp Phi-3-V/main__init__.py LLaVA/llava/__init__.py
cp Phi-3-V/conversation.py LLaVA/llava/conversation.py

# Copy training scripts
cp scripts/Phi3-V_pretrain.sh LLaVA/Phi3-V_pretrain.sh
cp scripts/Phi3-V_finetune_lora.sh LLaVA/Phi3-V_finetune_lora.sh

Train Phi-3-V

  1. Pre-train
     cd LLaVA
     bash Phi3-V_pretrain.sh
  2. Finetune (a LoRA-merge sketch follows these steps)
     cd LLaVA
     bash Phi3-V_finetune_lora.sh
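
After LoRA fine-tuning, a single merged checkpoint is usually more convenient to serve. The sketch below mirrors the pattern of upstream LLaVA's scripts/merge_lora_weights.py; the paths are placeholders, and it assumes the patched builder merges the Phi-3 LoRA weights when a base model is supplied:

# LoRA-merge sketch in the style of upstream LLaVA's scripts/merge_lora_weights.py.
# Paths are placeholders; assumes the patched builder handles the Phi-3 LoRA.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

lora_path = "./checkpoints/llava-phi3-mini-lora"       # placeholder LoRA output
base_model = "microsoft/Phi-3-mini-4k-instruct"        # base language model
save_path = "./checkpoints/llava-phi3-mini-merged"     # placeholder output dir

tokenizer, model, image_processor, context_len = load_pretrained_model(
    lora_path, base_model, get_model_name_from_path(lora_path)
)

# Persist the merged weights and tokenizer for standalone loading.
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)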

πŸš€ LLaMA-3-V

To integrate LLaMA-3-V with LLaVA, follow these steps to update the codebase:

# Copy necessary files
cp LLaMA-3-V/train.py LLaVA/llava/train/train.py
cp LLaMA-3-V/conversation.py LLaVA/llava/conversation.py

# Copy training scripts
cp scripts/LLaMA3-V_pretrain.sh LLaVA/LLaMA3-V_pretrain.sh
cp scripts/LLaMA3-V_finetune_lora.sh LLaVA/LLaMA3-V_finetune_lora.sh

Train LLaMA-3-V

  1. Pre-train
     cd LLaVA
     bash LLaMA3-V_pretrain.sh
  2. Finetune
     cd LLaVA
     bash LLaMA3-V_finetune_lora.sh

πŸ™ Acknowledgement

We are thankful to LLaVA and lmms-eval for releasing their models and code as open-source contributions.

