InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

[Update Blog] [Paper] [Chat Demo] [Quick Start] [Chinese Explanation]

News🚀🚀🚀

2024/04/18: InternVL-Chat-V1.5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
2024/02/27: InternVL is accepted by CVPR 2024! 🎉
2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
2024/02/21: InternVL-Chat-V1.2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
2024/02/12: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog, SFT data or try our demo. The model is now available on HuggingFace, and both training/evaluation data and scripts are open-sourced.
2024/02/04: InternVL-Chat-V1.1 achieves 44.67% on MMVP, higher than GPT-4V!
2024/01/27: We release a 448-resolution model that achieves 76.6 on MMBench dev; see here.
2024/01/24: InternVL-Chat-V1.1 is released. It supports Chinese and has stronger OCR capability; see here or try our demo.
2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

Compared with SOTA VLLMs

name	image size	MMMU (val)	MMMU (test)	ChartQA (testavg)	DocVQA (test)	AI2D (test)	MathVista (testmini)	InfoVQA (test)	MMB (test)	MMB-CN (test)	MMVP	MME	ScienceQA (image)	POPE	TextVQA	SEEDv1 (image)	VizWiz (test)	GQA (test)	VQAv2	OCRBench
GPT-4V*	unknown	56.8	55.7	78.5	88.4	78.2	49.9	75.1	77.0	74.4	38.7	1409/517	-	-	78.0	71.6	-	-	77.2	516
Gemini Ultra*	unknown	59.4	-	80.8	90.9	79.5	53.0	80.3	-	-	-	-	-	-	82.3	-	-	-	77.8	-
Gemini Pro*	unknown	47.9	-	74.1	88.1	73.9	45.2	75.2	73.6	74.3	40.7	1497/437	-	-	74.6	70.7	-	-	71.2	-
Qwen-VL-Plus*	unknown	45.2	40.8	78.1	91.4	75.9	43.3	-	67.0	70.7	-	1681/502	-	-	78.9	65.7	-	-	-	726
Qwen-VL-Max*	unknown	51.4	46.8	79.8	93.1	79.3	51.0	-	77.6	75.7	-	-	-	-	79.5	-	-	-	-	640
LLaVA-NEXT-34B	672x672	51.1	44.7	-	-	-	46.5	-	79.3	79.0	-	1631/397	81.8	87.7	69.5	75.9	63.8	67.1	-	-
InternVL-Chat-V1.2	448x448	51.6	46.2	-	-	-	47.7	-	82.2	81.2	56.7	1672/509	83.3	88.0	69.7	75.6	60.0	64.0	-	-
InternVL-Chat-V1.5	4K	48.8	-	83.8	90.4	80.7	53.5	72.4	82.2	82.0	57.3	1638/550	94.0	88.3	80.5	76.0	-	65.7	-	724

* denotes proprietary models.

What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with an LLM.

Model Zoo

Vision Large Language Model

Model	Date	Download	Note
InternVL-Chat-V1.5	2024.04.18	🤗 HF link	4K image input; Stronger OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)
InternVL-Chat-V1.2-Plus	2024.02.21	🤗 HF link	more SFT data and stronger performance
InternVL-Chat-V1.2	2024.02.11	🤗 HF link	scales up the LLM to 34B
InternVL-Chat-V1.1	2024.01.24	🤗 HF link	supports Chinese and has stronger OCR
InternVL-Chat-19B-448px	2024.02.03	🤗 HF link	448 resolution
InternVL-Chat-19B	2023.12.25	🤗 HF link	English multimodal dialogue
InternVL-Chat-13B	2023.12.25	🤗 HF link	English multimodal dialogue

Vision-Language Foundation Model

Model	Date	Download	Note
InternViT-6B-448px-V1.2	2024.02.11	🤗 HF link	448 resolution (🔥new)
InternViT-6B-448px	2024.01.30	🤗 HF link	448 resolution
InternViT-6B-224px	2023.12.22	🤗 HF link	vision foundation model
InternVL-14B-224px	2023.12.22	🤗 HF link	vision-language foundation model

What can InternVL do?

Visual Perception

Linear-Probe Image Classification [see details]

* ViT-22B uses the private JFT-3B dataset.

method	#param	IN-1K	IN-ReaL	IN-V2	IN-A	IN-R	IN-Sketch
OpenCLIP-G	1.8B	86.2	89.4	77.2	63.8	87.8	66.4
DINOv2-g	1.1B	86.5	89.6	78.4	75.9	78.8	62.5
EVA-01-CLIP-g	1.1B	86.5	89.3	77.4	70.5	87.7	63.1
MAWS-ViT-6.5B	6.5B	87.8	-	-	-	-	-
ViT-22B*	21.7B	89.5	90.9	83.2	83.8	87.4	-
InternViT-6B (ours)	5.9B	88.2	90.4	79.9	77.5	89.8	69.1

Semantic Segmentation [see details]

method	decoder	#param (train/total)	crop size	mIoU
OpenCLIP-G (frozen)	Linear	0.3M / 1.8B	512	39.3
ViT-22B (frozen)	Linear	0.9M / 21.7B	504	34.6
InternViT-6B (frozen)	Linear	0.5M / 5.9B	504	47.2 (+12.6)
ViT-22B (frozen)	UperNet	0.8B / 22.5B	504	52.7
InternViT-6B (frozen)	UperNet	0.4B / 6.3B	504	54.9 (+2.2)
ViT-22B	UperNet	22.5B / 22.5B	504	55.3
InternViT-6B	UperNet	6.3B / 6.3B	504	58.9 (+3.6)

Zero-Shot Image Classification [see details]

method	IN-1K	IN-A	IN-R	IN-V2	IN-Sketch	ObjectNet
OpenCLIP-G	80.1	69.3	92.1	73.6	68.9	73.0
EVA-02-CLIP-E+	82.0	82.1	94.5	75.7	71.6	79.6
ViT-22B*	85.9	90.1	96.0	80.9	-	87.6
InternVL-C (ours)	83.2	83.8	95.5	77.3	73.9	80.6

Multilingual Zero-Shot Image Classification [see details]

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

method	IN-1K (EN)	IN-1K (ZH)	IN-1K (JP)	IN-1K (AR)	IN-1K (IT)
Taiyi-CLIP-ViT-H	-	54.4	-	-	-
WuKong-ViT-L-G	-	57.5	-	-	-
CN-CLIP-ViT-H	-	59.6	-	-	-
AltCLIP-ViT-L	74.5	59.6	-	-	-
EVA-02-CLIP-E+	82.0	-	-	-	41.2
OpenCLIP-XLM-R-H	77.0	55.7	53.1	37.0	56.8
InternVL-C (ours)	83.2	64.5	61.5	44.9	65.7

Zero-Shot Video Classification [see details]

method	#frame	K400	K600	K700
OpenCLIP-G	1	65.9	66.1	59.2
EVA-02-CLIP-E+	1	69.8	69.3	63.4
InternVL-C (ours)	1	71.0	71.3	65.7
ViCLIP	8	75.7	73.5	66.4
InternVL-C (ours)	8	79.4	78.8	71.5

Cross-Modal Retrieval

English Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K						COCO						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
OpenCLIP-G	92.9	99.3	99.8	79.5	95.0	97.1	67.3	86.9	92.6	51.4	74.9	83.0	85.0
EVA-02-CLIP-E+	93.9	99.4	99.8	78.8	94.2	96.8	68.8	87.8	92.8	51.1	75.0	82.7	85.1
EVA-CLIP-8B	95.6	99.6	99.9	80.8	95.5	97.6	70.3	89.3	93.9	53.0	76.0	83.4	86.2
InternVL-C (ours)	94.7	99.6	99.9	81.7	96.0	98.2	70.6	89.0	93.5	54.1	77.3	84.6	86.6
InternVL-G (ours)	95.7	99.7	99.9	85.0	97.0	98.6	74.9	91.3	95.2	58.6	81.3	88.0	88.8
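
Judging from the numbers, the `avg` column is the plain mean of the twelve recall values in each row; the table does not state this, so treat it as an inference. The sketch below recomputes it for two rows copied from the table and shows how Recall@K is conventionally computed from a query-by-candidate similarity matrix. The `recall_at_k` and `row_average` helpers and the toy matrix are illustrative assumptions, not code from this repository, and the sketch assumes one ground-truth candidate per query (Flickr30K and COCO pair each image with several captions).

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    Assumes the ground truth for query i is candidate i (illustrative setup).
    """
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def row_average(values):
    """Mean of a row of recall values, rounded to one decimal like the table."""
    return round(sum(values) / len(values), 1)


# Toy similarity matrix (hypothetical values): rows are queries, columns are candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

# Recall values copied from the table above (Flickr30K, then COCO).
table_rows = {
    "OpenCLIP-G": [92.9, 99.3, 99.8, 79.5, 95.0, 97.1,
                   67.3, 86.9, 92.6, 51.4, 74.9, 83.0],
    "InternVL-G (ours)": [95.7, 99.7, 99.9, 85.0, 97.0, 98.6,
                          74.9, 91.3, 95.2, 58.6, 81.3, 88.0],
}

if __name__ == "__main__":
    print("R@1:", recall_at_k(toy_similarity, 1))
    print("R@2:", recall_at_k(toy_similarity, 2))
    for name, values in table_rows.items():
        print(name, row_average(values))
```

The recomputed means match the `avg` column for both rows (85.0 and 88.8).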

Chinese Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K-CN						COCO-CN						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
CN-CLIP-ViT-H	81.6	97.5	98.8	71.2	91.4	95.5	63.0	86.6	92.9	69.2	89.9	96.1	86.1
OpenCLIP-XLM-R-H	86.1	97.5	99.2	71.0	90.5	94.9	70.0	91.5	97.0	66.1	90.8	96.0	87.6
InternVL-C (ours)	90.3	98.8	99.7	75.1	92.9	96.4	68.8	92.0	96.7	68.9	91.9	96.5	89.0
InternVL-G (ours)	92.9	99.4	99.8	77.7	94.8	97.3	71.4	93.9	97.7	73.8	94.4	98.1	90.9

Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

method	EN	ES	FR	ZH	IT	KO	RU	JP	average
AltCLIP	95.4	94.1	92.9	95.1	94.2	94.4	91.8	91.7	93.7
OpenCLIP-XLM-R-H	97.3	96.1	94.5	94.7	96.0	90.2	93.9	94.0	94.6
InternVL-C (ours)	97.3	95.7	95.1	95.6	96.0	92.2	93.3	95.5	95.1
InternVL-G (ours)	98.6	97.7	96.5	96.7	96.9	95.1	94.8	96.1	96.6

Multimodal Dialogue

Zero-Shot Image Captioning [see details]

method	COCO	Flickr30K	NoCaps
Emu-I	117.7	-	-
DreamLLM	115.4	-	-
InternVL-G (ours)	128.2	79.2	113.7

Multimodal Benchmarks with Frozen LLM [see details]

method	visual encoder	glue layer	LLM	res.	COCO	Flickr	NoCaps	VQAv2	GQA	VizWiz	TextVQA	MME	POPE
InstructBLIP	EVA-g	QFormer	V-7B	224	–	82.4	123.1	–	49.2	34.5	50.1	–	–
BLIP-2	EVA-g	QFormer	V-13B	224	–	71.6	103.9	41.0	41.0	19.6	42.5	1293.8	85.3
InstructBLIP	EVA-g	QFormer	V-13B	224	–	82.8	121.9	–	49.5	33.4	50.7	1212.8	78.9
InternVL-Chat (ours)	IViT-6B	QLLaMA	V-7B	224	141.4	89.7	120.5	72.3	57.7	44.5	42.1	1298.5	85.2
InternVL-Chat (ours)	IViT-6B	QLLaMA	V-13B	224	142.4	89.9	123.1	71.7	59.5	54.0	49.1	1317.2	85.4

Multimodal Benchmarks with Trainable LLM [see details]

method	vision encoder	LLM	res.	VQAv2	GQA	VizWiz	SQA	TextVQA	POPE	MME	MMB	MMB_CN	MMVet
LLaVA-1.5	CLIP-L-336px	V-7B	336	78.5	62.0	50.0	66.8	58.2	85.9	1510.7	64.3	58.3	30.5
LLaVA-1.5	CLIP-L-336px	V-13B	336	80.0	63.3	53.6	71.6	61.3	85.9	1531.3	67.7	63.6	35.4
InternVL-Chat (ours)	IViT-6B-224px	V-7B	336	79.3	62.9	52.5	66.2	57.0	86.4	1525.1	64.6	57.6	31.2
InternVL-Chat (ours)	IViT-6B-224px	V-13B	336	80.2	63.9	54.6	70.1	58.7	87.1	1546.9	66.5	61.9	33.7
InternVL-Chat (ours)	IViT-6B-448px	V-13B	448	82.0	64.1	60.1	71.6	64.8	87.2	1579.0	68.2	64.0	36.7

Tiny LVLM [see details]

Rank	Model	Version	Score
🏅️	InternVL	InternVL-Chat	327.61
🥈	InternLM-XComposer-VL	InternLM-XComposer-VL-7B	322.51
🥉	Bard	Bard	319.59
4	Qwen-VL-Chat	Qwen-VL-Chat	316.81
5	LLaVA-1.5	Vicuna-7B	307.17
6	InstructBLIP	Vicuna-7B	300.64
7	InternLM-XComposer	InternLM-XComposer-7B	288.89
8	BLIP2	FlanT5xl	284.72
9	BLIVA	Vicuna-7B	284.17
10	Lynx	Vicuna-7B	279.24

Installation

See INSTALLATION.md

Quick Start with Huggingface

Using InternViT-6B

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)

Using InternVL-C(ontrastive) and InternVL-G(enerative)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
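
The InternVL-C and InternVL-G calls above turn image-text logits into probabilities with a softmax over the text axis. As a rough, dependency-free illustration of reading that output, the sketch below takes the InternVL-C probabilities printed above and picks the best-matching text for each image. The `softmax` and `best_match` helpers and the example logits are assumptions made for illustration; with PyTorch tensors, `probs.argmax(dim=-1)` gives the same indices.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def best_match(prob_rows):
    """Index of the highest-probability text for each image (row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


# Probabilities copied from the InternVL-C output above
# (rows: images, columns: texts).
probs = [
    [9.9609e-01, 5.2185e-03, 6.0070e-08],
    [2.2949e-02, 9.7656e-01, 5.9903e-06],
    [3.2932e-06, 7.4863e-05, 1.0000e+00],
]
texts = [
    "a photo of a red panda",  # English
    "一张熊猫的照片",  # Chinese
    "二匹の猫の写真",  # Japanese
]

matches = best_match(probs)

if __name__ == "__main__":
    for image_idx, text_idx in enumerate(matches):
        print(f"image{image_idx + 1}.jpg -> {texts[text_idx]}")
    # Hypothetical logits, only to show that a softmax row sums to 1.
    print(sum(softmax([2.0, 1.0, 0.1])))
```

Each image is matched to the text at the same index, which is what the near-diagonal probability matrix suggests.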

Using InternVL-Chat

Single GPU

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path)
image = Image.open('./examples/image2.jpg').convert('RGB')
image = image.resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

question = "请详细描述图片"
response = model.chat(tokenizer, pixel_values, question, generation_config)

Multiple GPUs

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()

tokenizer = AutoTokenizer.from_pretrained(path)
image = Image.open('./examples/image2.jpg').convert('RGB')
image = image.resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

question = "请详细描述图片"
response = model.chat(tokenizer, pixel_values, question, generation_config)

Chat Web Demo

Launch a local chat demo

Launch a controller

# run the command in the `internvl_chat_llava` folder
python -m llava.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server

# run the command in the `internvl_chat_llava` folder
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

Launch a model worker

# OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B
# run the command in the `internvl_chat_llava` folder
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./path/to/InternVL-Chat-ViT-6B-Vicuna-7B

# OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B
# run the command in the `internvl_chat_llava` folder
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40001 --worker http://localhost:40001 --model-path ./path/to/InternVL-Chat-ViT-6B-Vicuna-13B

# OpenGVLab/InternVL-Chat-Chinese-V1-1
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40002 --worker http://localhost:40002 --model-path ./path/to/InternVL-Chat-Chinese-V1-1

# OpenGVLab/InternVL-Chat-Chinese-V1-2
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40003 --worker http://localhost:40003 --model-path ./path/to/InternVL-Chat-Chinese-V1-2

# OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus
# run the command in the `internvl_chat` folder
python -m internvl.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40004 --worker http://localhost:40004 --model-path ./path/to/InternVL-Chat-Chinese-V1-2-Plus
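
The worker commands above follow one pattern: each model gets its own port, counting up from 40000, and registers with the controller on port 10000. A small helper can generate them. This is a minimal sketch that only reuses the flags shown above; the `worker_command` function, the `WORKERS` list, and the `model_root` parameter are hypothetical names introduced for illustration, and `./path/to` remains the placeholder from the original commands.

```python
import shlex

CONTROLLER = "http://localhost:10000"

# (model directory name, serving module) pairs taken from the commands above.
# The first two run from the `internvl_chat_llava` folder, the rest from `internvl_chat`.
WORKERS = [
    ("InternVL-Chat-ViT-6B-Vicuna-7B", "llava.serve.model_worker"),
    ("InternVL-Chat-ViT-6B-Vicuna-13B", "llava.serve.model_worker"),
    ("InternVL-Chat-Chinese-V1-1", "internvl.serve.model_worker"),
    ("InternVL-Chat-Chinese-V1-2", "internvl.serve.model_worker"),
    ("InternVL-Chat-Chinese-V1-2-Plus", "internvl.serve.model_worker"),
]


def worker_command(index, name, module, base_port=40000, model_root="./path/to"):
    """Build the shell command for the index-th worker (hypothetical helper)."""
    port = base_port + index
    args = [
        "python", "-m", module,
        "--host", "0.0.0.0",
        "--controller", CONTROLLER,
        "--port", str(port),
        "--worker", f"http://localhost:{port}",
        "--model-path", f"{model_root}/{name}",
    ]
    return shlex.join(args)


commands = [worker_command(i, name, module)
            for i, (name, module) in enumerate(WORKERS)]

if __name__ == "__main__":
    for command in commands:
        print(command)
```

The script only prints the commands; each one still has to be run from the folder noted in the comments, after the controller and web server are up.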

Schedule

Release high-resolution models
Release InternVL-Chat
Release InternVL-C(ontrastive) and InternVL-G(enerative)
Release InternViT-6B

License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

Acknowledgement

InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their awesome work!

If you want to join our WeChat group, please scan the following QR code to add our assistant as a WeChat friend:

About

[CVPR 2024 Oral] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

https://arxiv.org/abs/2312.14238

MIT License
