InternImage

This repository is the official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

Paper | Blog in Chinese

News

Feb 28, 2023: InternImage is accepted to CVPR 2023!
Nov 18, 2022: 🚀 InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes Camera Only.
Nov 10, 2022: 🚀🚀 InternImage-H sets new records of 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

Coming soon

TensorRT inference.
Other downstream tasks.
Classification code of the InternImage series.
InternImage-T/S/B/L/XL ImageNet-1k pretrained model.
InternImage-L/XL ImageNet-22k pretrained model.
InternImage-T/S/B/L/XL detection and instance segmentation model.
InternImage-T/S/B/L/XL semantic segmentation model.

Introduction

InternImage, first described in an arXiv preprint (arXiv:2211.05778), can serve as a general backbone for computer vision. It uses deformable convolution as the core operator to obtain a large effective receptive field, and introduces adaptive spatial aggregation to reduce the strict inductive bias. This design makes it possible to learn stronger and more robust models with large-scale parameters from massive data.
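To make the idea of deformable sampling with adaptive aggregation more concrete, here is a minimal pure-Python sketch for a single channel and a single output location. It is an illustration under simplifying assumptions, not the repository's CUDA operator: the function names (`bilinear_sample`, `deformable_aggregate`) are hypothetical, the offsets and mask logits are passed in by hand rather than predicted from the input features, and the softmax normalization of the mask is assumed here to stand in for the adaptive aggregation weights.

```python
import math

def bilinear_sample(image, y, x):
    """Sample a 2D list-of-lists image at fractional (y, x).

    Pixels outside the image contribute 0 (zero padding).
    """
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * image[yy][xx]
    return value

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_aggregate(image, center, offsets, mask_logits):
    """Aggregate nine samples around `center` (hypothetical helper).

    offsets:     nine (dy, dx) shifts added to the regular 3x3 grid.
    mask_logits: nine scalars, softmax-normalized into aggregation weights.
    """
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    masks = softmax(mask_logits)
    out = 0.0
    for (gy, gx), (oy, ox), m in zip(grid, offsets, masks):
        out += m * bilinear_sample(image, cy + gy + oy, cx + gx + ox)
    return out

# Usage: a 7x7 ramp image, sampled at its center.
img = [[float(y * 7 + x) for x in range(7)] for y in range(7)]
no_shift = [(0.0, 0.0)] * 9      # regular 3x3 grid
half_right = [(0.0, 0.5)] * 9    # every sample moved half a pixel to the right
uniform = [0.0] * 9              # equal aggregation weights after softmax
print(deformable_aggregate(img, (3, 3), no_shift, uniform))
print(deformable_aggregate(img, (3, 3), half_right, uniform))
```

With zero offsets and uniform mask logits the sketch reduces to a plain 3x3 average, which shows how the fixed sampling grid of a regular convolution is a special case; non-zero offsets let the sampling locations move, and non-uniform logits reweight the samples.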

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained InternImage Models

name	pretrain	resolution	acc@1	#params	FLOPs	22K model	1K model
InternImage-T	ImageNet-1K	224x224	83.5	30M	5G	-	ckpt | cfg
InternImage-S	ImageNet-1K	224x224	84.2	50M	8G	-	ckpt | cfg
InternImage-B	ImageNet-1K	224x224	84.9	97M	16G	-	ckpt | cfg
InternImage-L	ImageNet-22K	384x384	87.7	223M	108G	ckpt	ckpt | cfg
InternImage-XL	ImageNet-22K	384x384	88.0	335M	163G	ckpt	ckpt | cfg

Main Results on Downstream Tasks

COCO Object Detection

backbone	method	schd	box mAP	mask mAP	#params	FLOPs	Download
InternImage-T	Mask R-CNN	1x	47.2	42.5	49M	270G	ckpt | cfg
InternImage-T	Mask R-CNN	3x	49.1	43.7	49M	270G	ckpt | cfg
InternImage-S	Mask R-CNN	1x	47.8	43.3	69M	340G	ckpt | cfg
InternImage-S	Mask R-CNN	3x	49.7	44.5	69M	340G	ckpt | cfg
InternImage-B	Mask R-CNN	1x	48.8	44.0	115M	501G	ckpt | cfg
InternImage-B	Mask R-CNN	3x	50.3	44.8	115M	501G	ckpt | cfg
InternImage-L	Cascade	1x	54.9	47.7	277M	1399G	ckpt | cfg
InternImage-L	Cascade	3x	56.1	48.5	277M	1399G	ckpt | cfg
InternImage-XL	Cascade	1x	55.3	48.1	387M	1782G	ckpt | cfg
InternImage-XL	Cascade	3x	56.2	48.8	387M	1782G	ckpt | cfg

ADE20K Semantic Segmentation

backbone	resolution	single scale	multi scale	#params	FLOPs	Download
InternImage-T	512x512	47.9	48.1	59M	944G	ckpt | cfg
InternImage-S	512x512	50.1	50.9	80M	1017G	ckpt | cfg
InternImage-B	512x512	50.8	51.3	128M	1185G	ckpt | cfg
InternImage-L	640x640	53.9	54.1	256M	2526G	ckpt | cfg
InternImage-XL	640x640	55.0	55.3	368M	3142G	ckpt | cfg

Main Results on Inference Speed (FPS)

name	resolution	#params	FLOPs	Batch 1 FPS (PyTorch)	Batch 1 FPS (TensorRT)
InternImage-T	224x224	30M	5G	44	156
InternImage-S	224x224	50M	8G	40	129
InternImage-B	224x224	97M	16G	40	116
InternImage-L	384x384	223M	108G	40	56
InternImage-XL	384x384	335M	163G	32	47
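The FPS figures above can be easier to compare as per-image latency and as a TensorRT-over-PyTorch speedup. The short script below is only a convenience sketch: the numbers are copied from the table, while the helper names (`latency_ms`, `summarize`) are made up for this example and are not part of the repository.

```python
# (PyTorch FPS, TensorRT FPS) at batch size 1, copied from the table above.
FPS = {
    "InternImage-T": (44, 156),
    "InternImage-S": (40, 129),
    "InternImage-B": (40, 116),
    "InternImage-L": (40, 56),
    "InternImage-XL": (32, 47),
}

def latency_ms(fps):
    """Convert frames per second into milliseconds per image."""
    return 1000.0 / fps

def summarize(fps_table):
    """Return per-model latency (ms) for both runtimes and the TensorRT speedup."""
    rows = {}
    for name, (pytorch_fps, tensorrt_fps) in fps_table.items():
        rows[name] = {
            "pytorch_ms": latency_ms(pytorch_fps),
            "tensorrt_ms": latency_ms(tensorrt_fps),
            "speedup": tensorrt_fps / pytorch_fps,
        }
    return rows

for name, row in summarize(FPS).items():
    print(
        f"{name:15s} PyTorch {row['pytorch_ms']:6.1f} ms  "
        f"TensorRT {row['tensorrt_ms']:6.1f} ms  "
        f"speedup {row['speedup']:.2f}x"
    )
```

Read this way, the gain from TensorRT is largest for the small 224x224 models and shrinks for the 384x384 models.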

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}

About

MIT License
