chenjie04 / InternImage

InternImage

This repository is the official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

Paper | Blog in Chinese

News

  • Feb 28, 2023: InternImage is accepted to CVPR 2023!
  • Nov 18, 2022: InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes Camera Only.
  • Nov 10, 2022: InternImage-H achieves a new record of 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

Coming soon

  • TensorRT inference.
  • Other downstream tasks.
  • Classification code of the InternImage series.
  • InternImage-T/S/B/L/XL ImageNet-1k pretrained model.
  • InternImage-L/XL ImageNet-22k pretrained model.
  • InternImage-T/S/B/L/XL detection and instance segmentation model.
  • InternImage-T/S/B/L/XL semantic segmentation model.

Introduction

InternImage, initially described on arXiv, is designed as a general backbone for computer vision. It takes deformable convolution as the core operator to obtain large effective receptive fields, and introduces adaptive spatial aggregation to reduce the strict inductive bias. The model makes it possible to learn stronger and more robust models with large-scale parameters from massive data.
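The idea of a deformable kernel with adaptive aggregation can be illustrated with a minimal pure-Python sketch. It is not the repository's CUDA operator: the function names (`bilinear_sample`, `deformable_point`), the 3x3 kernel size, and the use of a softmax over per-point mask logits are assumptions chosen for illustration. In a real model the offsets and mask logits would be predicted from the input features; here they are passed in by hand.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D list at fractional (y, x); zero outside the map."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_point(feature, center, offsets, mask_logits, weights):
    """Output at one location for a hypothetical 3x3 deformable kernel.

    offsets:     nine (dy, dx) shifts added to the regular 3x3 grid
    mask_logits: nine scores, softmax-normalized into aggregation weights
    weights:     nine per-point kernel weights
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    masks = softmax(mask_logits)
    cy, cx = center
    out = 0.0
    for (dy, dx), (oy, ox), m, w in zip(grid, offsets, masks, weights):
        out += w * m * bilinear_sample(feature, cy + dy + oy, cx + dx + ox)
    return out

# Toy 4x4 feature map with a linear ramp: feature[y][x] = 4*y + x.
feature = [[float(4 * y + x) for x in range(4)] for y in range(4)]
no_shift = [(0.0, 0.0)] * 9
half_shift = [(0.5, 0.5)] * 9
print(round(deformable_point(feature, (1, 1), no_shift, [0.0] * 9, [1.0] * 9), 6))
print(round(deformable_point(feature, (1, 1), half_shift, [0.0] * 9, [1.0] * 9), 6))
```

With zero offsets and equal mask logits the result reduces to the plain 3x3 neighborhood mean. Non-zero offsets move the sampling points off the regular grid, which is how the operator can adapt its receptive field to the input.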

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained InternImage Models

name            pretrain      resolution  acc@1  #params  FLOPs  22K model  1K model
InternImage-T   ImageNet-1K   224x224     83.5   30M      5G     -          ckpt | cfg
InternImage-S   ImageNet-1K   224x224     84.2   50M      8G     -          ckpt | cfg
InternImage-B   ImageNet-1K   224x224     84.9   97M      16G    -          ckpt | cfg
InternImage-L   ImageNet-22K  384x384     87.7   223M     108G   ckpt       ckpt | cfg
InternImage-XL  ImageNet-22K  384x384     88.0   335M     163G   ckpt       ckpt | cfg

Main Results on Downstream Tasks

COCO Object Detection

backbone        method      schd  box mAP  mask mAP  #params  FLOPs  Download
InternImage-T   Mask R-CNN  1x    47.2     42.5      49M      270G   ckpt | cfg
InternImage-T   Mask R-CNN  3x    49.1     43.7      49M      270G   ckpt | cfg
InternImage-S   Mask R-CNN  1x    47.8     43.3      69M      340G   ckpt | cfg
InternImage-S   Mask R-CNN  3x    49.7     44.5      69M      340G   ckpt | cfg
InternImage-B   Mask R-CNN  1x    48.8     44.0      115M     501G   ckpt | cfg
InternImage-B   Mask R-CNN  3x    50.3     44.8      115M     501G   ckpt | cfg
InternImage-L   Cascade     1x    54.9     47.7      277M     1399G  ckpt | cfg
InternImage-L   Cascade     3x    56.1     48.5      277M     1399G  ckpt | cfg
InternImage-XL  Cascade     1x    55.3     48.1      387M     1782G  ckpt | cfg
InternImage-XL  Cascade     3x    56.2     48.8      387M     1782G  ckpt | cfg

ADE20K Semantic Segmentation

backbone        resolution  single scale  multi scale  #params  FLOPs  Download
InternImage-T   512x512     47.9          48.1         59M      944G   ckpt | cfg
InternImage-S   512x512     50.1          50.9         80M      1017G  ckpt | cfg
InternImage-B   512x512     50.8          51.3         128M     1185G  ckpt | cfg
InternImage-L   640x640     53.9          54.1         256M     2526G  ckpt | cfg
InternImage-XL  640x640     55.0          55.3         368M     3142G  ckpt | cfg

Main Results of FPS

name            resolution  #params  FLOPs  Batch 1 FPS (PyTorch)  Batch 1 FPS (TensorRT)
InternImage-T   224x224     30M      5G     44                     156
InternImage-S   224x224     50M      8G     40                     129
InternImage-B   224x224     97M      16G    40                     116
InternImage-L   384x384     223M     108G   40                     56
InternImage-XL  384x384     335M     163G   32                     47
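To read the FPS table as relative speedups, the short sketch below divides the TensorRT throughput by the PyTorch throughput for each model. The numbers are copied from the table above; the helper name `tensorrt_speedup` is only an illustrative choice and not part of the repository.

```python
# Batch-1 FPS from the table above: (PyTorch, TensorRT).
fps = {
    "InternImage-T": (44, 156),
    "InternImage-S": (40, 129),
    "InternImage-B": (40, 116),
    "InternImage-L": (40, 56),
    "InternImage-XL": (32, 47),
}

def tensorrt_speedup(pytorch_fps, tensorrt_fps):
    """Ratio of TensorRT throughput to PyTorch throughput."""
    return tensorrt_fps / pytorch_fps

speedups = {name: tensorrt_speedup(pt, trt) for name, (pt, trt) in fps.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

By this measure the gain from TensorRT is largest for the smallest model and shrinks for the 384x384 models.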

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}

About

License: MIT License

Languages: Python 60.9%, Cuda 33.3%, C++ 4.7%, Shell 1.1%