This is the official implementation of the paper "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer" (ICLR 2023).
Model | Pretraining data | ImageNet-1K (top-1 acc.) | COCO Det (box AP) | ADE20K Seg (mIoU) |
---|---|---|---|---|
MAE-base | ImageNet-1K | 83.6 | 51.2 | 48.1 |
SimMIM-base | ImageNet-1K | 84.0 | 52.3 | 52.8 |
HiViT-base | ImageNet-1K | 84.6 | 53.3 | 52.8 |
Pre-trained checkpoint: mae_hivit_base_1600ep_ft100ep.pth (HiViT-Base, 1600 epochs of MAE-style pretraining, 100 epochs of fine-tuning).
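As an illustration, the snippet below is a minimal sketch of loading this checkpoint with PyTorch. The `'model'` key and the `hivit_base` builder are assumptions about the checkpoint layout and the repo API, not guaranteed names; adapt them to the actual code.

```python
# Minimal sketch (assumptions noted): load the released HiViT-Base checkpoint.
import torch

ckpt = torch.load('mae_hivit_base_1600ep_ft100ep.pth', map_location='cpu')
# Assumption: fine-tuned weights are stored under a 'model' key;
# fall back to treating the file as a raw state dict otherwise.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt

from models import hivit_base  # hypothetical import path / builder name
model = hivit_base(num_classes=1000)
msg = model.load_state_dict(state_dict, strict=False)
print('missing keys:', msg.missing_keys)
print('unexpected keys:', msg.unexpected_keys)
```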
1. Supervised learning on ImageNet-1K: See supervised/get_started.md for a quick start.
2. Self-supervised learning on ImageNet-1K: See self_supervised/get_started.md.
3. Object detection: See detection/get_started.md.
4. Semantic segmentation: See segmentation/get_started.md.
Please consider citing our paper in your publications if the project helps your research.
@inproceedings{zhanghivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023},
}