This is an official implementation for "Contextual Transformer Networks for Visual Recognition".

Home Page: https://arxiv.org/pdf/2107.12292.pdf


This repository is the official implementation of Contextual Transformer Networks for Visual Recognition.

Contextual Transformer (CoT) is a unified self-attention building block that acts as an alternative to standard convolutions in ConvNets. It is therefore feasible to replace convolutions with their CoT counterparts to strengthen vision backbones with contextualized self-attention.

2021/3/25-2021/6/5: CVPR 2021 Open World Image Classification Challenge

Rank 1 in Open World Image Classification Challenge @ CVPR 2021. (Team name: VARMS)


The code is mainly based on timm.


  • PyTorch 1.8.0+
  • Python 3.7
  • CUDA 10.1+
  • CuPy
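
Before training, it can help to confirm that the local environment meets the requirements listed above. The following is a minimal sketch, not part of this repository: the helper names (parse_version, meets_minimum, check_environment) are hypothetical, and it only checks Python, PyTorch, and CuPy. The CUDA version is not checked here.

```python
import importlib
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.8.0+cu101' into (1, 8, 0)."""
    numbers = []
    # Drop any local build suffix after '+', then read leading digits per part.
    for piece in str(text).split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(version_text, minimum):
    """Return True if version_text is at least the minimum version tuple."""
    return parse_version(version_text) >= tuple(minimum)


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_environment():
    """Compare the local setup against the minimum versions listed above."""
    torch_version = installed_version("torch")
    return {
        "python": sys.version_info[:2] >= (3, 7),
        "torch": torch_version is not None and meets_minimum(torch_version, (1, 8, 0)),
        "cupy": installed_version("cupy") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Any entry reported as missing or too old should be installed or upgraded before running the training command.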

Clone the repository:

git clone https://github.com/JDAI-CV/CoTNet.git


First, download the ImageNet dataset. To train CoTNet-50 on ImageNet on a single node with 8 GPUs for 350 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 train.py --folder ./experiments/cot_experiments/CoTNet-50-350epoch
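
To launch the same command for a different experiment folder or GPU count, the invocation can be assembled programmatically. The sketch below is illustrative only: build_launch_command is a hypothetical helper that is not part of this repository, and it uses just the two options shown above (--nproc_per_node and --folder). It builds the command without running it.

```python
import shlex
import sys


def build_launch_command(experiment_folder, gpus_per_node=8, python=sys.executable):
    """Assemble the single-node training command as an argument list."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return [
        python,
        "-m",
        "torch.distributed.launch",
        f"--nproc_per_node={gpus_per_node}",
        "train.py",
        "--folder",
        experiment_folder,
    ]


if __name__ == "__main__":
    command = build_launch_command(
        "./experiments/cot_experiments/CoTNet-50-350epoch", python="python"
    )
    # Print a copy-pastable shell line; pass `command` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Passing the list to subprocess.run(command, check=True) from the repository root would start training. Whether an experiment configuration works unchanged with a different GPU count is not covered here.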

The training scripts for CoTNet (e.g., CoTNet-50) can be found in the cot_experiments folder.

Inference Time vs. Accuracy

CoTNet models consistently obtain better top-1 accuracy with less inference time than other vision backbones across both default and advanced training setups. In short, CoTNet models achieve better trade-offs between inference time and accuracy than existing vision backbones.

Results on ImageNet

name            resolution  #params  FLOPs (G)  Top-1 Acc. (%)  Top-5 Acc. (%)  model
CoTNet-50       224         22.2M    3.3        81.3            95.6            GoogleDrive / Baidu
CoTNeXt-50      224         30.1M    4.3        82.1            95.9            GoogleDrive / Baidu
SE-CoTNetD-50   224         23.1M    4.1        81.6            95.8            GoogleDrive / Baidu
CoTNet-101      224         38.3M    6.1        82.8            96.2            GoogleDrive / Baidu
CoTNeXt-101     224         53.4M    8.2        83.2            96.4            GoogleDrive / Baidu
SE-CoTNetD-101  224         40.9M    8.5        83.2            96.5            GoogleDrive / Baidu
SE-CoTNetD-152  224         55.8M    17.0       84.0            97.0            GoogleDrive / Baidu
SE-CoTNetD-152  320         55.8M    26.5       84.6            97.1            GoogleDrive / Baidu

Access code for Baidu is cotn
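
When choosing a checkpoint, it can be convenient to query the table by compute budget. The sketch below copies the numbers from the table above and picks the model with the highest top-1 accuracy under a given FLOPs limit. It assumes the FLOPs column is in GFLOPs and uses FLOPs only as a rough proxy for cost, since the table does not list inference time. The names ModelResult and best_under_budget are hypothetical.

```python
from collections import namedtuple

ModelResult = namedtuple(
    "ModelResult", "name resolution params_m gflops top1 top5"
)

# Values copied from the ImageNet results table (params in millions).
RESULTS = [
    ModelResult("CoTNet-50", 224, 22.2, 3.3, 81.3, 95.6),
    ModelResult("CoTNeXt-50", 224, 30.1, 4.3, 82.1, 95.9),
    ModelResult("SE-CoTNetD-50", 224, 23.1, 4.1, 81.6, 95.8),
    ModelResult("CoTNet-101", 224, 38.3, 6.1, 82.8, 96.2),
    ModelResult("CoTNeXt-101", 224, 53.4, 8.2, 83.2, 96.4),
    ModelResult("SE-CoTNetD-101", 224, 40.9, 8.5, 83.2, 96.5),
    ModelResult("SE-CoTNetD-152", 224, 55.8, 17.0, 84.0, 97.0),
    ModelResult("SE-CoTNetD-152", 320, 55.8, 26.5, 84.6, 97.1),
]


def best_under_budget(max_gflops, results=RESULTS):
    """Return the highest top-1 model within the FLOPs budget, or None.

    Ties on top-1 accuracy are broken in favor of the cheaper model.
    """
    candidates = [r for r in results if r.gflops <= max_gflops]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.top1, -r.gflops))


if __name__ == "__main__":
    for budget in (5.0, 9.0, 30.0):
        choice = best_under_budget(budget)
        print(f"<= {budget} GFLOPs: {choice.name} @ {choice.resolution} ({choice.top1}%)")
```

The tie-break matters at the 101-layer scale, where CoTNeXt-101 and SE-CoTNetD-101 report the same top-1 accuracy.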

CoTNet on downstream tasks

For Object Detection and Instance Segmentation, please see CoTNet for Object Detection and Instance Segmentation.

Citing Contextual Transformer Networks

  title={Contextual Transformer Networks for Visual Recognition},
  author={Li, Yehao and Yao, Ting and Pan, Yingwei and Mei, Tao},
  journal={arXiv preprint arXiv:2107.12292},


Thanks to the contributions of timm and the awesome PyTorch team.




