joihn / direct-pretraining

Fast and accurate pipeline for training object detector

Rethinking Training from Scratch for Object Detection


Code for the paper "Rethinking Training from Scratch for Object Detection".

ImageNet pre-training initialization is the de facto standard for object detection. He et al. found that it is possible to train a detector from scratch (random initialization), provided a longer training schedule and a proper normalization technique are used. In this paper, we explore pre-training directly on the target dataset for object detection. In this setting, we find that the widely adopted large resizing strategy, e.g. resizing images to (1333, 800), is important for fine-tuning but not necessary for pre-training. Specifically, we propose a new training pipeline for object detection that follows the "pre-training and fine-tuning" paradigm: the detector is pre-trained on low-resolution images from the target dataset, and the resulting weights are then loaded for fine-tuning on high-resolution images. With this strategy, we can use batch normalization (BN) with a large batch size during pre-training. It is also memory efficient, so it can be applied on machines with very limited GPU memory (11 GB). We call it direct detection pre-training, or direct pre-training for short. Experimental results show that direct pre-training accelerates the pre-training phase by more than 11x on the COCO dataset while achieving +1.8 mAP compared to ImageNet pre-training. We also find that direct pre-training is applicable to transformer-based backbones, e.g. Swin Transformer.
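A rough back-of-the-envelope estimate illustrates why low-resolution pre-training leaves room for a larger batch. The sketch below assumes that activation memory grows roughly linearly with the number of input pixels, and it uses a hypothetical low resolution of (640, 640) purely for illustration; neither the linear model nor that resolution is taken from the paper, and the helper names are made up for this example.

```python
def pixel_ratio(high_res, low_res):
    """Return how many times more pixels high_res has than low_res."""
    high_w, high_h = high_res
    low_w, low_h = low_res
    return (high_w * high_h) / (low_w * low_h)


def scaled_batch_size(base_batch, high_res, low_res):
    """Estimate the batch size that fits at low_res in the memory that
    holds base_batch images at high_res (linear-in-pixels assumption)."""
    return int(base_batch * pixel_ratio(high_res, low_res))


FINE_TUNE_RES = (1333, 800)  # resolution quoted in the abstract
PRE_TRAIN_RES = (640, 640)   # hypothetical low resolution, illustration only

ratio = pixel_ratio(FINE_TUNE_RES, PRE_TRAIN_RES)
batch = scaled_batch_size(2, FINE_TUNE_RES, PRE_TRAIN_RES)
print(f"pixel ratio: {ratio:.2f}x, estimated images per GPU: {batch}")
```

Since (1333, 800) is an upper bound when the aspect ratio is kept, the real ratio for a given image may be smaller; the estimate is only meant to show the direction of the effect.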

Pre-trained models

method              pipeline         bbox mAP   mask mAP   config                model
RetinaNet           ImageNet-1x      36.5       -          -                     -
RetinaNet           Direct(P1x)-1x   37.1       -          pre-train|fine-tune   pre-train|fine-tune
Faster RCNN         ImageNet-1x      37.4       -          -                     -
Faster RCNN         Direct(P1x)-1x   39.3       -          pre-train|fine-tune   pre-train|fine-tune
Cascade RCNN        ImageNet-1x      40.3       -          -                     -
Cascade RCNN        Direct(P1x)-1x   41.5       -          pre-train|fine-tune   pre-train|fine-tune
Mask RCNN           ImageNet-1x      38.2       34.7       -                     -
Mask RCNN           Direct(P1x)-1x   40.0       35.8       pre-train|fine-tune   pre-train|fine-tune
Mask RCNN w/ Swin   ImageNet-1x      43.8       39.6       -                     -
Mask RCNN w/ Swin   Direct(P1x)-1x   45.0       40.5       pre-train|fine-tune   pre-train|fine-tune
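
The per-detector gain of direct pre-training over ImageNet pre-training can be read off the table above with a few lines of Python. This is only a convenience sketch: the numbers are copied from the table, while the dictionary layout and the `gains` helper are made up for illustration.

```python
# (bbox mAP, mask mAP) copied from the table; None where mask mAP is not reported.
RESULTS = {
    "RetinaNet": {"ImageNet-1x": (36.5, None), "Direct(P1x)-1x": (37.1, None)},
    "Faster RCNN": {"ImageNet-1x": (37.4, None), "Direct(P1x)-1x": (39.3, None)},
    "Cascade RCNN": {"ImageNet-1x": (40.3, None), "Direct(P1x)-1x": (41.5, None)},
    "Mask RCNN": {"ImageNet-1x": (38.2, 34.7), "Direct(P1x)-1x": (40.0, 35.8)},
    "Mask RCNN w/ Swin": {"ImageNet-1x": (43.8, 39.6), "Direct(P1x)-1x": (45.0, 40.5)},
}


def gains(results):
    """Return {method: (bbox gain, mask gain)} of Direct over ImageNet."""
    out = {}
    for method, rows in results.items():
        base = rows["ImageNet-1x"]
        direct = rows["Direct(P1x)-1x"]
        out[method] = tuple(
            None if b is None or d is None else round(d - b, 1)
            for b, d in zip(base, direct)
        )
    return out


for method, (bbox, mask) in gains(RESULTS).items():
    mask_text = "n/a" if mask is None else f"{mask:+.1f}"
    print(f"{method}: bbox {bbox:+.1f}, mask {mask_text}")
```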

We also provide models that are sufficiently trained with a longer schedule (3x); these can be used for model initialization.

method              pipeline         bbox mAP   mask mAP   model
Mask RCNN           Direct(P3x)-1x   41.5       37.0       pre-train|fine-tune
Mask RCNN w/ Swin   Direct(P3x)-1x   46.9       41.9       pre-train|fine-tune


This project is mainly based on the MMDetection codebase and Swin Transformer.

git clone
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./


Prepare Dataset

git clone
cd direct-pretraining

# set dataset path
mkdir data
ln -s path/to/coco_dataset data/
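
Before launching a run, it can help to confirm that the linked dataset has the files the configs will look for. The sketch below assumes the layout commonly used by MMDetection's stock COCO configs (an `annotations` folder plus `train2017` and `val2017`) under a root such as `data/coco`; both the root name and the expected entries are assumptions, so adjust them to match the configs in this repository.

```python
from pathlib import Path

# Entries commonly expected by MMDetection's stock COCO configs.
# This list is an assumption, not something stated in this README.
EXPECTED = (
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
)


def missing_coco_entries(root="data/coco"):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_coco_entries()
if missing:
    print("missing under data/coco:", ", ".join(missing))
else:
    print("dataset layout looks complete")
```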


# single-gpu testing

# multi-gpu testing


./ configs/direct_pretraining/ model.pth 8 --eval bbox segm


  • pre-training:
  • fine-tuning:
# set *load_from* in the config file to the pre-trained model path
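
As a minimal sketch of the fine-tuning step, the fragment below shows what the relevant lines of an MMDetection-style Python config might look like. The checkpoint path is a hypothetical placeholder and `fine_tune_resize` is an illustrative variable name; in a full MMDetection 2.x config the `Resize` entry would sit inside the training pipeline list rather than stand alone.

```python
# Fragment of a fine-tuning config in MMDetection's Python config style.

# Path to the checkpoint produced by the pre-training stage.
# Hypothetical placeholder: replace with your own pre-trained model path.
load_from = 'path/to/pretrained_model.pth'

# High-resolution resize used for fine-tuning, as quoted in the abstract.
# Illustrative only: normally one entry of the training pipeline list.
fine_tune_resize = dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)
```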

Citing Direct Pre-training

If this paper helps you, please consider citing direct pre-training.

  title={Rethinking Training from Scratch for Object Detection},
  author={Yang, Li and Hong, Zhang and Yu, Zhang},
  journal={arXiv preprint arXiv:2106.03112},


License: Apache License 2.0

Languages: Python 99.1%, Shell 0.9%