Pyramid Vision Transformer for object detection with detectron2, together with Conditional Positional Encodings for Vision Transformers and Twins: Revisiting the Design of Spatial Attention in Vision Transformers.

This repo contains the supported code and configuration files to reproduce the object detection results of Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. It is based on detectron2.
| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PVT-Small | ImageNet-1K | 1x | 41.6 | - | 34.2M | 226G | config | - | model |
| PCPVT-Small | ImageNet-1K | 1x | 44.2 | - | 34.4M | 226G | config | - | model |
| Twins-SVT-Small | ImageNet-1K | 1x | 43.1 | - | 34.3M | 209G | config | - | model |
The box mAP of this implementation is higher than that of the mmdetection version (41.6 vs. 40.4); this comparison still needs to be verified. The gap may come from the different resize strategies used during training:
```python
# The resize in mmdetection is single-scale:
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```
```python
# The resize in detectron2 for RetinaNet is multi-scale:
# Size of the smallest side of the image during training
_C.INPUT.MIN_SIZE_TRAIN = (640, 672, 704, 736, 768, 800)
# Sample size of smallest side by choice or random selection from range given by
# INPUT.MIN_SIZE_TRAIN
_C.INPUT.MIN_SIZE_TRAIN_SAMPLING = "choice"
# Maximum size of the side of the image during training
_C.INPUT.MAX_SIZE_TRAIN = 1333
# Size of the smallest side of the image during testing. Set to zero to disable resize in testing.
_C.INPUT.MIN_SIZE_TEST = 800
# Maximum size of the side of the image during testing
_C.INPUT.MAX_SIZE_TEST = 1333
# Mode for flipping images used in data augmentation during training
# choose one of ["horizontal", "vertical", "none"]
_C.INPUT.RANDOM_FLIP = "horizontal"
```
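To make the difference concrete, here is a minimal, framework-free sketch of the two keep-ratio resize strategies (this is illustrative only, not the actual mmdetection or detectron2 code): a fixed short side of 800 versus a short side sampled per image from `MIN_SIZE_TRAIN` with `"choice"` sampling.

```python
import random

def resize_shape(h, w, short_edge, max_size):
    """Scale (h, w) so the short side equals `short_edge`,
    capping the long side at `max_size` (keep-ratio resize)."""
    scale = short_edge / min(h, w)
    if max(h, w) * scale > max_size:
        scale = max_size / max(h, w)
    return round(h * scale), round(w * scale)

# Single-scale (mmdetection-style): short side is always 800.
print(resize_shape(480, 640, 800, 1333))  # -> (800, 1067)

# Multi-scale "choice" (detectron2-style): short side sampled per image.
min_sizes = (640, 672, 704, 736, 768, 800)
short = random.choice(min_sizes)
print(resize_shape(480, 640, short, 1333))
```

With multi-scale sampling the network sees objects at several effective scales during training, which is a plausible source of the mAP difference.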
Please refer to get_started.md for installation and dataset preparation.
Note: you need to convert the original pretrained weights to the detectron2 format with convert_to_d2.py.
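The contents of convert_to_d2.py are repo-specific (in particular any parameter renaming) and are not reproduced here, but detectron2's checkpointer can load a pickled dict of the form `{"model": ...}` with numpy weights. A hedged sketch of such a conversion, with the `torch.load` step replaced by a dummy state dict so it stays framework-free:

```python
import pickle
import numpy as np

def to_d2_checkpoint(state_dict, out_path, author="pvt"):
    """Wrap a plain {name: ndarray} state dict in the pickled
    format that detectron2's checkpointer accepts.

    In practice `state_dict` would come from
    torch.load("pvt_small.pth", map_location="cpu"), with each
    tensor converted via .numpy(); that step is omitted here.
    """
    obj = {
        "model": state_dict,
        "__author__": author,
        # Allows detectron2 to remap slightly different parameter names.
        "matching_heuristics": True,
    }
    with open(out_path, "wb") as f:
        pickle.dump(obj, f)

# Dummy weight standing in for a real PVT checkpoint entry
# (the key name is hypothetical).
dummy = {"backbone.patch_embed1.proj.weight":
         np.zeros((64, 3, 4, 4), dtype=np.float32)}
to_d2_checkpoint(dummy, "pvt_small_d2.pkl")
```

The resulting .pkl can then be pointed to by `MODEL.WEIGHTS` in the config.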