Pyramid Vision Transformer for object detection with detectron2, together with Conditional Positional Encodings for Vision Transformers and Twins: Revisiting the Design of Spatial Attention in Vision Transformers.

This repo contains the supported code and configuration files to reproduce the object detection results of Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. It is based on detectron2.
| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PVT-Small | ImageNet-1K | 1x | 41.6 | - | 34.2M | 226G | config | - | model |
| PCPVT-Small | ImageNet-1K | 1x | 44.2 | - | 34.4M | 226G | config | - | model |
| Twins-SVT-Small | ImageNet-1K | 1x | 43.1 | - | 34.3M | 209G | config | - | model |
The box mAP of this implementation is higher than that of the mmdetection version (41.6 vs. 40.4); this comparison still needs to be verified. The gap may come from the different resize strategies used during training:
```python
# The resize in mmdetection is single-scale:
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```
```python
# The resize in detectron2 for RetinaNet is multi-scale:
# Size of the smallest side of the image during training
_C.INPUT.MIN_SIZE_TRAIN = (640, 672, 704, 736, 768, 800)
# Sample size of smallest side by choice or random selection from range given by
# INPUT.MIN_SIZE_TRAIN
_C.INPUT.MIN_SIZE_TRAIN_SAMPLING = "choice"
# Maximum size of the side of the image during training
_C.INPUT.MAX_SIZE_TRAIN = 1333
# Size of the smallest side of the image during testing. Set to zero to disable resize in testing.
_C.INPUT.MIN_SIZE_TEST = 800
# Maximum size of the side of the image during testing
_C.INPUT.MAX_SIZE_TEST = 1333
# Mode for flipping images used in data augmentation during training
# choose one of ["horizontal", "vertical", "none"]
_C.INPUT.RANDOM_FLIP = "horizontal"
```
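To make the difference concrete, here is a minimal, framework-free sketch of the two keep-ratio resize strategies (this is illustrative only, not the actual mmdetection or detectron2 code): a fixed short side of 800 versus a short side sampled per image from `MIN_SIZE_TRAIN` with `"choice"` sampling.

```python
import random

def resize_shape(h, w, short_edge, max_size):
    """Scale (h, w) so the short side equals `short_edge`,
    capping the long side at `max_size` (keep-ratio resize)."""
    scale = short_edge / min(h, w)
    if max(h, w) * scale > max_size:
        scale = max_size / max(h, w)
    return round(h * scale), round(w * scale)

# Single-scale (mmdetection-style): short side is always 800.
print(resize_shape(480, 640, 800, 1333))  # -> (800, 1067)

# Multi-scale "choice" (detectron2-style): short side sampled per image.
min_sizes = (640, 672, 704, 736, 768, 800)
short = random.choice(min_sizes)
print(resize_shape(480, 640, short, 1333))
```

With multi-scale sampling the network sees objects at several effective scales during training, which is a plausible source of the mAP difference.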
Please refer to get_started.md for installation and dataset preparation.
Note: you need to convert the original pretrained weights to the detectron2 format with convert_to_d2.py.
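The contents of convert_to_d2.py are repo-specific (in particular any parameter renaming) and are not reproduced here, but detectron2's checkpointer can load a pickled dict of the form `{"model": ...}` with numpy weights. A hedged sketch of such a conversion, with the `torch.load` step replaced by a dummy state dict so it stays framework-free:

```python
import pickle
import numpy as np

def to_d2_checkpoint(state_dict, out_path, author="pvt"):
    """Wrap a plain {name: ndarray} state dict in the pickled
    format that detectron2's checkpointer accepts.

    In practice `state_dict` would come from
    torch.load("pvt_small.pth", map_location="cpu"), with each
    tensor converted via .numpy(); that step is omitted here.
    """
    obj = {
        "model": state_dict,
        "__author__": author,
        # Allows detectron2 to remap slightly different parameter names.
        "matching_heuristics": True,
    }
    with open(out_path, "wb") as f:
        pickle.dump(obj, f)

# Dummy weight standing in for a real PVT checkpoint entry
# (the key name is hypothetical).
dummy = {"backbone.patch_embed1.proj.weight":
         np.zeros((64, 3, 4, 4), dtype=np.float32)}
to_d2_checkpoint(dummy, "pvt_small_d2.pkl")
```

The resulting .pkl can then be pointed to by `MODEL.WEIGHTS` in the config.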