
pvt_detectron2

Pyramid Vision Transformer for object detection with detectron2, together with Conditional Positional Encodings for Vision Transformers (CPVT) and Twins: Revisiting the Design of Spatial Attention in Vision Transformers.

This repo contains the supported code and configuration files to reproduce object detection results of Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. It is based on detectron2.

Results and Models

RetinaNet

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| PVT-Small | ImageNet-1K | 1x | 41.6 | - | 34.2M | 226G | config | - | model |
| PCPVT-Small | ImageNet-1K | 1x | 44.2 | - | 34.4M | 226G | config | - | model |
| Twins-SVT-Small | ImageNet-1K | 1x | 43.1 | - | 34.3M | 209G | config | - | model |

The box mAP (41.6 vs. 40.4) is higher than that of the mmdetection implementation (this still needs to be verified).

The performance gap may lie in the resize strategy used during training:

# The resize in mmdetection is single scale:
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

# The resize in detectron2 for retinanet is multi-scale:

# Size of the smallest side of the image during training
_C.INPUT.MIN_SIZE_TRAIN = (640, 672, 704, 736, 768, 800)
# Sample size of smallest side by choice or random selection from range given by
# INPUT.MIN_SIZE_TRAIN
_C.INPUT.MIN_SIZE_TRAIN_SAMPLING = "choice"
# Maximum size of the side of the image during training
_C.INPUT.MAX_SIZE_TRAIN = 1333
# Size of the smallest side of the image during testing. Set to zero to disable resize in testing.
_C.INPUT.MIN_SIZE_TEST = 800
# Maximum size of the side of the image during testing
_C.INPUT.MAX_SIZE_TEST = 1333
# Mode for flipping images used in data augmentation during training
# choose one of ["horizontal, "vertical", "none"]
_C.INPUT.RANDOM_FLIP = "horizontal"

Usage

Please refer to get_started.md for installation and dataset preparation.

Note: you need to convert the original pretrained weights to detectron2 (d2) format with convert_to_d2.py.
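
For reference, a rough sketch of what such a conversion typically involves is shown below; convert_to_d2.py in this repo is the authoritative version, and the checkpoint file names and the "backbone.bottom_up." key prefix here are only assumptions.

# Rough sketch only: convert_to_d2.py is the authoritative script.
# File names and the "backbone.bottom_up." prefix are assumptions.
import pickle
import torch

src = torch.load("pvt_small.pth", map_location="cpu")  # original PVT checkpoint
state_dict = src.get("model", src)                      # unwrap if nested

converted = {}
for k, v in state_dict.items():
    # detectron2's .pkl checkpoints store numpy arrays keyed by module path.
    converted["backbone.bottom_up." + k] = v.numpy()

with open("pvt_small_d2.pkl", "wb") as f:
    pickle.dump(
        {"model": converted, "__author__": "third_party", "matching_heuristics": True},
        f,
    )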

References

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Conditional Positional Encodings for Vision Transformers
Twins: Revisiting the Design of Spatial Attention in Vision Transformers

License

MIT License

