jo-jstrm / segmentation_pytorch

Simple image segmentation pipeline in pytorch, using HRNet and SegFormer models


SegFormer and HRNet Comparison for Semantic Segmentation

This (incomplete) repo contains an image segmentation pipeline for the Cityscapes dataset, using HRNet and a powerful new transformer-based architecture called SegFormer. The scripts for data preprocessing, training, and inference were written mainly from scratch. The model construction code for HRNet (models/hrnet.py) and SegFormer (models/segformer.py) has been adapted from the official mmseg implementation, whereas models/segformer_simple.py contains a very clean SegFormer implementation that may not be correct.

HRNet and SegFormer are useful architectures to compare because they represent fundamentally different approaches to image understanding. HRNet - like most other vision architectures - is at its core a series of convolution operations that are stacked, fused, and connected in a very efficient manner. SegFormer, on the other hand, is built around transformer layers: aside from the strided convolutions in its patch embeddings and the depthwise convolutions in its feed-forward blocks, it relies on self-attention. It treats each image as a sequence of tokens, where each token represents a 4x4 pixel patch of the image.
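
As a rough illustration (not the repo's actual code), the tokenization step can be thought of as a strided convolution that maps each 4x4 patch to an embedding vector, followed by flattening the spatial grid into a sequence:

import torch
import torch.nn as nn

# Illustrative only: embed each 4x4 patch of a 1024x1024 image into a 64-dim token.
# SegFormer actually uses *overlapping* patch embeddings (e.g. kernel 7, stride 4
# in the first stage), but the resulting token count is the same.
patch_embed = nn.Conv2d(3, 64, kernel_size=4, stride=4)

img = torch.randn(1, 3, 1024, 1024)
feat = patch_embed(img)                   # (1, 64, 256, 256)
tokens = feat.flatten(2).transpose(1, 2)  # (1, 65536, 64): 256*256 tokens of dim 64
print(tokens.shape)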

For training, the implementation details of the original papers were followed as closely as possible.

Due to memory limitations (a single RTX 3090 GPU with 24 GB of VRAM), gradient accumulation was used when training the SegFormer model.
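
For reference, gradient accumulation in PyTorch typically looks like the sketch below. This is a minimal, self-contained example, not the repo's training loop; the stand-in model, the batch sizes, and accum_steps = 4 are illustrative:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 19, kernel_size=1).to(device)  # tiny stand-in for SegFormer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

accum_steps = 4  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(16):  # stand-in for iterating over a Cityscapes DataLoader
    images = torch.randn(2, 3, 64, 64, device=device)
    labels = torch.randint(0, 19, (2, 64, 64), device=device)
    loss = criterion(model(images), labels)
    (loss / accum_steps).backward()  # scale so gradients average over the virtual batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one optimizer update per accum_steps micro-batches
        optimizer.zero_grad()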

HRNet


SegFormer


Official SegFormer

A replication of the B5 model from the official repository. The number of parameters matches the paper. The total number of multiply-adds may not be directly comparable, since it is difficult to determine whether the paper uses the same calculation for its reported "flops".

from functools import partial

import torch.nn as nn

from models.segformer import Segformer

model = Segformer(
    pretrained=cfg.MODEL.PRETRAINED,             # path to pretrained weights from the config
    img_size=1024,
    patch_size=4,
    embed_dims=[64, 128, 320, 512],              # token dimension of each of the 4 stages
    num_heads=[1, 2, 5, 8],                      # attention heads per stage
    mlp_ratios=[4, 4, 4, 4],                     # feed-forward expansion factor per stage
    qkv_bias=True,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
    depths=[3, 6, 40, 3],                        # transformer blocks per stage (B5 config)
    sr_ratios=[8, 4, 2, 1],                      # spatial-reduction ratios for efficient attention
    drop_rate=0.0,
    drop_path_rate=0.1,
    decoder_dim=768,                             # channel width of the all-MLP decoder
)

Total Parameters: 85,915,731

Total Multiply Adds (For Convolution and Linear Layers only): 11,607 GFLOPs

Number of Layers

  • Conv2d : 107 layers
  • LayerNorm : 161 layers
  • OverlapPatchEmbed : 4 layers
  • Linear : 264 layers
  • Dropout : 208 layers
  • Attention : 52 layers
  • Identity : 2 layers
  • DWConv : 52 layers
  • GELU : 52 layers
  • Mlp : 52 layers
  • Block : 52 layers
  • DropPath : 102 layers
  • LinearMLP : 4 layers
  • Dropout2d : 1 layer
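
The parameter total and the per-type layer histogram above can be reproduced with a few lines of standard PyTorch introspection (a sketch assuming the model instance from the snippet above; not necessarily the tool that produced these numbers):

from collections import Counter

# Total parameter count.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total Parameters: {total_params:,}")

# Histogram of module types, mirroring the list above.
layer_counts = Counter(type(m).__name__ for m in model.modules())
for name, count in layer_counts.most_common():
    print(f"{name}: {count} layers")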

Simple SegFormer

Code taken from this repo

import torch

from models.segformer_simple import Segformer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Segformer(
    dims = (64, 128, 320, 512),     # dimensions of each stage
    heads = (1, 2, 5, 8),           # heads of each stage
    ff_expansion = (4, 4, 4, 4),    # feedforward expansion factor of each stage
    reduction_ratio = (8, 4, 2, 1), # reduction ratio of each stage for efficient attention
    num_layers = (3, 6, 40, 3),     # num layers of each stage
    decoder_dim = 768,              # decoder dimension
    num_classes = 19                # number of segmentation classes
).to(device)
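
A quick smoke test (illustrative; SegFormer's all-MLP decoder predicts at 1/4 of the input resolution, so the logits should come out at H/4 x W/4):

x = torch.randn(1, 3, 1024, 1024).to(device)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: (1, 19, 256, 256), i.e. class logits at 1/4 resolution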

Total Parameters: 255,280,531

Total Multiply Adds (For Convolution and Linear Layers only): 679 GFLOPs

Number of Layers

  • MiT : 1 layer
  • Unfold : 4 layers
  • Conv2d : 374 layers
  • LayerNorm : 104 layers
  • EfficientSelfAttention : 52 layers
  • PreNorm : 104 layers
  • DsConv2d : 52 layers
  • GELU : 52 layers
  • MixFeedForward : 52 layers
  • Upsample : 4 layers
