jo-jstrm / segmentation_pytorch

Simple image segmentation pipeline in pytorch, using HRNet and SegFormer models


SegFormer and HRNet Comparison for Semantic Segmentation

This (incomplete) repo contains an image segmentation pipeline for the Cityscapes dataset, using HRNet and a powerful new transformer-based architecture called SegFormer. The scripts for data preprocessing, training, and inference were written mainly from scratch. The model construction code for HRNet (models/hrnet.py) and SegFormer (models/segformer.py) has been adapted from the official mmseg implementation, whereas models/segformer_simple.py contains a very clean SegFormer implementation that may not be correct.

HRNet and SegFormer are useful architectures to compare because they represent fundamentally different approaches to image understanding. HRNet - like most other vision architectures - is at its core a series of convolution operations that are stacked, fused, and connected in a very efficient manner. SegFormer, on the other hand, is built around transformer layers: aside from the strided convolutions in its patch embeddings and the depthwise convolutions in its feed-forward blocks, it relies on self-attention. It treats each image as a sequence of tokens, where each token represents a 4x4 pixel patch of the image.
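
As a rough illustration (not the repo's actual code), the tokenization step can be thought of as a strided convolution that maps each 4x4 patch to an embedding vector, followed by flattening the spatial grid into a sequence:

import torch
import torch.nn as nn

# Illustrative only: embed each 4x4 patch of a 1024x1024 image into a 64-dim token.
# SegFormer actually uses *overlapping* patch embeddings (e.g. kernel 7, stride 4
# in the first stage), but the resulting token count is the same.
patch_embed = nn.Conv2d(3, 64, kernel_size=4, stride=4)

img = torch.randn(1, 3, 1024, 1024)
feat = patch_embed(img)                   # (1, 64, 256, 256)
tokens = feat.flatten(2).transpose(1, 2)  # (1, 65536, 64): 256*256 tokens of dim 64
print(tokens.shape)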

For training, the implementation details of the original papers were followed as closely as possible.

Due to memory limitations (a single RTX 3090 GPU with 24 GB of VRAM), gradient accumulation was used when training the SegFormer model.
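
For reference, gradient accumulation in PyTorch typically looks like the sketch below. This is a minimal, self-contained example, not the repo's training loop; the stand-in model, the batch sizes, and accum_steps = 4 are illustrative:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 19, kernel_size=1).to(device)  # tiny stand-in for SegFormer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

accum_steps = 4  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(16):  # stand-in for iterating over a Cityscapes DataLoader
    images = torch.randn(2, 3, 64, 64, device=device)
    labels = torch.randint(0, 19, (2, 64, 64), device=device)
    loss = criterion(model(images), labels)
    (loss / accum_steps).backward()  # scale so gradients average over the virtual batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one optimizer update per accum_steps micro-batches
        optimizer.zero_grad()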

HRNet


SegFormer


Official SegFormer

A replication of the B5 model from the official repository. The number of parameters matches the paper. The total number of multiply-adds may not be directly comparable, since it is difficult to determine whether the paper uses the same calculation for its reported "flops".

from functools import partial

import torch.nn as nn

from models.segformer import Segformer

model = Segformer(
    pretrained=cfg.MODEL.PRETRAINED,             # path to pretrained weights from the config
    img_size=1024,
    patch_size=4,
    embed_dims=[64, 128, 320, 512],              # token dimension of each of the 4 stages
    num_heads=[1, 2, 5, 8],                      # attention heads per stage
    mlp_ratios=[4, 4, 4, 4],                     # feed-forward expansion factor per stage
    qkv_bias=True,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
    depths=[3, 6, 40, 3],                        # transformer blocks per stage (B5 config)
    sr_ratios=[8, 4, 2, 1],                      # spatial-reduction ratios for efficient attention
    drop_rate=0.0,
    drop_path_rate=0.1,
    decoder_dim=768,                             # channel width of the all-MLP decoder
)

Total Parameters: 85,915,731

Total Multiply Adds (For Convolution and Linear Layers only): 11,607 GFLOPs

Number of Layers

  • Conv2d : 107 layers
  • LayerNorm : 161 layers
  • OverlapPatchEmbed : 4 layers
  • Linear : 264 layers
  • Dropout : 208 layers
  • Attention : 52 layers
  • Identity : 2 layers
  • DWConv : 52 layers
  • GELU : 52 layers
  • Mlp : 52 layers
  • Block : 52 layers
  • DropPath : 102 layers
  • LinearMLP : 4 layers
  • Dropout2d : 1 layer
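
The parameter total and the per-type layer histogram above can be reproduced with a few lines of standard PyTorch introspection (a sketch assuming the model instance from the snippet above; not necessarily the tool that produced these numbers):

from collections import Counter

# Total parameter count.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total Parameters: {total_params:,}")

# Histogram of module types, mirroring the list above.
layer_counts = Counter(type(m).__name__ for m in model.modules())
for name, count in layer_counts.most_common():
    print(f"{name}: {count} layers")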

Simple SegFormer

Code taken from this repo

import torch

from models.segformer_simple import Segformer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Segformer(
    dims = (64, 128, 320, 512),     # dimensions of each stage
    heads = (1, 2, 5, 8),           # heads of each stage
    ff_expansion = (4, 4, 4, 4),    # feedforward expansion factor of each stage
    reduction_ratio = (8, 4, 2, 1), # reduction ratio of each stage for efficient attention
    num_layers = (3, 6, 40, 3),     # num layers of each stage
    decoder_dim = 768,              # decoder dimension
    num_classes = 19                # number of segmentation classes
).to(device)
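
A quick smoke test (illustrative; SegFormer's all-MLP decoder predicts at 1/4 of the input resolution, so the logits should come out at H/4 x W/4):

x = torch.randn(1, 3, 1024, 1024).to(device)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: (1, 19, 256, 256), i.e. class logits at 1/4 resolution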

Total Parameters: 255,280,531

Total Multiply Adds (For Convolution and Linear Layers only): 679 GFLOPs

Number of Layers

  • MiT : 1 layer
  • Unfold : 4 layers
  • Conv2d : 374 layers
  • LayerNorm : 104 layers
  • EfficientSelfAttention : 52 layers
  • PreNorm : 104 layers
  • DsConv2d : 52 layers
  • GELU : 52 layers
  • MixFeedForward : 52 layers
  • Upsample : 4 layers
