RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation, ECCV 2022

News

2022.07.19 We rename MLSeg to RankSeg to highlight the importance of our rank-oriented design.

2022.07.04 MLSeg has been accepted by ECCV 2022.

Introduction

The segmentation task has traditionally been formulated as a complete-label pixel classification task to predict a class for each pixel from a fixed number of predefined semantic categories shared by all images or videos. Yet, following this formulation, standard architectures will inevitably encounter various challenges under more realistic settings where the scope of categories scales up (e.g., beyond the level of 1k). On the other hand, in a typical image or video, only a few categories, i.e., a small subset of the complete label are present. Motivated by this intuition, in this paper, we propose to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification. Given an input image or video, our framework first conducts multi-label classification over the complete label, then sorts the complete label and selects a small subset according to their class confidence scores. We then use a rank-adaptive pixel classifier to perform the pixel-wise classification over only the selected labels, which uses a set of rank-oriented learnable temperature parameters to adjust the pixel classifications scores. Our approach is conceptually general and can be used to improve various existing segmentation frameworks by simply using a lightweight multi-label classification head and rank-adaptive pixel classifier. We demonstrate the effectiveness of our framework with competitive experimental results across four tasks, including image semantic segmentation, image panoptic segmentation, video instance segmentation, and video semantic segmentation. Especially, with our RankSeg, Mask$2$Former gains +0.8%/+0.7%/+0.7% on ADE$20$K panoptic segmentation/YouTubeVIS 2019 video instance segmentation/VSPW video semantic segmentation benchmarks respectively.

The RankSeg architecture:

Image Semantic Segmentation based on DeepLabV3/Segmenter/Swin/BEiT + RankSeg

RankSeg + DeepLabV3

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
DeepLabV3 (Official)	COCO-Stuff	R101	512x512	20000	37.3	38.4	-	-
DeepLabV3 + RankSeg	COCO-Stuff	R101	512x512	20000	38.4	39.8	-	-
DeepLabV3 (Official)	ADE20K	R101	512x512	80000	44.1	45.2	-	-
DeepLabV3 + RankSeg	ADE20K	R101	512x512	80000	45.5	46.6	-	-
DeepLabV3	COCO+LVIS	R101	512x512	160000	11.0	-	-	-
DeepLabV3 + RankSeg	COCO+LVIS	R101	512x512	160000	12.8	-	-	-

RankSeg + Segmenter

Multi-Scale test is not conducted on ADE20KFull and COCO+LVIS datasets because of memory limits.

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
Segmenter	COCO-Stuff	ViT-B	512x512	40000	41.9	43.8	-	-
Segmenter + RankSeg	COCO-Stuff	ViT-B	512x512	40000	44.9	46.2	-	-
Segmenter	COCO-Stuff	ViT-B	512x512	80000	43.4	45.2	-	-
Segmenter + RankSeg	COCO-Stuff	ViT-B	512x512	80000	45.7	46.7	-	-
Segmenter	COCO-Stuff	ViT-L	640x640	40000	45.5	47.1	-	-
Segmenter + RankSeg	COCO-Stuff	ViT-B	640x640	40000	46.7	47.9	-	-
Segmenter	Pascal-Context60	ViT-B	480x480	80000	53.8	54.6	-	-
Segmenter + RankSeg	Pascal-Context60	ViT-B	480x480	80000	54.7	55.4	-	-
Segmenter	ADE20K	ViT-B	512x512	160000	48.8	50.7	-	-
Segmenter + RankSeg	ADE20K	ViT-B	512x512	160000	49.7	51.4	-	-
Segmenter	ADE20K	ViT-L	640x640	160000	52.0	53.6	-	-
Segmenter + RankSeg	ADE20K	ViT-L	640x640	160000	52.6	54.4	-	-
Segmenter	ADE20KFull	ViT-B	512x512	160000	17.8	-	-	-
Segmenter + RankSeg	ADE20KFull	ViT-B	512x512	160000	18.8	-	-	-
Segmenter	COCO+LVIS	ViT-B	512x512	320000	19.4	-	-	-
Segmenter + RankSeg	COCO+LVIS	ViT-B	512x512	320000	21.3	-	-	-
Segmenter	COCO+LVIS	ViT-B	640x640	320000	23.7	-	-	-
Segmenter + RankSeg	COCO+LVIS	ViT-B	640x640	320000	24.6	-	-	-

RankSeg + Swin

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
Swin	COCO-Stuff	Swin-B	512x512	40000	45.7	47.2	-	-
Swin + RankSeg	COCO-Stuff	Swin-B	512x512	40000	46.6	47.9	-	-
Swin (Official)	ADE20K	Swin-B	512x512	160000	50.8	52.4	-	-
Swin + RankSeg	ADE20K	Swin-B	512x512	160000	51.4	53.0	-	-
Swin	COCO+LVIS	Swin-B	512x512	160000	20.3	-	-	-
Swin + RankSeg	COCO+LVIS	Swin-B	512x512	160000	20.8	-	-	-

RankSeg + BEiT

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
BEiT (Official)	ADE20K	BEiT-L	640x640	160000	56.7	57.0	-	-
RankSeg + BEiT	ADE20K	BEiT-L	640x640	160000	57.0	57.8	-	-
BEiT (Official)	COCO-Stuff	BEiT-L	640x640	160000	49.7	49.9	-	-
RankSeg + BEiT	COCO-Stuff	BEiT-L	640x640	160000	49.9	50.3	-	-

Image Semantic & Panoptic Segmentation based on MaskFormer + RankSeg

Semantic Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
MaskFormer	ADE20K	Swin-B	512x512	160000	52.7	53.9	-	-
MaskFormer + RankSeg	ADE20K	Swin-B	512x512	160000	53.9	55.1	-	-

Panoptic Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	PQ	PQ-th	PQ-st	RQ	RQ-th	RQ-st	SQ	SQ-th	SQ-st	config	download
MaskFormer	ADE20K	R50	640x640	720000	34.7	32.2	39.7	42.8	40.1	48.1	76.7	76.9	76.3	-	-
MaskFormer + RankSeg	ADE20K	R50	640x640	720000	36.5	34.5	40.6	44.9	42.8	48.9	76.8	77.1	76.0	-	-
MaskFormer + RankSeg + GT	ADE20K	R50	640x640	720000	44.3	39.7	53.5	54.5	49.5	64.6	79.6	78.6	81.7	-	-

Image Semantic & Image Panoptic & Video Semantic & Video Instance Segmentation based on Mask2Former + RankSeg

Semantic Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	mIoU(ms+flip)	config	download
Mask2Former	ADE20K	Swin-B	512x512	160000	53.9	55.1	-	-
Mask2Former + RankSeg	ADE20K	Swin-B	512x512	160000	54.9	55.6	-	-
Mask2Former	ADE20K	Swin-L	512x512	160000	56.1	57.3	-	-
Mask2Former + RankSeg	ADE20K	Swin-L	512x512	160000	56.5	58.0	-	-

Panoptic Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	PQ	config	download
Mask2Former	ADE20K	Swin-L	512x512	160000	48.1	-	-
Mask2Former + RankSeg	ADE20K	Swin-L	512x512	160000	48.9	-	-

Video Semantic Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	mIoU	config	download
Mask2Former	VSPW	R101	512x512	6000	45.9	-	-
Mask2Former + RankSeg	VSPW	R101	512x512	6000	47.0	-	-
Mask2Former	VSPW	Swin-L	512x512	6000	59.4	-	-
Mask2Former + RankSeg	VSPW	Swin-L	512x512	6000	60.1	-	-

Video Instance Segmentation

Method	Dataset	Backbone	Crop Size	Lr schd	AP	config	download
Mask2Former	YoutubeVIS2019	R101	512x512	6000	49.2	-	-
Mask2Former + RankSeg	YoutubeVIS2019	R101	512x512	6000	50.5	-	-
Mask2Former	YoutubeVIS2019	Swin-B	512x512	6000	59.5	-	-
Mask2Former + RankSeg	YoutubeVIS2019	Swin-B	512x512	6000	60.3	-	-
Mask2Former	YoutubeVIS2019	Swin-L	512x512	6000	60.4	-	-
Mask2Former + RankSeg	YoutubeVIS2019	Swin-L	512x512	6000	61.1	-	-

Citation

If you find this project useful in your research, please consider cite:

@article{HYYH2022RankSeg,
  title={RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation},
  author={Haodi He and Yuhui Yuan and Xiangyu Yue and Han Hu},
  booktitle={arXiv preprint arXiv:2203.04187},
  year={2022}
}

git diff-index HEAD
git subtree add -P pose <url to sub-repo> <sub-repo branch>

About

[ECCV2022] This is an official implementation of paper "RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation".