Generative Semantic Segmentation

Paper

Generative Semantic Segmentation,
Jiaqi Chen, Jiachen Lu, Xiatian Zhu, and Li Zhang
CVPR 2023

Abstract

We present Generative Semantic Segmentation (GSS), a generative framework for semantic segmentation. Unlike previous methods, which address a per-pixel classification problem, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. This is done by expressing the segmentation mask as a special type of image (dubbed maskige). This posterior distribution allows segmentation masks to be generated unconditionally. To implement semantic segmentation, we further introduce a conditioning network (e.g., an encoder-decoder Transformer) optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input images on the training set. Extensive experiments on standard benchmarks show that GSS performs competitively with prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting.
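The divergence-minimization idea in the abstract can be read as a conditional evidence lower bound. The form below is a sketch in our own notation, not copied from the paper: $x$ is the input image, $c$ the segmentation mask (maskige), $z$ the latent variable, $q_\phi(z \mid c)$ the variational posterior, $p_\psi(z \mid x)$ the image-conditioned latent prior, and $p_\theta(c \mid z)$ the mask decoder. It assumes that $c$ depends on $x$ only through $z$.

$$
\log p(c \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid c)}\big[\log p_\theta(c \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid c) \,\|\, p_\psi(z \mid x)\big)
$$

Under this reading, the first term corresponds to learning to reconstruct masks from latents, and the second term is the divergence that the conditioning network minimizes.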

TODO List

Upload model weights and DALL-E VQVAE weight
Provide stage-1 training code and Maskige reconstruction code
Provide the illustration of the GSS-FF and GSS-FT-W (and more training details)
Complete install.md
Add dataset link

Results

Cityscapes

Name	Backbone	Iterations	mIoU	mAcc	Config	Checkpoint
GSS-FF	R101	80k	77.76	85.9	config	google drive
GSS-FF	Swin-L	80k	78.90	87.03	config	google drive
GSS-FT-W	ResNet	80k	78.46	85.92	config	google drive
GSS-FT-W	Swin-L	80k	80.05	87.32	config	google drive

ADE20K

Name	Backbone	Iterations	mIoU	mAcc	Config	Checkpoint
GSS-FF	Swin-L	160k	46.29	57.84	config	google drive
GSS-FT-W	Swin-L	160k	48.54	58.94	config	google drive

MSeg

Name	Backbone	Iterations	h.mean	Config	Checkpoint
GSS-FF	HRNet-W48	160k	52.60	config	google drive
GSS-FF	Swin-L	160k	59.49	config	google drive
GSS-FT-W	HRNet-W48	160k	55.20	config	google drive
GSS-FT-W	Swin-L	160k	61.94	config	google drive
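The h.mean column is read here as the harmonic mean of per-dataset mIoU scores, which is how MSeg-style cross-domain results are usually summarized. The sketch below shows that aggregation; the per-dataset values are hypothetical placeholders, not results from this repository.

```python
from statistics import harmonic_mean

# Hypothetical per-dataset mIoU values in percent (illustration only).
per_dataset_miou = [40.0, 50.0, 80.0]

# The harmonic mean penalizes a weak dataset more than the arithmetic mean does.
h_mean = harmonic_mean(per_dataset_miou)
arithmetic_mean = sum(per_dataset_miou) / len(per_dataset_miou)

print(f"h.mean={h_mean:.2f}, arithmetic mean={arithmetic_mean:.2f}")
```

Because the harmonic mean is dominated by the lowest score, a model needs to do reasonably well on every test dataset to reach a high h.mean.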

Get Started

Prepare Environment

This implementation is built upon mmsegmentation. Please follow the steps in install.md to prepare the environment and datasets.

We use the DALL-E pre-trained VQVAE weights and freeze both the encoder and decoder. Please download the encoder and decoder weights using the following command:

bash tools/download_pretrain_vqvae.sh

Eval

Please download the pre-trained model weights and put them in the ./<ckp_dir> folder. We provide the following scripts to evaluate GSS.

bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval mIoU

For example, to evaluate the GSS-FF model on the Cityscapes dataset, run:

# test with 8 GPUs
bash tools/dist_test.sh configs/gss/cityscapes/gss-ff_r101_768x768_80k_cityscapes.py ./<ckp_dir>/gss-ff_r101_768x768_80k_cityscapes_iter_80000.pth 8 --eval mIoU
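For readers unfamiliar with the metrics behind `--eval mIoU`, here is a minimal pure-Python sketch of how mIoU and mAcc can be derived from a confusion matrix. It is an illustration only, not the mmsegmentation implementation; the helper names, the toy labels, and the ignore index of 255 are assumptions made for this example.

```python
def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Count pixels per (ground-truth class, predicted class) pair."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        if g != ignore_index:  # skip unlabeled pixels
            conf[g][p] += 1
    return conf


def miou_and_macc(conf):
    """Return (mIoU, mAcc), averaging only over classes where each metric is defined."""
    ious, accs = [], []
    for c in range(len(conf)):
        tp = conf[c][c]
        gt_count = sum(conf[c])                   # pixels labeled as class c
        pred_count = sum(row[c] for row in conf)  # pixels predicted as class c
        union = gt_count + pred_count - tp
        if union > 0:
            ious.append(tp / union)
        if gt_count > 0:
            accs.append(tp / gt_count)
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy flattened label maps (hypothetical data) with one ignored pixel.
gt = [0, 0, 1, 1, 2, 255]
pred = [0, 1, 1, 1, 2, 2]
miou, macc = miou_and_macc(confusion_matrix(pred, gt, num_classes=3))
print(f"mIoU={miou:.4f}, mAcc={macc:.4f}")
```

In practice the confusion matrix is accumulated over the whole validation set before the per-class ratios are taken, rather than averaged per image.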

Train

The training process is divided into three stages: (1) latent posterior learning of $\mathcal{X}$; (2) latent prior learning; and (3) latent posterior learning of $\mathcal{X}^{-1}$, which is needed only by GSS-FT-W. See TRAIN.md for more information.
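To make the roles of $\mathcal{X}$ and $\mathcal{X}^{-1}$ concrete, the sketch below assumes that $\mathcal{X}$ maps a class-index mask to an RGB maskige and that $\mathcal{X}^{-1}$ maps a (possibly imperfect) maskige back to class indices. The fixed palette and the nearest-color decoding are hypothetical stand-ins chosen for illustration; they are not the transforms learned by this repository.

```python
# Hypothetical 3-class palette (assumption for illustration only).
PALETTE = [(0, 0, 0), (255, 0, 0), (0, 0, 255)]


def mask_to_maskige(mask, palette=PALETTE):
    """Stand-in for X: map each class index to an RGB color."""
    return [[palette[c] for c in row] for row in mask]


def maskige_to_mask(maskige, palette=PALETTE):
    """Stand-in for X^{-1}: assign each pixel to the class with the nearest color."""
    def nearest(rgb):
        return min(
            range(len(palette)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(rgb, palette[c])),
        )
    return [[nearest(rgb) for rgb in row] for row in maskige]


mask = [[0, 1], [2, 1]]
maskige = mask_to_maskige(mask)

# Simulate an imperfect reconstruction by shifting every channel slightly.
noisy = [[tuple(min(255, v + 20) for v in rgb) for rgb in row] for row in maskige]
recovered = maskige_to_mask(noisy)
print(recovered == mask)
```

Well-separated colors make the inverse mapping tolerant to small reconstruction errors, which is one reason the choice of mask-to-image transform matters.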

Reference

@inproceedings{chen2023generative,
  title={Generative Semantic Segmentation},
  author={Chen, Jiaqi and Lu, Jiachen and Zhu, Xiatian and Zhang, Li},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
