Mask DINO

By Feng Li*, Hao Zhang*, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum.

This repository is the official implementation of Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation (DINO is pronounced `daɪnoʊ', as in dinosaur).

Code will be available soon. Please stay tuned!

News

[2022/9] We released detrex, a toolbox that provides state-of-the-art Transformer-based detection algorithms. It includes DINO with better performance, and Mask DINO will also be released with a detrex implementation. You are welcome to use it!

Currently supported: DETR, Deformable DETR, Conditional DETR, Group-DETR, DAB-DETR, DN-DETR, DINO.

[2022/7] Code for DINO is available here!

[2022/5] DN-DETR is accepted to CVPR 2022 as an Oral presentation. Code is now available here.

[2022/4] DAB-DETR is accepted to ICLR 2022. Code is now available here.

[2022/3] We release DINO, a detection model that for the first time establishes a DETR-like model as the SOTA on the leaderboard. Code will be available here.

[2022/3] We built a repo, Awesome Detection Transformer, that collects papers on Transformers for detection and segmentation. Your attention is welcome!

Introduction

Abstract: In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K).
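The mask prediction step described in the abstract (query embeddings dot-producted with a pixel embedding map) can be illustrated roughly as follows. This is a minimal pure-Python sketch, not the repository's implementation: the function name `predict_binary_masks`, the nested-list data layout, and the choice to threshold logits at 0 (equivalent to a sigmoid probability above 0.5) are all assumptions made for illustration.

```python
def predict_binary_masks(query_embeddings, pixel_embeddings):
    """Return one H x W binary mask per query (hypothetical helper).

    query_embeddings: list of Q vectors, each of length C.
    pixel_embeddings: H x W grid (nested lists) of vectors, each of length C.
    """
    masks = []
    for query in query_embeddings:
        mask = []
        for row in pixel_embeddings:
            mask_row = []
            for pixel in row:
                # Dot product between the query and this pixel's embedding.
                logit = sum(q * p for q, p in zip(query, pixel))
                # Assumed threshold: logit > 0 is the same as sigmoid(logit) > 0.5.
                mask_row.append(1 if logit > 0 else 0)
            mask.append(mask_row)
        masks.append(mask)
    return masks


if __name__ == "__main__":
    # Toy inputs: 2 queries, a 2 x 2 pixel map, embedding size C = 2.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    pixels = [
        [[1.0, -1.0], [-1.0, 1.0]],
        [[2.0, 3.0], [-1.0, -2.0]],
    ]
    for mask in predict_binary_masks(queries, pixels):
        print(mask)
```

The explicit loops only make the per-pixel operation visible. In a real model the embeddings would be learned tensors and the product would presumably be computed as a single batched matrix operation.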

Results

SOTA results on Instance, Panoptic, and Semantic Segmentation.

We have established the best results on all three segmentation tasks to date.

Instance segmentation and object detection

Panoptic segmentation

Semantic segmentation

For more experimental results and ablation studies, please refer to our paper.

Model

We build upon the object detection model DINO (DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection) and extend it to segmentation tasks with minimal modifications.

Links

Our work is based on DINO and is closely related to the earlier works DN-DETR and DAB-DETR.

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
Hao Zhang*, Feng Li*, Shilong Liu*, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum.
arXiv 2022.
[paper] [code]

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising.
Feng Li*, Hao Zhang*, Shilong Liu, Jian Guo, Lionel M. Ni, Lei Zhang.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2022. Oral.
[paper] [code] [中文解读]

DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR.
Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang.
International Conference on Learning Representations (ICLR) 2022.
[paper] [code]

BibTeX

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{li2022mask,
      title={Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation}, 
      author={Feng Li and Hao Zhang and Huaizhe Xu and Shilong Liu and Lei Zhang and Lionel M. Ni and Heung-Yeung Shum},
      year={2022},
      eprint={2206.02777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
