MEDUSA

An RGB-D object detection paper accepted at BMVC 2021.


Exploiting Scene Depth for Object Detection with Multimodal Transformers

  • Oct 17, 2021: Our work was accepted at BMVC 2021.

Citation

Please consider citing our paper if it is useful in your research.

Exploiting Scene Depth for Object Detection with Multimodal Transformers

by Hwanjun Song1, Eunyoung Kim2, Varun Jampani2, Deqing Sun2, Jae-Gil Lee3, and Ming-Hsuan Yang2,4,5

1 NAVER AI Lab, 2 Google Research, 3 KAIST, 4 University of California, Merced, 5 Yonsei University

@inproceedings{MEDUSA,
  title={Exploiting Scene Depth for Object Detection with Multimodal Transformers},
  author={Song, Hwanjun and Kim, Eunyoung and Jampani, Varun and Sun, Deqing and Lee, Jae-Gil and Yang, Ming-Hsuan},
  booktitle={BMVC},
  year={2021}
}

Overview

This project implements a framework that fuses RGB and depth information using multimodal Transformers in the context of object detection. The goal is to show that inferred depth maps can push the limits of appearance-based object detection. To this end, we propose a generic framework, MEDUSA (Multimodal Estimated-Depth Unification with Self-Attention). Unlike previous methods that use depth measured by physical sensors such as Kinect and LiDAR, we show that depth maps inferred by a monocular depth estimator can play an important role in enhancing the performance of modern object detectors. To make use of the estimated depth, MEDUSA comprises a robust feature extraction phase, followed by multimodal Transformers for RGB-D fusion, as illustrated in the overview figure.

What it is. Unlike previous studies that use depth measured by physical sensors such as Kinect and LiDAR, MEDUSA is a novel object detection pipeline that encompasses a robust feature extractor for RGB images and their noisy (estimated) depth maps, followed by multimodal Transformers that achieve effective RGB-D fusion via the self-attention mechanism.
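As a rough illustration (a minimal sketch, not the actual MEDUSA code; the class name, token shapes, and learned modality embeddings below are assumptions), RGB and depth feature tokens can be fused by concatenating them and letting a shared Transformer encoder attend across both modalities:

import torch
import torch.nn as nn

class RGBDFusionEncoder(nn.Module):
    """Toy RGB-D fusion via self-attention over concatenated tokens."""
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learned modality embeddings mark which tokens come from the RGB
        # stream and which from the (estimated) depth stream.
        self.rgb_embed = nn.Parameter(torch.zeros(1, 1, d_model))
        self.depth_embed = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, rgb_tokens, depth_tokens):
        # Both inputs: (num_tokens, batch, d_model), i.e., flattened feature maps.
        tokens = torch.cat([rgb_tokens + self.rgb_embed,
                            depth_tokens + self.depth_embed], dim=0)
        # Self-attention runs over the joint sequence, so every RGB token can
        # attend to every depth token and vice versa (cross-modal fusion).
        return self.encoder(tokens)

# Dummy usage: 100 tokens per modality, batch of 2, 256-dim features.
fused = RGBDFusionEncoder()(torch.randn(100, 2, 256), torch.randn(100, 2, 256))
print(fused.shape)  # torch.Size([200, 2, 256])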

About the code. This code builds on the original implementation of DETR, a Transformer-based object detection framework, and requires PyTorch 1.5+.

Microsoft COCO Data with Inferred Depth Maps

We extend the large-scale object detection dataset Microsoft COCO by applying the state-of-the-art monocular depth estimator MiDaS (paper). The extracted depth maps for Microsoft COCO are available [here]. Please put the two folders, train2017_depth and val2017_depth, in the same location as the train2017 and val2017 folders.
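For reference, pairing an RGB image with its inferred depth map under the folder layout above could look like the minimal sketch below. This is an assumption-laden example, not part of the released code: it assumes each depth file shares the RGB file stem, uses a .png extension, and can be opened with PIL; adjust the extension and format to match the released archive.

from pathlib import Path
from PIL import Image

def load_rgbd(coco_root, split, file_name):
    # split is 'train2017' or 'val2017'; the depth folder sits next to it
    # as '<split>_depth', as described above.
    root = Path(coco_root)
    rgb = Image.open(root / split / file_name).convert('RGB')
    depth_name = Path(file_name).with_suffix('.png').name  # assumed extension
    depth = Image.open(root / (split + '_depth') / depth_name)
    return rgb, depth

rgb, depth = load_rgbd('/home/Research/COCO2017', 'train2017', '000000000009.jpg')
print(rgb.size, depth.size)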

Training

Please run run_distributed_medusa.py after setting suitable hyperparameters inside the file.

# Hyperparameters in run_distributed_medusa.py (e.g., an 8-GPU setup).
batch_size = '4'                  # training batch size
image_height = 800                # input image height
image_width = 1333                # input image width (max side)
num_workers = '2'                 # data-loading workers
epochs = '150'                    # total training epochs
lr_drop = 100                     # epoch at which the learning rate drops
print_freq = 200                  # logging frequency
path = '/home/Research/COCO2017'  # COCO root, also containing the *_depth folders

# Run the training script.
python run_distributed_medusa.py
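The internals of run_distributed_medusa.py are not documented here; purely as a hypothetical sketch (the entry-point name main.py and its flags are assumptions borrowed from the DETR codebase this project builds on), a wrapper of this kind typically forwards the string-valued hyperparameters above to a torch.distributed.launch command:

import subprocess

# Hypothetical forwarding of the hyperparameters defined above to a
# DETR-style training entry point on 8 GPUs; flag names are assumptions.
cmd = [
    'python', '-m', 'torch.distributed.launch', '--nproc_per_node=8', '--use_env',
    'main.py',
    '--batch_size', batch_size,
    '--epochs', epochs,
    '--lr_drop', str(lr_drop),
    '--num_workers', num_workers,
    '--coco_path', path,
]
subprocess.run(cmd, check=True)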

Comparison with RGB-Only DETR

The convergence curves on mAP are summarized in the figure below. See our paper for details.

The box AP of MEDUSA compared with RGB-only DETR is summarized in the table below.

#  name    resolution (HxW)  schedule (epochs)  box AP       url
1  DETR    300x500           150                27.4         model | logs
2  MEDUSA  300x500           150                28.9 (+1.5)  model | logs
3  DETR    420x700           150                32.5         model | logs
4  MEDUSA  420x700           150                33.6 (+1.1)  model | logs
5  DETR    800x1333          150                38.0         model | logs
6  MEDUSA  800x1333          150                40.0 (+2.0)  model | logs

Acknowledgement

This project is based on DETR (paper). Thanks for their wonderful work.

License: Apache License 2.0
