SamsungLabs / imvoxelnet

[WACV2022] ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

Train & eval inputs for different benchmarks

DianCh opened this issue · comments

Hi! Thank you for releasing this wonderful work! I am wondering what the inputs look like for different benchmarks, i.e. how many images are used to predict the bounding boxes during training and evaluation? Is it a stereo pair for KITTI and multi-view for SUN RGB-D/ScanNet (and if so, how are the multi-view inputs selected)?

Hi @DianCh ,
We use a single image for KITTI and SUN RGB-D, 6 images for NuScenes and 50 images for ScanNet.

Thank you @filaPro for the reply! Just trying to understand the dataset protocol:

For SUN RGB-D, is it that only visible 3D gt boxes are used for supervision?
For ScanNet, is it that 20/50 images are randomly sampled per scene, and all 3D gt boxes in that scene are used for supervision/evaluation?

For SUN RGB-D all boxes are visible, as a scene is represented by a single RGB-D image. For ScanNet you are right: all boxes are used for supervision and evaluation.
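The ScanNet protocol described above (randomly sampling a fixed number of views per scene) could be sketched as follows. This is a minimal illustration, not the repo's actual data-loading code; the function name, the `n_views=50` default, and the with-replacement fallback for short scenes are assumptions.

```python
import random

def sample_views(scene_images, n_views=50, seed=None):
    """Randomly sample n_views images from a scene's frame list.

    Hypothetical helper: if the scene has fewer frames than n_views,
    sample with replacement so the model always sees a fixed-size input.
    """
    rng = random.Random(seed)
    if len(scene_images) >= n_views:
        return rng.sample(scene_images, n_views)
    return [rng.choice(scene_images) for _ in range(n_views)]

# Example: a scene with 120 frames, pick 50 for one training sample.
frames = [f"frame_{i:04d}.jpg" for i in range(120)]
views = sample_views(frames, n_views=50, seed=0)
print(len(views))  # 50
```

All 3D ground-truth boxes of the scene would then be used as supervision for whichever subset of views was drawn.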