NVlabs / neuralrgbd

Neural RGB→D Sensing: Per-pixel depth and its uncertainty estimation from a monocular RGB video

Evaluation problem on KITTI Eigen split

jiaxinxie97 opened this issue · comments

Hi Chao,
In the released test code, we can only get the depth map and confidence map from frame 7 to frame N-9 of a test scene (N is the length of the test scene). But many of the 697 test images in the test split are not included in this range, such as frame 0. How do you do the evaluation?
According to my understanding of your code, dmap is not used at test time, so I can remove the code that ignores the first and last 5 frames. But since t_win=2, we still can't get the depth maps of frame 0, frame 1, frame N-2, and frame N-1.

According to your paper and the closed issue #1, you only used the KITTI raw data. But in this script:

dmap_file = '%s/%s/%s/proj_depth/groundtruth/image_02/%s'%( database_path_base, mode, sceneName, imgname)
we can see that you use the KITTI single-image depth prediction benchmark's depth maps. (Although they are processed from the raw 3D Velodyne point clouds, they are denser and will perform better than depth maps projected from the raw point clouds.)
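For reference, reading one of these benchmark depth maps follows the standard KITTI convention (16-bit PNG, depth in metres = pixel value / 256, zero meaning no measurement). The snippet below is only a rough sketch and the helper name is made up:

```python
# Rough sketch: load a KITTI depth-benchmark ground-truth PNG.
# Assumes the standard benchmark encoding (uint16 PNG, depth = value / 256,
# zero = no measurement); `load_kitti_depth` is a hypothetical helper name.
import numpy as np
from PIL import Image

def load_kitti_depth(png_path):
    depth_png = np.array(Image.open(png_path), dtype=np.uint16)
    assert depth_png.max() > 255, 'expected a 16-bit KITTI depth map'
    depth = depth_png.astype(np.float32) / 256.0   # metres
    valid_mask = depth_png > 0                     # zero = no ground truth
    return depth, valid_mask
```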
Another question is about the default setting of the released code. In the default setting, the KITTI model is trained on the KITTI single-image depth prediction benchmark split:
fun_get_paths = lambda traj_indx: dl_kitti.get_paths(traj_indx,split_txt= './mdataloader/kitti_split/training.txt', mode='train',
but tested on the Eigen split:
split_file = './mdataloader/kitti_split/test_eigen.txt' if args.split_file=='.' else args.split_file
These are totally different splits, and some scenes in the Eigen test set are also in the benchmark training set, such as 2011_09_26_drive_0009_sync. A scene can't appear in both the test set and the training set.
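A quick way to check this overlap directly (a rough sketch; it assumes each non-empty line of the split .txt files starts with a drive name such as 2011_09_26_drive_0009_sync):

```python
# Rough sketch: report scenes that appear in both split files.
# Assumes each non-empty line starts with a drive name; adjust the
# parsing if the actual split-file format differs.
def read_scenes(split_path):
    with open(split_path) as f:
        return {line.strip().split()[0] for line in f if line.strip()}

train_scenes = read_scenes('./mdataloader/kitti_split/training.txt')
test_scenes = read_scenes('./mdataloader/kitti_split/test_eigen.txt')
print('scenes in both splits:', sorted(train_scenes & test_scenes))
```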

commented

According to your paper and the closed issue #1,

Yes, the projected depth from every 5 frames is used to get denser depth maps, and it should be downloaded from the link I provided there.

We didn't mention the KITTI raw data in the paper. For training, we use the projected depth maps from every 5 frames to get denser depth maps.

In issue #1, by 'raw data' I meant what was asked for, not the training data. But thanks for bringing this up; I've reflected this point in the readme.

These are totally different splits, and some scenes in the Eigen test set are also in the benchmark training set, such as 2011_09_26_drive_0009_sync. A scene can't appear in both the test set and the training set.

As commented in the test bash script, you need to specify the test split unless you are testing on the 7Scenes dataset.
This default setting is obsolete and never used; it was for training on the Eigen split.

Thank you for your reply! I hadn't double-checked your paper, so I remembered it wrong, sorry! The denser depth maps downloaded from the KITTI website are projected from every 11 laser scans. So can I understand that you projected the ground-truth depth from every 5 laser scans yourself?
Another, more important problem for me is how you got the results for the 697 test images. As I said in my first comment, I can get the depth of frame 0 and frame N-1 by using D-Net + R-Net, and the other middle frames by K-Net + R-Net. But I still want to know how you got the results in your paper. Or could you upload your depth maps and confidence maps for these 697 test images? That would make it easier for us to compare.
Thank you for your time!

commented

in my first comment, I can get the depth of frame 0 and frame N-1 by using D-Net + R-Net, and the other middle frames by K-Net + R-Net.

Yes. You can get the depths for frames 0 ~ N-1 by using D-Net and R-Net. As for K-Net + R-Net, you can also use it on frame N-1; as long as there is a depth estimate from the previous frame, you should be able to use K-Net.
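To make the workflow concrete, here is a rough pseudocode sketch of that per-frame strategy; d_net, k_net, r_net, and warp_dpv below are stand-ins, not the actual function names in this repo:

```python
# Pseudocode sketch of the per-frame strategy described above.
# d_net, k_net, r_net and warp_dpv are placeholders for the released
# models / helpers, not the real names used in the code.
def estimate_sequence(frames, d_net, k_net, r_net, warp_dpv):
    prev_dpv = None
    outputs = []
    for frame in frames:
        if prev_dpv is None:
            dpv = d_net(frame)                      # no previous estimate: D-Net
        else:
            dpv = k_net(frame, warp_dpv(prev_dpv))  # integrate previous DPV: K-Net
        depth, confidence = r_net(dpv)              # R-Net refines the volume
        outputs.append((depth, confidence))
        prev_dpv = dpv
    return outputs
```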

But I still want to know how you got the results in your paper.

To get the results in the paper (including for the methods we compare against), I used the test data split in mdataloader/kitti_split/testing.txt and ran all methods on this split. There are 3439 frames in total in that split. I'm not sure where those '697 test images' come from. Maybe you mean the test images for the single-image depth estimation benchmark?

As for how I dealt with the first and last frames: I simply excluded them from the evaluation (for all methods). Although their number is small compared with the total number of test images (3439), a better way would be to do what you described in your comment. But I don't expect that to change the overall performance metrics.
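In practice the exclusion amounts to something like the sketch below (the containers and the abs-rel metric are only illustrative; the actual evaluation code may differ):

```python
# Rough sketch: skip the first/last frames of each sequence and compute
# the common absolute-relative-error metric on the rest. The prediction /
# ground-truth lists of numpy arrays are hypothetical.
import numpy as np

def abs_rel(pred, gt):
    mask = gt > 0                                   # pixels with ground truth
    return np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])

def evaluate_sequence(pred_depths, gt_depths, skip=1):
    errors = [abs_rel(p, g)
              for p, g in zip(pred_depths[skip:-skip], gt_depths[skip:-skip])]
    return float(np.mean(errors))
```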

Oh, I see. You didn't train on the Eigen split but on the single-image depth prediction benchmark split. But the three methods you compared against were trained on the Eigen split. If you use their pretrained models to evaluate on your 3439 test images, some images will be in both the training set and the test set for these three methods. For example, 2011_09_26_drive_0005_sync is in the training set of the Eigen split but is also among your test images. It may sound like this evaluation is favorable to them and unfavorable to you, so neural RGBD should exceed them by even more.
But it is still hard to say: you use more scenes for training, and your test set is not standard. For example, 2011_10_03_drive_0047_sync and 2011_09_26_drive_0023_sync take up half of your test set. Is it possible that your training set contains scenes similar to these two while the other methods' training sets don't?
When there are too many variables, we can't make a fair comparison. I say this not because I suspect that the performance of neural RGBD is not SOTA; we can see that you have shown top-view point clouds that other methods would not show. But if the comparison is not stated clearly, more and more followers will be confused. If you retrained the other three methods on mdataloader/kitti_split/training.txt, you can ignore what I said.

commented

The data split has been clarified in the .txt split files.
Given your concern, you are encouraged to train neural-rgbd on the data split you are using for the other methods for comparison.
I agree with your point on the fairness of the comparison. That's one of the reasons why cross-dataset training and testing is needed (e.g., train on an indoor dataset and test on an outdoor dataset).