svip-lab / PlaneDepth

[CVPR2023] This is an official implementation for "PlaneDepth: Self-supervised Depth Estimation via Orthogonal Planes".

Inquiry about the HR images' resolution and their processing method

wangjiyuan9 opened this issue · comments

Dear authors,

I am currently following your fantastic work and have noticed that in your paper, you mentioned fine-tuning the network on high-resolution images of 1280x384. However, in the KITTI dataset, the maximum resolution shown in image_02 is 1242x375. Therefore, I would like to inquire whether the high-resolution images were cropped or resized to achieve the resolution of 1280x384. If cropping was used, could you please provide more information on how the cropping was performed? If resizing was used, could you please clarify the method used for resizing? Thank you for your time and consideration.

Sincerely!

Hi Jiyuan! Thank you for your interest and kind words. We do not crop the images; we only resize them to 1280x384 during the fine-tuning stage, to utilize the vertical image position cue. The processing code is here, which resizes the images using mode="bicubic".

You also pointed out that the image resolution of KITTI is just 1242x375. We upsample the images because we follow FalNet's evaluation resolution of 1280x384, and the resolution during fine-tuning has to be consistent with the one used during evaluation.
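The upsampling step described above can be sketched with PyTorch's `F.interpolate` in bicubic mode. This is a minimal illustration, not the repository's exact code; the tensor layout and the [0, 1] value range are assumptions.

```python
import torch
import torch.nn.functional as F

def resize_to_finetune_res(img, size=(384, 1280)):
    """Bicubically resize an image tensor (C, H, W) with values in [0, 1]
    to the fine-tuning/evaluation resolution (H, W) = (384, 1280)."""
    # F.interpolate expects a batch dimension: (N, C, H, W)
    out = F.interpolate(img.unsqueeze(0), size=size,
                        mode="bicubic", align_corners=False)
    # bicubic interpolation can overshoot slightly; clamp to the valid range
    return out.squeeze(0).clamp(0.0, 1.0)

# A raw KITTI image_02 frame is at most 375x1242 (H x W)
kitti_img = torch.rand(3, 375, 1242)
resized = resize_to_finetune_res(kitti_img)
print(resized.shape)  # torch.Size([3, 384, 1280])
```

Note that 375 → 384 and 1242 → 1280 are both upsampling operations, so no image content is discarded, only slightly stretched.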

I hope this answers your question. Please feel free to contact me if you have any further issues.

Thanks! WaWaYu!
Here is a further question:
Can I assume that image resizing has little effect on depth estimation? For example, if I resize an image to 1216x352, add data augmentation, and then perform depth estimation, the difference in results between this approach and performing the augmentation directly at the original resolution should not be significant, right?

Hi! Let me explain my view; please let me know if I have misunderstood your meaning.
In my opinion, the relationship between resolution and performance is highly dependent on the training strategy.

Without Cropping:
As we know, if we only resize all inputs to a resolution A without cropping during training, the network will "overfit" to resolution A and will only perform well at A during evaluation. For inputs at any other resolution, it will predict worse depth. In this case, you would need to train a network specifically for 1216x352 to get better performance.
For example, since our models are fine-tuned without cropping at 1280x384, the fine-tuned and distilled models will predict worse depth at 1216x352, so you would need to fine-tune a new one at 1216x352. (Since the two resolutions are close, the effect may not be obvious.)

With Cropping:
Plane-based networks trained with cropping and resizing can work at different resolutions.
Since our stage-1 model is plane-based and trained with cropping at 640x192, it works at various resolutions; the result at 1216x352 is comparable to the result at 1280x384.
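The crop-then-resize augmentation mentioned above can be sketched as follows. This is a generic sketch of the technique, not the repository's exact pipeline; the scale range and sampling scheme are assumptions.

```python
import random

import torch
import torch.nn.functional as F

def random_crop_resize(img, out_hw=(192, 640)):
    """Randomly crop a (C, H, W) image tensor, then bicubically resize
    the crop to the stage-1 training resolution (H, W) = (192, 640).
    Seeing many crop sizes mapped to one output size discourages the
    network from overfitting to a single input resolution."""
    _, h, w = img.shape
    # sample a crop covering 70-100% of each dimension (assumed range)
    scale = random.uniform(0.7, 1.0)
    ch, cw = int(h * scale), int(w * scale)
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    crop = img[:, top:top + ch, left:left + cw]
    out = F.interpolate(crop.unsqueeze(0), size=out_hw,
                        mode="bicubic", align_corners=False)
    return out.squeeze(0).clamp(0.0, 1.0)

img = torch.rand(3, 375, 1242)
aug = random_crop_resize(img)
print(aug.shape)  # torch.Size([3, 192, 640])
```

Because the network sees crops of varying effective scale during training, it learns features that transfer across test resolutions such as 1216x352 and 1280x384.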

As long as the networks (both with and without cropping) are tested at an appropriate resolution, the higher the resolution, the better the performance.
The result at 1280x384 is much better than that at 640x192. However, since 1216x352 and 1280x384 are both close to the original resolution of 1242x375, they give comparable results.

I hope this helps; please let me know if you have any further questions or if I have misunderstood your meaning ;)

A very insightful understanding, thank you for your response, it was very helpful!