Long Evaluation time and question about environment for training

Question

chasb799 opened this issue 8 months ago

Hello,

First of all, I wanted to thank you for the great work. I want to train the OS2D V2 model with my own data, but first I want to run the training on the GroZi dataset to check that it works. The training speed is pretty good, but I noticed that the evaluation of the 84 images takes around 50 minutes. I wonder why the evaluation is so slow, because it's basically just a forward pass with score computation, whereas training also needs a backward pass. Do you have any idea why this is happening? I use Python 3.7, PyTorch 1.13.1+cu117, CUDA 11.1 (but I think it will use CUDA 11.7 from the PyTorch installation) and tried different GPUs (GTX 1660, 1070 and A100). What environment did you use for training and evaluation (I saw the GPUs used in the paper)? Can I reproduce that setting with the Dockerfile in the repo?

Anton Osokin · Answer 1 · Tue Dec 05 2023 17:34:44 GMT+0800 (China Standard Time)

Evaluation can be slow because of the image pyramid, and specifically its largest layer (this is needed to deal with both small and large objects potentially present in the images). If you really want to make on-the-fly evaluation faster, you can chop off the last pyramid layer or reduce its resolution.
As for hardware, I did not use any GPUs faster than a V100 when working on this project.

Anton Osokin · Answer 2 · Tue Dec 05 2023 17:40:47 GMT+0800 (China Standard Time)

Please see this reply: #65 (comment)

Simon Bauer · Answer 3 · Wed Dec 06 2023 17:32:01 GMT+0800 (China Standard Time)

But is there no image pyramid used while training? And is the pyramid not just a scaling of the input image before putting it into the feature extractor? Where in the architecture is this layer exactly? Sorry for the basic question, but object detection is a new area for me...

Anton Osokin · Answer 4 · Thu Dec 07 2023 20:04:39 GMT+0800 (China Standard Time)

There is no pyramid involved in training (we used random sampling of scale instead), but there is one for evaluation. Our pyramid means that we rescale an image multiple times and process each pyramid layer with the CNN. The runtime is roughly proportional to the total number of pixels in the pyramid (summed over all its layers).
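
To get a feeling for how much the largest layer contributes, here is a minimal sketch that just counts pixels. It assumes the runtime really is roughly proportional to the total pixel count, as described above. The image size and the scale factors are made-up values for illustration, not the ones used in the OS2D configuration, and `pyramid_pixel_count` is a hypothetical helper, not a function from the repo.

```python
def pyramid_pixel_count(width, height, scales):
    """Total number of pixels summed over all pyramid layers.

    Each layer is the input image rescaled by one factor from `scales`,
    so its pixel count grows with the square of that factor.
    """
    return sum(round(width * s) * round(height * s) for s in scales)


# Hypothetical image size and pyramid scales (illustration only).
width, height = 1000, 750
scales = [0.5, 1.0, 2.0]

full = pyramid_pixel_count(width, height, scales)

# Same pyramid with the largest layer chopped off.
without_largest = pyramid_pixel_count(width, height, sorted(scales)[:-1])

print(f"full pyramid:          {full} pixels")
print(f"without largest layer: {without_largest} pixels")
print(f"relative cost:         {without_largest / full:.2f}")
```

With these example numbers the largest layer alone accounts for roughly three quarters of the pixels, so dropping it (or reducing its resolution) should cut most of the evaluation time, at the price of possibly missing small objects.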