aosokin / os2d

OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features

Long Evaluation time and question about environment for training

chasb799 opened this issue · comments

Hello,

First of all, I wanted to thank you for the great work. I want to train the OS2D V2 model with my own data, but first I want to run the training with the GroZi dataset to check that the training works. The training speed is pretty good, but I noticed that the evaluation of the 84 images takes around 50 minutes. I wonder why the evaluation is so slow, because it's basically just a forward pass with score computation, whereas training also needs a backward pass. Do you have any idea why this is happening? I use Python 3.7, PyTorch 1.13.1+cu117, CUDA 11.1 (but I think it will use CUDA 11.7 from the PyTorch installation) and tried different GPUs (GTX 1660, 1070 and A100). What environment did you use for training and evaluation (I saw the GPUs used in the paper)? Can I reproduce that setup with the Dockerfile in the repo?

Evaluation can be slow because of the image pyramid, and specifically its largest layer (this is needed to deal with both small and large objects potentially present in the images). If you really want faster on-the-fly evaluation, you can chop off the last pyramid layer or reduce its resolution.
As for hardware, I did not use any GPUs faster than a V100 when working on this project.

Please see this reply: #65 (comment)

But is there no image pyramid used while training? And is the pyramid not just a scaling of the input image before putting it into the feature extractor? Where in the architecture is this layer exactly? Sorry for the basic question, but object detection is a new area for me...

There is no pyramid involved in training (we used random sampling of scale instead), but there is one for evaluation. Our pyramid means that we rescale an image multiple times and process each pyramid layer with the CNN. The runtime is roughly proportional to the total number of pixels in the pyramid (summed over all its layers).
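
To get a rough feel for how much the largest layer contributes, here is a minimal sketch that counts pixels per pyramid layer and estimates the share of runtime saved by dropping the largest one. It relies only on the "runtime is roughly proportional to total pixels" rule of thumb above. The function names, the image size and the scale factors are made up for illustration; they are not taken from the OS2D code or its config, so plug in the scales from your own evaluation settings.

```python
def pyramid_pixel_counts(width, height, scales):
    """Pixel count of each pyramid layer for the given scale factors."""
    return [round(width * s) * round(height * s) for s in scales]


def saving_from_dropping_largest(width, height, scales):
    """Fraction of total pyramid pixels removed by dropping the largest layer.

    Under the assumption that runtime is proportional to the total pixel
    count, this is also a rough estimate of the runtime saved.
    """
    counts = pyramid_pixel_counts(width, height, scales)
    return max(counts) / sum(counts)


# Hypothetical example values, NOT the defaults of the OS2D repo.
scales = [0.5, 1.0, 2.0]
counts = pyramid_pixel_counts(1000, 800, scales)
saved = saving_from_dropping_largest(1000, 800, scales)

for s, c in zip(scales, counts):
    print(f"scale {s}: {c} pixels")
print(f"estimated share of runtime in the largest layer: {saved:.1%}")
```

Because the pixel count grows with the square of the scale factor, the largest layer usually dominates the total, which is why removing it or lowering its resolution tends to give the biggest speed-up.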