aosokin / os2d

OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features

Help with changing the architecture of the net

roitmaster opened this issue

Hi, great work!

I tried your code and the v2 model on my data and the results are amazing.
However, because my data has large variation in image size and in distance to the objects, I get worse results.
To overcome this issue, I tried the following:

  1. Work on several pyramid levels.
    But because some images are large yet taken from a distance, I need to upscale the image by the pyramid multiplier; at that size the GPU runs out of memory.

  2. Run a sliding window over the image - but again, because of the size of the images, it takes a lot of time to process one image (see the tiling sketch after this list).
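
For reference, here is a minimal sketch of the tiling idea (the helper and its tile/overlap values are hypothetical, not from os2d):

```python
import torch

def make_tiles(image, tile=1500, overlap=300):
    """Yield (crop, x_offset, y_offset) overlapping crops covering the image.

    `image` is a CHW tensor; boxes detected in a crop must be shifted back
    by (x_offset, y_offset) to land in full-image coordinates.
    """
    _, h, w = image.shape
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            yield image[:, y:y + tile, x:x + tile], x, y
```

Every crop fits in GPU memory, but the many crops per image are exactly why this is slow.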

Now, after digging into your work, I saw that you work directly on the C4 level of the feature extractor ("resnet50").

My idea is to create an FPN that uses layers C3, C4, and C5 to build P3 to P7, and to apply the OS2D head on each created level (a sketch follows below).

To do that, I need to update the receptive field, the stride of the box generator, and the feature map size for each level throughout the code.
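
Roughly like this, using torchvision's FeaturePyramidNetwork (the C3/C4/C5 channel counts are resnet50's; 128 output channels matches the layer sizes I list below):

```python
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork
from torchvision.ops.feature_pyramid_network import LastLevelP6P7

# C3/C4/C5 channel counts for resnet50; 128 is the chosen pyramid width
fpn = FeaturePyramidNetwork(
    in_channels_list=[512, 1024, 2048],
    out_channels=128,
    extra_blocks=LastLevelP6P7(128, 128),  # derives P6/P7 from P5
)

# dummy backbone features for a 512x512 input (strides 8, 16, 32)
feats = OrderedDict(
    c3=torch.randn(1, 512, 64, 64),
    c4=torch.randn(1, 1024, 32, 32),
    c5=torch.randn(1, 2048, 16, 16),
)
pyramid = fpn(feats)  # P3..P5 under the input keys, plus "p6" and "p7"
for name, level in pyramid.items():
    print(name, tuple(level.shape))
# the OS2D head would then be applied to every level of `pyramid`
```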

My question is how to determine these parameters, given that the input size is 512 and the layer sizes are:
P3: [1, 128, 64, 64]
P4: [1, 128, 32, 32]
P5: [1, 128, 16, 16]
P6: [1, 128, 8, 8]
P7: [1, 128, 4, 4]

It would be wonderful if you could also explain how you calculated these parameters.

Hi, thanks for your feedback!

> However, because my data has large variation in image size and in distance to the objects, I get worse results.

This is indeed difficult for our method. Currently, we softly assume that all objects of interest get roughly the same number of pixels as the class templates. If this is hard to achieve by resizing and cropping the input images, the results will not be great.
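
As a made-up illustration of that assumption (both sizes below are invented numbers), you would want to rescale the input so the objects land at roughly the template size:

```python
# made-up numbers: rescale the input so objects roughly match the template
template_size = 240          # object size on the class template, in pixels
object_size_in_image = 60    # rough size of the same object in the input image
scale = template_size / object_size_in_image   # 4.0
# resizing the input by `scale` gives the objects roughly the number of
# pixels the method expects; a 4x upscale may already exceed GPU memory,
# which is exactly the problem you ran into
```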

> To overcome this issue, I tried the following:

The approaches you mentioned are indeed valid ways to get results :-)

> It would be wonderful if you could also explain how you calculated these parameters.

Please note that the "effective receptive field" we use is not really a receptive field but more like an anchor size converted from feature-map coordinates to image coordinates.
When defining the "effective receptive field" we assume that 3x3 convolutions with stride 1 do not change the receptive field and all the changes come from the layers with stride > 1 (conv or pool).
For example, each layer with stride=2 (typical case in ResNet) doubles both the "effective receptive field" and stride.
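
Concretely, for your 512 input this doubling rule gives (a quick sketch; under this convention the "effective receptive field" stays equal to the stride at every level):

```python
# stride and "effective receptive field" per pyramid level for a 512 input;
# each stride-2 layer doubles both, so the two numbers stay equal
input_size = 512
feature_map_sizes = {"P3": 64, "P4": 32, "P5": 16, "P6": 8, "P7": 4}
for name, fm_size in feature_map_sizes.items():
    stride = input_size // fm_size   # 8, 16, 32, 64, 128
    effective_receptive_field = stride
    print(f"{name}: stride={stride}, "
          f"effective receptive field={effective_receptive_field}")
```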

> My idea is to create an FPN that uses layers C3, C4, and C5 to build P3 to P7.

This is a possible approach. However, note that to compute correlations you will need compatible features at the different levels of the feature pyramid. The standard FPN scheme does not provide this because the layers producing the levels are not forced to be compatible. I don't know whether training them jointly will be enough to achieve this compatibility.
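
To make the compatibility concern concrete, here is a simplified cross-correlation of template features against one pyramid level (a sketch only; the actual OS2D head also performs spatial alignment of the templates). The scores are only meaningful if the template and the level embed into the same feature space, which is exactly what plain FPN does not guarantee:

```python
import torch
import torch.nn.functional as F

def correlate(level_feats, template_feats):
    """Cross-correlate template features with one pyramid level.

    level_feats: (1, C, H, W); template_feats: (1, C, h, w).
    Returns a (1, 1, H-h+1, W-w+1) similarity map; the values only make
    sense if both tensors live in the same embedding space.
    """
    x = F.normalize(level_feats, dim=1)     # per-location L2 normalization
    q = F.normalize(template_feats, dim=1)
    return F.conv2d(x, q)                   # template acts as a conv filter

# with an FPN head, the same template features would be correlated with
# every level P3..P7 - this is where cross-level compatibility is needed
```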