mahyarnajibi / SSH

SSH: Single Stage Headless Face Detector

Feature map fusion without channel reduction?

Hi @po0ya and @mahyarnajibi,

First of all, thank you for the discussion on my previous post; I definitely benefited from it!

I have a question regarding channel reduction and would be glad if you could share your suggestions. The paper states: "...to decrease the memory consumption of the model, the number of channels in the feature map is reduced from 512 to 128 using 1 x 1 convolutions."
I am trying to run an experiment in which only feature map fusion is done (without reducing the channels). However, by doing so, I encountered exploding gradients. One possible direction is, of course, to lower the learning rate, but I am holding off on that for now, as I suspect I may have written the prototxt incorrectly or missed something I did not realize... may I know whether you also encountered a nan loss in your experiments when the channels are not reduced?
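
For reference, the reduction I am omitting would be a 1 x 1 convolution along the lines of the sketch below (the layer name and filler settings here are only illustrative, not the repository's actual prototxt):

# Illustrative 1x1 convolution reducing conv4_3 from 512 to 128 channels
layer {
  name: "conv4_3_reduced"
  type: "Convolution"
  bottom: "conv4_3"
  top: "conv4_3_reduced"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 128
    kernel_size: 1
    stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}

The training log from my fusion-only run is: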

Iteration 20 (1.09985 iter/s, 18.1843s/20 iters), loss = nan
Train net output #0: m1@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #1: m1@ssh_reg_loss = nan (* 1 = nan loss)
Train net output #2: m2@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #3: m2@ssh_reg_loss = nan (* 1 = nan loss)
Train net output #4: m3@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #5: m3@ssh_reg_loss = nan (* 1 = nan loss)

Prototxt modification is as follows:

#==========CONV4 Backwards for M1======

# Upsample conv5_3
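# With group = num_output = 512 this is a per-channel (depthwise) deconvolution;
# the fixed "bilinear" weights and lr_mult 0 make it act as plain 2x bilinear upsampling.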
layer {
  name: "conv5_3_up"
  type: "Deconvolution"
  bottom: "conv5_3"
  top: "conv5_3_up"
  convolution_param {
    kernel_size: 4 
    stride: 2
    num_output: 512
    group: 512
    pad: 1
    weight_filler { type: "bilinear" }
    bias_term: false
  }
  param { lr_mult: 0 decay_mult: 0 }
}



# Crop conv5_3
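# The deconv output size is stride*(H-1) + kernel - 2*pad = 2*H, which can exceed
# conv4_3's spatial size by one pixel (conv5_3 comes from a stride-2 pooling that
# rounds up), so crop back to conv4_3's dimensions before the element-wise sum.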
layer {
  name: "conv5_3_crop"
  type: "Crop"
  bottom: "conv5_3_up"
  bottom: "conv4_3"
  top: "conv5_3_crop"
  crop_param {
    axis: 2
    offset: 0
  }
}

# Eltwise summation
layer {
  name: "conv4_fuse"
  type: "Eltwise"
  bottom: "conv5_3_crop"
  bottom: "conv4_3"
  top: "conv4_fuse"
  eltwise_param {
    operation: SUM
  }
}
# Perform final 3x3 convolution
layer {
  name: "conv4_fuse_final"
  type: "Convolution"
  bottom: "conv4_fuse"
  top: "conv4_fuse_final"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "conv4_fuse_final_relu"
  type: "ReLU"
  bottom: "conv4_fuse_final"
  top: "conv4_fuse_final"
}

Hi,
I have the same error when training SSH on a new dataset: the loss is 87 and then nan. Do you have any experience with this? I have tried changing the NMS and positive-sample thresholds to 0.8; the loss then decreases, but the results of testing the trained model are bad.

Hi @xiaofanglegoc, is there any modification to the network other than using another dataset? You can also try gradient clipping, as mentioned by @po0ya; that solved the problem I faced. :)
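
For reference, gradient clipping in Caffe can be enabled from the solver prototxt via the clip_gradients field, which rescales the gradients whenever their global L2 norm exceeds the given threshold. A minimal excerpt (the value below is only an illustrative choice, not a setting taken from this repository):

# Solver prototxt excerpt: rescale all gradients when their global L2 norm exceeds 10.
clip_gradients: 10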

@loackerc I have not changed the network; I only changed lib/dataset/wider.py for my new dataset, and I have organized the new dataset in PASCAL VOC format. The network takes its input from imdb.py.

@po0ya could you please give more details on the gradient clipping? Thanks