mahyarnajibi / SSH

SSH: Single Stage Headless Face Detector

Feature map fusion without channel reduction?

Hi @po0ya and @mahyarnajibi,

First of all, thank you for the discussion on my previous post; I definitely benefited from it!

I have a question regarding channel reduction and would be glad if you could share your suggestions. The paper states: "...to decrease the memory consumption of the model, the number of channels in the feature map is reduced from 512 to 128 using 1 x 1 convolutions."
I am trying to run an experiment in which only feature map fusion is done (without reducing the channels). However, by doing so, I encountered exploding gradients. One possible direction is, of course, to lower the learning rate, but I am holding off on that for now, as I suspect I may have written the prototxt incorrectly or missed something I did not realize... may I know whether you also encountered a nan loss in your experiments when the channels are not reduced?
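
For reference, the reduction I am omitting would be a 1 x 1 convolution along the lines of the sketch below (the layer name and filler settings here are only illustrative, not the repository's actual prototxt):

# Illustrative 1x1 convolution reducing conv4_3 from 512 to 128 channels
layer {
  name: "conv4_3_reduced"
  type: "Convolution"
  bottom: "conv4_3"
  top: "conv4_3_reduced"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 128
    kernel_size: 1
    stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}

The training log from my fusion-only run is: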

Iteration 20 (1.09985 iter/s, 18.1843s/20 iters), loss = nan
Train net output #0: m1@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #1: m1@ssh_reg_loss = nan (* 1 = nan loss)
Train net output #2: m2@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #3: m2@ssh_reg_loss = nan (* 1 = nan loss)
Train net output #4: m3@ssh_cls_loss = 87.3365 (* 1 = 87.3365 loss)
Train net output #5: m3@ssh_reg_loss = nan (* 1 = nan loss)

Prototxt modification is as follows:

#==========CONV4 Backwards for M1======

# Upsample conv5_3
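# With group = num_output = 512 this is a per-channel (depthwise) deconvolution;
# the fixed "bilinear" weights and lr_mult 0 make it act as plain 2x bilinear upsampling.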
layer {
  name: "conv5_3_up"
  type: "Deconvolution"
  bottom: "conv5_3"
  top: "conv5_3_up"
  convolution_param {
    kernel_size: 4 
    stride: 2
    num_output: 512
    group: 512
    pad: 1
    weight_filler { type: "bilinear" }
    bias_term: false
  }
  param { lr_mult: 0 decay_mult: 0 }
}



# Crop conv5_3
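# The deconv output size is stride*(H-1) + kernel - 2*pad = 2*H, which can exceed
# conv4_3's spatial size by one pixel (conv5_3 comes from a stride-2 pooling that
# rounds up), so crop back to conv4_3's dimensions before the element-wise sum.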
layer {
  name: "conv5_3_crop"
  type: "Crop"
  bottom: "conv5_3_up"
  bottom: "conv4_3"
  top: "conv5_3_crop"
  crop_param {
    axis: 2
    offset: 0
  }
}

# Eltwise summation
layer {
  name: "conv4_fuse"
  type: "Eltwise"
  bottom: "conv5_3_crop"
  bottom: "conv4_3"
  top: "conv4_fuse"
  eltwise_param {
    operation: SUM
  }
}
# Perform final 3x3 convolution
layer {
  name: "conv4_fuse_final"
  type: "Convolution"
  bottom: "conv4_fuse"
  top: "conv4_fuse_final"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "conv4_fuse_final_relu"
  type: "ReLU"
  bottom: "conv4_fuse_final"
  top: "conv4_fuse_final"
}

Hi,
I have the same error when training SSH on a new dataset: the loss is 87 and then nan. Do you have any experience with this? I have tried changing the NMS and positive-sample thresholds to 0.8; the loss then decreases, but the results of testing the trained model are bad.

Hi @xiaofanglegoc, is there any modification to the network other than using another dataset? You can also try gradient clipping, as mentioned by @po0ya; that solved the problem I faced. :)
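
For reference, gradient clipping in Caffe can be enabled from the solver prototxt via the clip_gradients field, which rescales the gradients whenever their global L2 norm exceeds the given threshold. A minimal excerpt (the value below is only an illustrative choice, not a setting taken from this repository):

# Solver prototxt excerpt: rescale all gradients when their global L2 norm exceeds 10.
clip_gradients: 10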

@loackerc I have not changed the network; I only changed lib/dataset/wider.py for my new dataset, and I have organized the new dataset in PASCAL VOC format. The network takes its input from imdb.py.

@po0ya could you please give more details on the gradient clipping? Thanks