andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

Question about fine-tuning the full sep model

LionnelBall opened this issue · comments

commented

Really nice job! I noticed that in the self-supervised shift model there is no gamma variable in slim.batch_norm for each conv layer (because 'bn_scale' is not set in shift_params.py). But in the full speech-separation model, each conv layer's slim.batch_norm does have a gamma ('bn_scale = True' in sep_params.py). So how can the full model be fine-tuned from the shift model, given that the two models differ in their gamma parameters? And if the weights in the shift model and the corresponding weights in the full model are the same, does the fine-tuning make sense?

Thanks! On this line:

opt_names += ['gen/', 'discrim/', 'global_step', 'gamma']
we specify that the gamma parameter should not be restored from the self-supervised network's checkpoint. Then, we re-initialize gamma to be approximately 1.
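For anyone else hitting this, here is a minimal sketch (not the repo's exact code) of how variables matching those name substrings can be skipped when restoring the shift-model checkpoint in a TF1-style setup. The skip list is the one quoted above; the checkpoint path and helper name are hypothetical.

```python
# Hedged sketch: restore a TF1 checkpoint while skipping variables whose
# names match the "do not restore" prefixes described above. Batch-norm
# gamma is then left at its fresh initialization (ones by default, i.e. ~1).
import tensorflow as tf

skip = ['gen/', 'discrim/', 'global_step', 'gamma']

def restorable_variables():
    """Return all global variables except those matching a skip pattern."""
    keep = []
    for v in tf.global_variables():
        if any(s in v.name for s in skip):
            continue  # e.g. batch-norm gamma: not loaded from the checkpoint
        keep.append(v)
    return keep

saver = tf.train.Saver(var_list=restorable_variables())
# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     saver.restore(sess, '/path/to/shift_model.ckpt')  # hypothetical path
```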

Sorry if that was confusing. In very early experiments, I was having trouble using the gamma parameter for the self-supervision task (it seemed to trend toward 0, and training would get stuck at chance performance), which is why I didn't use it there.
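In TF-slim terms, the 'bn_scale' flag corresponds to batch norm's scale argument, which adds or removes the learnable gamma. A minimal sketch of that distinction, assuming a slim.conv2d wiring broadly similar to the repo's (the function name and kernel size here are illustrative, not taken from the code):

```python
# Hedged sketch of what 'bn_scale' controls in a slim conv + batch-norm block.
import tensorflow as tf
slim = tf.contrib.slim

def conv_block(x, num_filters, bn_scale, is_training):
    # scale=False -> no gamma variable (self-supervised shift model)
    # scale=True  -> gamma is learned  (full separation model)
    return slim.conv2d(
        x, num_filters, [3, 3],
        normalizer_fn=slim.batch_norm,
        normalizer_params={'scale': bn_scale, 'is_training': is_training})
```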

commented

Thanks for the quick reply; that indeed resolved my confusion! Another thing I'm wondering is whether it is possible to make the separation model smaller while keeping performance roughly the same. Fewer convolution kernels? Fewer layers?

I suggest decreasing the number of frequency bins in the STFT (e.g. by decreasing the frame_length_ms parameter) and removing layers from the u-net model to compensate. Other recent work (e.g. https://arxiv.org/pdf/1804.04121.pdf, https://arxiv.org/pdf/1804.03619.pdf) does fine with ~25% as many frequency bins. Hope that helps!
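To make the bin-count suggestion concrete, here is a rough sketch of the arithmetic: shorter frames mean a smaller FFT, and therefore fewer one-sided frequency bins. The sample rate and frame lengths below are illustrative assumptions, not the repo's actual defaults.

```python
# Hedged sketch: how frame length (in ms) translates into STFT frequency bins,
# assuming the FFT size is rounded up to the next power of two.
def num_freq_bins(frame_length_ms, sample_rate=21000.0):
    frame_len = int(round(sample_rate * frame_length_ms / 1000.0))
    n_fft = 1
    while n_fft < frame_len:      # round up to the next power of two
        n_fft *= 2
    return n_fft // 2 + 1         # one-sided spectrum

print(num_freq_bins(64.0))   # longer frames -> more bins (here 1025)
print(num_freq_bins(16.0))   # ~25% as many bins (here 257)
```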