andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

question about sourcesep training result on new dataset

xiaoyiming opened this issue

I tried to train sourcesep.py on a new dataset. The dataset contains 12,000 videos, and I trained for about 2,000 iterations. The training results are as follows:

Iteration 0, lr = 1e-04, total:gen: 1.038 gen:reg: 0.155 diff-fg: 0.556 phase-fg: 0.006 diff-bg: 0.316 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 105.432
Iteration 1, lr = 1e-04, total:gen: 1.037 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 104.403
Iteration 2, lr = 1e-04, total:gen: 1.036 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 103.402
Iteration 3, lr = 1e-04, total:gen: 1.035 gen:reg: 0.155 diff-fg: 0.554 phase-fg: 0.006 diff-bg: 0.314 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 102.648
Iteration 4, lr = 1e-04, total:gen: 1.033 gen:reg: 0.155 diff-fg: 0.553 phase-fg: 0.006 diff-bg: 0.313 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 101.729
Iteration 5, lr = 1e-04, total:gen: 1.030 gen:reg: 0.155 diff-fg: 0.551 phase-fg: 0.006 diff-bg: 0.312 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.953
Iteration 6, lr = 1e-04, total:gen: 1.028 gen:reg: 0.155 diff-fg: 0.550 phase-fg: 0.006 diff-bg: 0.311 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.038
Iteration 7, lr = 1e-04, total:gen: 1.024 gen:reg: 0.155 diff-fg: 0.547 phase-fg: 0.006 diff-bg: 0.310 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 99.307
Iteration 8, lr = 1e-04, total:gen: 1.021 gen:reg: 0.155 diff-fg: 0.545 phase-fg: 0.006 diff-bg: 0.309 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 98.419
Iteration 9, lr = 1e-04, total:gen: 1.017 gen:reg: 0.155 diff-fg: 0.542 phase-fg: 0.006 diff-bg: 0.308 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 97.764
Iteration 10, lr = 1e-04, total:gen: 1.013 gen:reg: 0.155 diff-fg: 0.539 phase-fg: 0.006 diff-bg: 0.307 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 96.905
Iteration 20, lr = 1e-04, total:gen: 0.967 gen:reg: 0.155 diff-fg: 0.507 phase-fg: 0.006 diff-bg: 0.294 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 89.464
Iteration 30, lr = 1e-04, total:gen: 0.922 gen:reg: 0.154 diff-fg: 0.475 phase-fg: 0.006 diff-bg: 0.281 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000, time: 82.757
Iteration 40, lr = 1e-04, total:gen: 0.877 gen:reg: 0.153 diff-fg: 0.444 phase-fg: 0.006 diff-bg: 0.268 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000,
.....
Iteration 1800, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.358
Iteration 1810, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.319
Iteration 1820, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.505
Iteration 1830, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.447
Iteration 1840, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.346
Iteration 1850, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.312
Iteration 1860, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.403
Iteration 1870, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.404
Iteration 1880, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.202
Iteration 1890, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.469
Iteration 1900, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.318
Iteration 1910, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.159
Iteration 1920, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.241
Iteration 1930, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.028
Iteration 1940, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.877
Iteration 1950, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.739
Iteration 1960, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.555
Iteration 1970, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.00
Iteration 1980, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.283
Iteration 1990, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.177
Checkpoint: /home/zhang/xiao/multisensory-master/data/traing/sep_2s_test/net.tf-2000
Iteration 2000, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.973
Iteration 2010, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.958
Iteration 2020, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.821
Iteration 2030, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.806
Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.881
As the log shows, the training loss decreases. However, when I use the trained model to separate a video with sep_video.py, I only get noise. Could you give me some advice?

After reading the comments above, I noticed that the author said the I/O code needs to be rewritten. If I rewrite it, should I read the video and audio data separately and then feed them to the two network branches, or should I convert the data to a TF format? What details should I pay attention to when rewriting the I/O code? I look forward to your reply and your help in resolving this. Thank you very much!

Hi, have you fixed this problem yet? I've run into the same trouble.

It's hard to answer this without knowing more (if you're still having this problem). Did you train on VoxCeleb? Is the output really random noise, or just incorrect?

As for the I/O: you can do it either way (with a TFRecord or reading the audio and video through some other process). The code just expects a batch of audio-visual pairs.
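To illustrate what "a batch of audio-visual pairs" could look like on the input side, here is a minimal sketch in plain Python. The helper name `batch_pairs` and the placeholder clip values are hypothetical and are not taken from the repository; the array shapes, frame rates, and sample rates that the training code actually expects are not stated in this thread and would need to be checked against the code.

```python
from itertools import islice


def batch_pairs(pairs, batch_size):
    """Group an iterable of (video, audio) pairs into fixed-size batches.

    Hypothetical helper for illustration only. Each yielded item is a tuple
    (video_batch, audio_batch) of two lists with the same length, so that
    entry i of the video batch always belongs to entry i of the audio batch.
    An incomplete final batch is dropped.
    """
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if len(chunk) < batch_size:
            return
        videos, audios = zip(*chunk)
        yield list(videos), list(audios)


# Placeholder strings stand in for decoded frames and waveforms.
clips = [("video0", "audio0"), ("video1", "audio1"), ("video2", "audio2")]
for video_batch, audio_batch in batch_pairs(clips, batch_size=2):
    print(video_batch, audio_batch)
```

The point of the sketch is only that the pairing must survive batching: whichever route is used (a TFRecord or a separate reader process), each video clip has to stay aligned with its own audio track.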

Hello,
I'm not sure what these losses mean. Could you please explain them?
Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.881

Hi,

diff-fg = L1 loss of on-screen spectrogram magnitude
diff-bg = L1 loss of off-screen spectrogram magnitude
phase-{fg,bg} = same as above, but for the spectrogram phase
reg = weight decay
total:gen = sum of losses

You can ignore the "discrim" terms (they're for a GAN loss that isn't actually used in our paper).
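As a quick sanity check on these definitions, the logged terms can be parsed and added up. The sketch below uses the log line quoted in this thread; the function names `parse_losses` and `generator_terms` are hypothetical, and treating the total as an unweighted sum of the printed terms is an assumption that happens to match the printed numbers.

```python
import re

# Log line copied from this thread.
LOG_LINE = (
    "Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 "
    "diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 "
    "total:discrim: 0.000 discrim:reg: 0.000, time: 19.881"
)


def parse_losses(line):
    """Return {name: value} for every 'name: number' field in a log line."""
    fields = re.findall(r"([A-Za-z:-]+): (\d+\.\d+)", line)
    return {name: float(value) for name, value in fields}


def generator_terms(losses):
    """Add up the generator-side terms (assumed to be an unweighted sum)."""
    names = ["gen:reg", "diff-fg", "phase-fg", "diff-bg", "phase-bg"]
    return sum(losses[name] for name in names)


losses = parse_losses(LOG_LINE)
recomputed = generator_terms(losses)

# Logged values are rounded to three decimals, so allow a small tolerance.
assert abs(recomputed - losses["total:gen"]) < 5e-3
print(round(recomputed, 3), losses["total:gen"])
```

For this line the five terms add up to the logged total:gen of 0.204. Earlier lines can differ in the last digit because each term is rounded separately.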

Thanks for your quick reply and kind help, @andrewowens 💯