andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/


How to train the "shift" and "cam" models for sound source localization?

yxixi opened this issue

commented

First of all, thank you for your earlier reply! Now I've got two more questions about your great work.
I've noticed that there are three models: "shift", "cam" and "sep". To my knowledge, the "sep" model is for source separation and the "cam" model is for localization. There are pretrained model files for these models, such as:
model_file = '../results/nets/shift/net.tf-650000'
model_file = '../results/nets/cam/net.tf-675000'
Now I wonder how to train the "shift" and "cam" models for sound source localization. Could you describe in detail how to call the training function in shift_net.py? Which dataset should I use?
Looking forward to your reply :)

Sorry if that was confusing. We trained the "shift" model on videos from AudioSet: https://research.google.com/audioset/ for 650k iterations. Then, to train the CAM model, we removed a spatial stride from the model and fine-tuned it for 25k more iterations (that gives it higher spatial resolution). And yes, "sep" is for source separation. Pretrained models for both can be downloaded using the ./download_models.sh script. Hope that helps.
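
To make the "removed a spatial stride" part concrete: the idea is just changing one conv layer's spatial stride from 2 to 1 before fine-tuning. A toy illustration only, not the repo's actual architecture or layer names:

import tensorflow as tf

# Dummy input: batch x time x height x width x channels (shapes are illustrative).
x = tf.placeholder(tf.float32, [None, 16, 112, 112, 64])
# Shift model: this conv downsamples spatially (stride 2 in height and width).
y_shift = tf.layers.conv3d(x, 64, [3, 3, 3], strides=[1, 2, 2], padding='same')
# CAM fine-tuning: the same layer with the spatial stride removed (stride 1),
# so the class activation map comes out at twice the spatial resolution.
y_cam = tf.layers.conv3d(x, 64, [3, 3, 3], strides=[1, 1, 1], padding='same')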

commented

@andrewowens
Thank you for your reply! What confuses me most is the question below:

I train the "sep" model like this:

python -c "import sep_params, sourcesep; sourcesep.train(sep_params.full(num_gpus=3), [0, 1, 2], restore = False)"

But how do I run the "shift" model? Looking forward to your reply :)

For training the shift model, the code is very similar:
python -c "import shift_params, shift_net; shift_net.train(shift_params.shift_v1(num_gpus=3), [0, 1, 2], restore = False)"

As in the source separation case, you'll have to rewrite the I/O code (my code uses TFRecordReader, but this is very space-inefficient, and there are probably better ways to do it).
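
If it helps, a bare-bones version of that I/O code could look something like this (a minimal TF1 sketch; the 'im' and 'samples' field names are made up, not the schema my code actually uses):

import tensorflow as tf

def write_example(writer, frames, samples):
    # frames: uint8 numpy array of video frames; samples: float32 numpy audio waveform.
    feat = {
        'im': tf.train.Feature(bytes_list=tf.train.BytesList(value=[frames.tobytes()])),
        'samples': tf.train.Feature(bytes_list=tf.train.BytesList(value=[samples.tobytes()])),
    }
    ex = tf.train.Example(features=tf.train.Features(feature=feat))
    writer.write(ex.SerializeToString())

def parse_example(serialized):
    # Inverse of write_example: decode the raw bytes back into tensors.
    feats = tf.parse_single_example(serialized, features={
        'im': tf.FixedLenFeature([], tf.string),
        'samples': tf.FixedLenFeature([], tf.string),
    })
    frames = tf.decode_raw(feats['im'], tf.uint8)
    samples = tf.decode_raw(feats['samples'], tf.float32)
    return frames, samples

You'd pair write_example with tf.python_io.TFRecordWriter when preprocessing the clips, and parse_example with whatever reading pipeline you use at training time.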

As for running a trained network, please see shift_example.py for an example of generating the CAM (you'd have to modify it slightly, though, to get a shifted/not-shifted prediction, if that's what you want).
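
The shifted/not-shifted prediction itself just means comparing the network's alignment score on an in-sync audio/video pair versus a misaligned one, roughly like this (predict_alignment is a hypothetical stand-in for whatever you build on top of shift_example.py, not a function in the repo):

import numpy as np

def misalign(samples, sr, shift_secs=1.0):
    # Circularly shift the audio relative to the video by shift_secs seconds.
    return np.roll(samples, int(sr * shift_secs))

# Hypothetical usage (predict_alignment is not in the repo):
# aligned_score = predict_alignment(frames, samples)
# shifted_score = predict_alignment(frames, misalign(samples, sr))
# A well-trained shift model should score the aligned pair higher than the shifted one.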

commented

Got it. Thanks for your reply :)