A Closer Look at Weakly-Supervised Audio-Visual Source Localization

Official codebase for SLAVC.

SLAVC is an approach to weakly-supervised visual sound source localization that identifies negatives (samples with no visible sound source) and mitigates the significant overfitting seen in this setting.

A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Shentong Mo, Pedro Morgado
NeurIPS 2022.

Environment

To set up the environment, run

pip install -r requirements.txt

Datasets

Flickr-SoundNet

Data can be downloaded from Learning to localize sound sources

VGG-Sound Source

Data can be downloaded from Localizing Visual Sounds the Hard Way

Extended Flickr-SoundNet

Data can be downloaded from Extended-Flickr-SoundNet

Extended VGG-Sound Source

Data can be downloaded from Extended-VGG-Sound Source

Model Zoo

We release the SLAVC model pre-trained on VGG-Sound 144k data, along with scripts for reproducing results on the Extended Flickr-SoundNet and Extended VGG-Sound Source benchmarks.

Method	Train Set	Test Set	AP	max-F1	Precision	url	Train	Test
SLAVC	VGG-Sound 144k	Extended Flickr-SoundNet	51.63	59.10	83.60	model	script	script
SLAVC	VGG-Sound 144k	Extended VGG-SS	32.95	40.00	37.79	model	script	script
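
For readers unfamiliar with the max-F1 column, the sketch below shows one common way such a number is obtained: sweep a threshold over per-sample confidence scores and keep the best F1. It is only an illustration under simplifying assumptions. The function name `max_f1` and the toy scores are hypothetical, and the benchmark's own evaluation may additionally require the predicted location to be correct before a sample counts as a true positive, which this sketch ignores.

```python
def max_f1(scores, labels):
    """Return the best F1 over all thresholds taken from `scores`.

    scores: confidence that a visible sound source is present (higher = more confident)
    labels: 1 if the sample really contains a visible source, else 0
    """
    best = 0.0
    for thr in sorted(set(scores)):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data (made up): two positives, two negatives, one negative scored too high.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
toy_result = max_f1(toy_scores, toy_labels)
```

On the toy data, the best trade-off is at threshold 0.4, where both positives and one negative are accepted.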

Train

For training an SLAVC model, please run

python train.py --multiprocessing_distributed \
    --train_data_path /path/to/VGGSound-all/ \
    --test_data_path /path/to/Flickr-SoundNet/ \
    --test_gt_path /path/to/Flickr-SoundNet/Annotations/ \
    --experiment_name vggss144k_slavc \
    --model 'slavc' \
    --trainset 'vggss_144k' \
    --testset 'flickr' \
    --epochs 20 \
    --batch_size 128 \
    --init_lr 0.0001 \
    --use_momentum --use_mom_eval \
    --m_img 0.999 --m_aud 0.999 \
    --dropout_img 0.9 --dropout_aud 0
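
The `--use_momentum`, `--m_img` and `--m_aud` flags suggest that slowly updated (momentum) copies of the image and audio encoders are kept during training, with coefficient 0.999. Assuming they follow the usual exponential-moving-average rule, the update might look roughly like the sketch below. The helper `ema_update` and the dictionary-of-floats parameters are hypothetical stand-ins and are not taken from `train.py`.

```python
def ema_update(target, online, m):
    """In-place EMA step: target <- m * target + (1 - m) * online.

    target, online: dicts mapping parameter names to floats (stand-ins for tensors)
    m: momentum coefficient, e.g. 0.999 as passed via --m_img / --m_aud
    """
    for name, value in online.items():
        target[name] = m * target[name] + (1.0 - m) * value
    return target


# Toy run: the online weight stays at 1.0 while the momentum copy starts at 0.0.
online_params = {"w": 1.0}
momentum_params = {"w": 0.0}
for _ in range(1000):
    ema_update(momentum_params, online_params, m=0.999)
```

With a coefficient this close to 1, the momentum copy tracks the online weights slowly: after 1000 steps in the toy run it has closed only about 63% of the gap.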

Test

For testing and visualization, simply run

python test.py --test_data_path /path/to/Extended-VGGSound-test/ \
    --model_dir checkpoints \
    --experiment_name vggss144k_slavc \
    --testset 'vggss_plus_silent' \
    --alpha 0.9 \
    --relative_prediction

Citation

If you find this repository useful, please cite our paper:

@inproceedings{mo2022SLAVC,
  title={A Closer Look at Weakly-Supervised Audio-Visual Source Localization},
  author={Mo, Shentong and Morgado, Pedro},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

License

Apache License 2.0
