my-yy/AML

Introduction

This is a PyTorch implementation of the binary matching case described in 'Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching'.
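To illustrate what a matching trial looks like, here is a minimal sketch, not taken from this repository. It assumes each audio probe and each candidate face has already been mapped to an embedding vector, and that the candidate closest in Euclidean distance is chosen. The function names `match_probe` and `matching_accuracy` are hypothetical, and the distance-based decision rule is an assumption; the actual code may decide differently.

```python
import numpy as np


def match_probe(probe, gallery):
    """Return the index of the gallery embedding closest to the probe.

    probe:   1-D embedding of the query (e.g. an audio segment).
    gallery: (k, d) array of candidate embeddings (e.g. k face images).
    Euclidean distance is an assumption made for this sketch.
    """
    probe = np.asarray(probe, dtype=np.float64)
    gallery = np.asarray(gallery, dtype=np.float64)
    distances = np.linalg.norm(gallery - probe, axis=1)
    return int(np.argmin(distances))


def matching_accuracy(trials):
    """Fraction of (probe, gallery, target_index) trials matched correctly."""
    correct = sum(match_probe(p, g) == t for p, g, t in trials)
    return correct / len(trials)
```

With two candidates per trial this corresponds to the binary case (k = 2); passing a gallery with ten rows would give the multi-way case (k = 10).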

Requirements

python 3.6
librosa 0.7.2
numpy 1.19.0
torch 1.4.0
torchvision 0.5.0

Dataset

The network is trained on images from the VGGFace dataset and audio segments from VoxCeleb1. VGGFace can be downloaded from here. VoxCeleb1 can be downloaded from here.

Experimental Results

Comparison Results of the proposed method

Here are the comparison results of audio-visual matching against state-of-the-art methods on both binary (k = 2) and multi-way (k = 10) cases.

Qualitative Results of the proposed method

Here are the qualitative results of audio-visual cross-modal matching for the proposed AML compared with DIMNet and SVHF-Net in the A → V challenge with k = 2.

Video demonstration

Here is a video demonstrating the results of the proposed method.

Notice

The implementation of the metric learning methods is included in

metric.py

The details of the method can be found in

@inproceedings{oh2016deep,
  title={Deep metric learning via lifted structured feature embedding},
  author={Oh Song, Hyun and Xiang, Yu and Jegelka, Stefanie and Savarese, Silvio},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={4004-4012},
  year={2016}
}
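For orientation, the smoothed lifted structured loss from the paper cited above can be sketched as follows. This is a forward-only NumPy illustration written for this README, not the contents of metric.py; the function name `lifted_structured_loss` and the default margin of 1.0 are assumptions. For each positive pair (i, j), the loss takes the log of the summed exp(margin - distance) terms over all negatives of i and of j, adds the distance between i and j, applies a hinge, squares the result, and averages over twice the number of positive pairs.

```python
import numpy as np


def lifted_structured_loss(embeddings, labels, margin=1.0):
    """Forward value of the smoothed lifted structured loss on one batch."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances D[i, j] between all embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    same = labels[:, None] == labels[None, :]
    negative = ~same

    # For each sample i: sum over its negatives k of exp(margin - D[i, k]).
    neg_sum = (np.exp(margin - dist) * negative).sum(axis=1)

    # Positive pairs (i, j) with i < j and equal labels.
    pos_i, pos_j = np.where(np.triu(same, k=1))
    if len(pos_i) == 0:
        return 0.0

    # J_ij = log(negative terms of i + negative terms of j) + D[i, j].
    # log(0) = -inf (no negatives in the batch) is clipped by the hinge below.
    with np.errstate(divide="ignore"):
        j_ij = np.log(neg_sum[pos_i] + neg_sum[pos_j]) + dist[pos_i, pos_j]

    # Hinge, square, and average over 2 * |P|.
    return float((np.maximum(j_ij, 0.0) ** 2).sum() / (2 * len(pos_i)))
```

A batch whose classes are well separated gives a loss of zero, while a batch in which all embeddings collapse to the same point is penalised. For training, the same computation would need to be expressed with differentiable tensor operations.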
