XFlow

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
IEEE Transactions on Neural Networks and Learning Systems 2019, IEEE ICDL-EPIROB Workshop on Computational Models for Crossmodal Learning (CMCML) 2017, ARM Research Summit 2017
Cătălina Cangea, Petar Veličković, Pietro Liò

We propose XFlow, a family of cross-modal deep learning architectures that allow dataflow between several feature extractors. Our models derive more interpretable features and achieve better performance than models that do not exchange representations. They represent a novel method for performing cross-modal information exchange before features are fully learned from the individual modalities, usefully exploiting correlations between audio and visual data, which have different dimensionalities and are not trivially exchangeable. We also provide the research community with Digits, a new dataset consisting of three data types extracted from videos of people saying the digits 0-9. Both cross-modal architectures outperform their baselines (by up to 11.5%) when evaluated on the AVletters, CUAVE and Digits datasets, achieving state-of-the-art results.
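To make the idea concrete, below is a minimal, hypothetical Keras sketch of a single cross-modal connection between an image branch and an audio branch. It is not the XFlow architecture itself; the input shapes and layer sizes are placeholders chosen only for illustration.

# Minimal sketch of a cross-modal connection (illustrative, not the XFlow model).
from keras.layers import Input, Conv2D, Flatten, Dense, concatenate
from keras.models import Model

image_in = Input(shape=(60, 80, 1), name='image')  # e.g. a mouth-region frame (placeholder shape)
audio_in = Input(shape=(26,), name='audio')        # e.g. an MFCC feature vector (placeholder shape)

# Unimodal feature extractors.
img_feat = Flatten()(Conv2D(16, (3, 3), activation='relu')(image_in))
aud_feat = Dense(64, activation='relu')(audio_in)

# Cross-modal connections: each branch receives a projection of the other's
# features before its own representation is finalised.
img_to_aud = Dense(32, activation='relu')(img_feat)
aud_to_img = Dense(32, activation='relu')(aud_feat)

img_branch = Dense(64, activation='relu')(concatenate([img_feat, aud_to_img]))
aud_branch = Dense(64, activation='relu')(concatenate([aud_feat, img_to_aud]))

# Late fusion and classification over the 10 digit classes.
predictions = Dense(10, activation='softmax')(concatenate([img_branch, aud_branch]))
model = Model(inputs=[image_in, audio_in], outputs=predictions)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])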

Getting started

$ git clone https://github.com/catalina17/XFlow
$ virtualenv -p python3 xflow
$ source xflow/bin/activate
$ pip install tensorflow-gpu==1.8.0
$ pip install keras==2.1.4

Dataset

The Digits benchmark data can be found here. After expanding the archive into a directory of your choice, update BASE_DIR (declared in Datasets/data_config.py) to point to that directory.
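The edit in Datasets/data_config.py is a single assignment; the path below is only a placeholder.

# Datasets/data_config.py
BASE_DIR = '/path/to/expanded/digits/archive'  # placeholder -- use the directory you expanded the archive into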

Running the models

The script eval.py exposes command-line arguments for choosing models and datasets. For example, you can run the {CNN x MLP}--LSTM baseline on Digits as follows:

CUDA_VISIBLE_DEVICES=0 python eval.py --model=cnn_mlp_lstm_baseline --dataset=digits --batch_size=64
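Assuming eval.py uses a standard argparse interface (as the flags above suggest), the remaining --model and --dataset choices can be listed with:

python eval.py --help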

Citation

Please cite us if you are inspired by or use XFlow and/or the Digits dataset:

@ARTICLE{8894404,
  author={C. {Cangea} and P. {Veličković} and P. {Liò}},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  title={XFlow: Cross-Modal Deep Neural Networks for Audiovisual Classification},
  year={2019},
  volume={}, number={}, pages={1-10},
}

About

Generalized cross-modal NNs; new audiovisual benchmark (IEEE TNNLS 2019)

