An introduction to our model for dimensional speech emotion recognition based on wav2vec 2.0. The model is available from doi:10.5281/zenodo.6221127 and is released under CC BY-NC-SA 4.0. It was created by fine-tuning the pre-trained wav2vec2-large-robust model on MSP-Podcast (v1.7); before fine-tuning, the pre-trained model was pruned from 24 to 12 transformer layers. In this tutorial we use the ONNX export of the model; the original Torch model is hosted on Hugging Face. Further details are given in the associated paper.
Create and activate a Python virtual environment, then install audonnx.
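For example, on Linux or macOS (the environment name venv is just a placeholder):
$ python -m venv venv
$ source venv/bin/activate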
$ pip install audonnx
Load the model and test it on a random signal.
import audeer
import audonnx
import numpy as np

# Download the model archive to a cache folder
# and extract it to the model folder
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')
archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)

# Load the ONNX model
model = audonnx.load(model_root)

# Test the model on one second of random noise
# (wav2vec 2.0 expects 16 kHz input)
sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)
model(signal, sampling_rate)
{'hidden_states': array([[-0.00711814, 0.00615957, -0.00820673, ..., 0.00666412,
0.00952989, 0.00269193]], dtype=float32),
'logits': array([[0.6717072 , 0.6421313 , 0.49881312]], dtype=float32)}
The hidden states can be used as embeddings for related speech emotion recognition tasks. The values in the logits output are ordered as: arousal, dominance, valence.
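To run the model on an actual recording and access the outputs by position, the sketch below shows one way to do it. It assumes the audiofile package for reading audio and uses 'speech.wav' as a placeholder path; the file is expected to be mono and sampled at 16 kHz (otherwise it needs to be resampled first, e.g. with audresample).
import audiofile

# 'speech.wav' is a placeholder; the recording is assumed
# to be mono and sampled at 16 kHz
signal, sampling_rate = audiofile.read('speech.wav')

out = model(signal, sampling_rate)
embedding = out['hidden_states'][0]             # pooled embedding for downstream tasks
arousal, dominance, valence = out['logits'][0]  # order as stated above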
For a detailed introduction, please check out the notebook.
$ pip install -r requirements.txt
$ jupyter notebook notebook.ipynb
If you use our model in your own work, please cite the following paper:
@article{wagner2022dawn,
  title={Dawn of the transformer era in speech emotion recognition: closing the valence gap},
  author={Wagner, Johannes and Triantafyllopoulos, Andreas and Wierstorf, Hagen and Schmitt, Maximilian and Burkhardt, Felix and Eyben, Florian and Schuller, Bj{\"o}rn W.},
  journal={arXiv preprint arXiv:2203.07378},
  year={2022}
}