Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
(Logo generated by DALL·E 3)
- emotion2vec has been integrated into ModelScope.
- We released the paper and created a WeChat group for emotion2vec.
- We released the code, checkpoints, and extracted features for emotion2vec.
emotion2vec is the first universal speech emotion representation model. Through self-supervised pre-training, emotion2vec can extract emotion representations across different tasks, languages, and scenarios.
emotion2vec achieves state-of-the-art (SOTA) results on the mainstream IEMOCAP dataset using only linear downstream layers. Refer to the paper for more details.
emotion2vec also outperforms other SOTA SSL models on multiple languages (Mandarin, French, German, Italian, etc.). Refer to the paper for more details.
UMAP visualizations of learned features on the IEMOCAP dataset. Red and blue tones indicate low- and high-arousal emotion classes, respectively. Refer to the paper for more details.
We provide features extracted from the popular IEMOCAP emotion dataset. The features are taken from the last layer of emotion2vec and stored in .npy format. The frame-level features have a frame rate of 50 Hz, and the utterance-level features are computed by averaging the frame-level features.
- frame-level: Google Drive | Baidu Netdisk (password: zb3p)
- utterance-level: Google Drive | Baidu Netdisk (password: qu3u)
Features are provided for all wav files in the original dataset to support diverse downstream tasks. If you want to train on the standard 5,531 utterances for 4-class emotion classification, please refer to the iemocap_downstream folder.
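A minimal sketch of working with the released features, assuming each frame-level .npy file holds one utterance as a (T, D) float array. The actual archive layout may differ, and the helper names here are hypothetical. It reproduces the averaging described above and converts frame indices to timestamps at the stated 50 Hz frame rate.

```python
import numpy as np

FRAME_RATE_HZ = 50  # frame rate of the released frame-level features


def utterance_feature(frame_feats: np.ndarray) -> np.ndarray:
    """Average (T, D) frame-level features into a single (D,) utterance vector."""
    if frame_feats.ndim != 2 or frame_feats.shape[0] == 0:
        raise ValueError(f"expected a non-empty (T, D) array, got shape {frame_feats.shape}")
    return frame_feats.mean(axis=0)


def frame_start_times(num_frames: int) -> np.ndarray:
    """Start time in seconds of each frame at a fixed 50 Hz frame rate (20 ms per frame)."""
    return np.arange(num_frames) / FRAME_RATE_HZ


def load_utterance_feature(npy_path: str) -> np.ndarray:
    # Assumption: one utterance per .npy file, stored as a (T, D) array.
    return utterance_feature(np.load(npy_path))
```

For example, a file with 100 frames covers 2 seconds of audio, and load_utterance_feature returns one vector with the same feature dimension D.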
The minimum environment requirements are python>=3.8 and torch>=1.13. Our testing environment uses python=3.8 and torch=2.0.1.
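A small sketch that checks an installed environment against the minimum requirements above. It reads the torch version through importlib.metadata, so it does not import torch itself. The version parser is a simplified helper written for this example, not a packaging-library API, and only compares the leading numeric release components.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)
MIN_TORCH = (1, 13)


def parse_release(version: str) -> tuple:
    """Leading numeric components of a version string, e.g. '2.0.1+cu117' -> (2, 0, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at pre-release markers such as '0a0'
    return tuple(parts)


def check_environment() -> tuple:
    """Return (meets_requirements, torch_version_or_None)."""
    python_ok = sys.version_info >= MIN_PYTHON
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return False, None  # torch is not installed
    return python_ok and parse_release(torch_version) >= MIN_TORCH, torch_version


if __name__ == "__main__":
    print(check_environment())
```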
- Install fairseq and clone this repo.
```shell
pip install fairseq
git clone https://github.com/ddlBoJack/emotion2vec.git
```
- Download the emotion2vec checkpoint from:
  - Google Drive
  - Baidu Netdisk (password: b9fq)
  - ModelScope:
    ```shell
    git clone https://www.modelscope.cn/damo/emotion2vec_base.git
    ```
- Modify and run scripts/extract_features.sh.
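The 50 Hz frame rate of the extracted features follows from the total stride of the convolutional feature encoder. The sketch below assumes emotion2vec keeps the wav2vec 2.0 / data2vec-style encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2) and 16 kHz input audio. Neither detail is stated in this README, so check the checkpoint configuration before relying on exact frame counts.

```python
# Assumed conv feature encoder layout: (kernel_size, stride) per layer.
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2
SAMPLE_RATE = 16_000  # assumed input sample rate in Hz


def total_stride(layers=CONV_LAYERS) -> int:
    """Product of all strides: input samples per output frame."""
    stride = 1
    for _, s in layers:
        stride *= s
    return stride


def num_frames(num_samples: int, layers=CONV_LAYERS) -> int:
    """Number of output frames for an unpadded 1-D input of num_samples samples."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short for this layer
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    print(SAMPLE_RATE / total_stride())  # 50.0 frames per second
    print(num_frames(SAMPLE_RATE))       # 49 frames for 1 s of audio
```

Under these assumptions, one second of audio yields 49 frames rather than exactly 50, because each unpadded convolution trims a few samples at the edges.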
- Install modelscope and FunASR from source.
```shell
pip install modelscope
git clone -b main --single-branch https://github.com/alibaba-damo-academy/FunASR.git
cd FunASR
pip install -e ./
```
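- Run the code.
```python
from funasr import AutoModel

# Load emotion2vec_base from ModelScope (downloaded on first use)
model = AutoModel(model="damo/emotion2vec_base", model_revision="v2.0.1")
# Example audio file shipped with the downloaded model
wav_file = f"{model.model_path}/example/example/test.wav"
# Extract utterance-level features; output_dir is where FunASR writes its outputs
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(res)
```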
The model is downloaded automatically from ModelScope.
We provide training scripts for the IEMOCAP dataset in the iemocap_downstream folder. You can adapt these scripts to train downstream models on other datasets.
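emotion2vec is reported to reach SOTA on IEMOCAP with only linear layers on top of its features. To illustrate what such a linear probe looks like, here is a minimal NumPy sketch of softmax regression trained by full-batch gradient descent on utterance-level features. It is not the recipe in iemocap_downstream. The hyperparameters are arbitrary, and the demo uses hypothetical synthetic data in place of real emotion2vec features.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_linear_probe(X, y, num_classes, lr=0.1, epochs=300, weight_decay=1e-4):
    """Fit W (D, C) and b (C,) by minimizing mean cross-entropy with gradient descent."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels, shape (N, C)
    for _ in range(epochs):
        grad = (softmax(X @ W + b) - Y) / n  # gradient of mean loss w.r.t. logits
        W -= lr * (X.T @ grad + weight_decay * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    return (X @ W + b).argmax(axis=1)


def make_synthetic(n=400, dim=16, num_classes=4, seed=0):
    """Hypothetical stand-in for utterance-level features: one Gaussian blob per class."""
    rng = np.random.default_rng(seed)
    centers = 3.0 * rng.normal(size=(num_classes, dim))
    y = rng.integers(0, num_classes, size=n)
    X = centers[y] + rng.normal(size=(n, dim))
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature dimension
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    W, b = train_linear_probe(X, y, num_classes=4)
    print("train accuracy:", (predict(X, W, b) == y).mean())
```

With real data, X would be the (N, D) matrix of utterance-level emotion2vec features and y the integer emotion labels. Accuracy should then be measured on a held-out split rather than on the training data.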
If you find our emotion2vec code and paper useful, please cite:
@article{ma2023emotion2vec,
title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
journal={arXiv preprint arXiv:2312.15185},
year={2023}
}