macroustc/3D-Speaker

3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope.

Quickstart

Install 3D-Speaker

git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt

Running experiments

# Speaker verification: CAM++ on voxceleb
cd egs/sv-cam++/voxceleb/
bash run.sh
# Self-supervised speaker verification: RDINO on voxceleb
cd egs/sv-rdino/voxceleb/
bash run.sh
# Speaker verification: ERes2Net on voxceleb
cd egs/sv-eres2net/voxceleb/
bash run.sh

Inference using pretrained models from Modelscope

All pretrained models are released on Modelscope.

# Install modelscope
pip install modelscope
# CAM++ trained on VoxCeleb
model_id=damo/speech_campplus_sv_en_voxceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run cam++ inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path

# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run rdino inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path

# ERes2Net trained on VoxCeleb
model_id=damo/speech_eres2net_sv_en_voxceleb_16k
# Run ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path

Task	Dataset	Model	Performance
speaker verification	VoxCeleb	CAM++	Vox1-O EER = 0.73%
self-supervised speaker verification	VoxCeleb	RDINO	Vox1-O EER = 3.16%
speaker verification	200k-speaker dataset	CAM++	CN-Celeb-test EER = 4.32%
speaker verification	VoxCeleb	ERes2Net	Vox1-O EER = 0.97%

News

[2023.5] ERes2Net training recipes on VoxCeleb released. ERes2Net incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion fuses the features within one single residual block to extract the local signal. The global feature fusion takes acoustic features of different scales as input to aggregate global signal.
[2023.5] ERes2Net pretrained model released, trained on VoxCeleb.
[2023.4] RDINO training recipes on VoxCeleb released. RDINO is a self-supervised learning framework in speaker verification aiming to alleviate model collapse in non-contrastive methods. It contains teacher and student network with an identical architecture but different parameters. Two regularization terms are proposed in RDINO, namely diversity regularization and redundancy elimination regularization. RDINO achieve 3.16% EER and 0.223 MinDCF in VoxCeleb using single-stage self-supervised training.
[2023.4] CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers. It achieves an EER of 4.32% in CN-Celeb test set.
[2023.4] CAM++ training recipe on VoxCeleb released. CAM++ is a fast and efficient speaker embedding extractor based on a densely connected time-delay neural network (D-TDNN). It adopts a novel multi-granularity pooling method to conduct context-aware masking. CAM++ achieves an EER of 0.73% in Voxceleb and 6.78% in CN-Celeb, outperforming other mainstream speaker embedding models such as ECAPA-TDNN and ResNet34, while having lower computational cost and faster inference speed.

To be expected

[2023.6] Releasing ERes2Net model trained on over 200k labeled speakers.
[2023.6] Releasing 3D-Speaker dataset and its corresponding benchmarks.

Contact

If you have any comment or question about 3D-Speaker, please contact us by

email: {zsq174630, chenyafeng.cyf, tongmu.wh, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledge

3D-Speaker contains third-party components and code modified from some open-source repos, including:

Citations

If you are using RDINO model in your research, please cite:

@inproceedings{chen2023pushing,
  title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

If you are using CAM++ model in your research, please cite:

@article{cam++,
  title={CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking},
  author={Hui Wang and Siqi Zheng and Yafeng Chen and Luyao Cheng and Qian Chen},
  booktitle={Interspeech 2023},
  year={2023},
  organization={IEEE}
}

If you are using ERes2Net model in your research, please cite:

@article{eres2net,
  title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
  author={Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi},
  booktitle={Interspeech 2023},
  year={2023},
  organization={IEEE}
}

macroustc / 3D-Speaker