This project partially reflects the best practices in speaker verification technology up till 2020.
For more details, please refer to the link: The VoxSRC Workshop 2020.
-
Training data are prepared following Kaldi voxceleb recipe: VoxCeleb2_dev and its four augmented versions are generated using RIRS_NOISES and MUSAN datasets.
-
80-dimensional (or 40-dimensional) FBANKs are extracted.
-
Speaker augmentation is not applied.
-
Online data augmentation is not applied.
-
1D-conv model: 512-d TDNN model following Kaldi voxceleb recipe.
-
2D-conv models: Res2Net model and DPN (Dual Path Networks) model.
-
Large margin finetuning (LMFT).
-
Mixed precision training and distributed training.
-
Cosine similarity scoring.
-
Adaptive symmetric normalization (asnorm) on cosine scores.
-
Quality measure function (QMF) or score calibration is not applied.
-
Data preparation takes a long time. It is recommended to run data preparation script on a server. With a D48s_v5 Virtual Machine on Azure (48 vCPUs, 192 GB memory), it takes about 10 hours to complete the whole process.
-
For the smallest Res2Net model, res2net50_w8_s6_c16, it takes about 52 + 5 hours to train the model on a single NVIDIA Geforce RTX 4090 GPU.
-
For the largest Res2Net model, res2net200_w8_s6_c16, it takes about 76 + 8 hours to train the model on eight NVIDIA Geforce RTX 3090 GPUs.
-
Install TensorFlow 1.x and Kaldi toolkit manually
-
or setup a docker container with Dockerfile (recommended):
docker build -t xx205/tensorflow:23.03-tf1-py3-kaldi -f Dockerfile .
-
or pull the container directly (recommended):
docker pull xx205/tensorflow:23.03-tf1-py3-kaldi
-
Run docker container:
docker run --rm -it --gpus device=all -v `pwd`:`pwd` --ipc=host --ulimit memlock=-1 xx205/tensorflow:23.03-tf1-py3-kaldi
-
Clone this repository:
git clone https://github.com/xx205/voxsrc2020_speaker_verification
-
Run data preparation script (note, 80-dimensional FBANKs are extracted by default. FBANK dimension can be set in global_config.sh):
cd voxsrc2020_speaker_verification bash prepare_data.sh
-
Go to working directory:
cd tensorflow
-
Train a Res2Net model with JTBD's VoxSRC2020 setup on VoxCeleb2_dev_aug training set:
bash run_res2net_local_vox2_dev_aug.sh models.res2net_model res2net50_w24_s4_c32
-
Export trained checkpoint to pb file for inference:
bash export_inference_model.sh \ models.res2net_model \ res2net50_w24_s4_c32 \ exp/voxceleb2_dev_aug/res2net50_w24_s4_c32_sc_cm_linear_frames200_scale32_margin0.2_8GPUs \ 3 \ sc_cm_linear/kernel
-
Evaluate trained Res2Net model performance on VoxCeleb1 Test/Extended/Hard trials:
export saved_model=exp/voxceleb2_dev_aug/res2net50_w24_s4_c32_sc_cm_linear_frames200_scale32_margin0.2_8GPUs_122636.pb bash eval_inference_model.sh ${saved_model} 3 for partition in T E H; do for approach in cosine snorm; do python3 eer_minDCF.py \ --trial ../data/voxceleb1_trials/list_test_${partition}.txt \ --score ${saved_model%.pb}_embeddings/voxceleb1/${approach}_${partition}.txt done done
-
Finetune the Res2Net model with LMFT on VoxCeleb2_dev training set:
bash run_res2net_finetune_local_vox2_dev.sh models.res2net_model res2net50_w24_s4_c32
-
Export finetuned checkpoint to pb file for inference:
bash export_inference_model.sh \ models.res2net_model \ res2net50_w24_s4_c32 \ exp/voxceleb2_dev/res2net50_w24_s4_c32_sc_cm_linear_frames600_scale32_margin0.4_8GPUs \ 3 \ sc_cm_linear/kernel
-
Evaluate finetuned Res2Net model performance on VoxCeleb1 Test/Extended/Hard trials:
export saved_model=exp/voxceleb2_dev/res2net50_w24_s4_c32_sc_cm_linear_frames600_scale32_margin0.4_8GPUs_127968.pb bash eval_inference_model.sh ${saved_model} 3 for partition in T E H; do for approach in cosine snorm; do python3 eer_minDCF.py \ --trial ../data/voxceleb1_trials/list_test_${partition}.txt \ --score ${saved_model%.pb}_embeddings/voxceleb1/${approach}_${partition}.txt done done
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
---|---|---|---|---|---|---|
tdnn_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 3.5 M | Stats Pool |
dpn68_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 13.9 M | Stats Pool |
Vox1_Test (EER/minDCF0.01) |
|
---|---|
tdnn_voxsrc2020 (cosine) | 3.4398%/0.3339 |
dpn68_voxsrc2020 (cosine) | 2.0894%/0.2544 |
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
---|---|---|---|---|---|---|
dpn68_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 13.9 M | Stats Pool |
res2net50_w24_s4_c64_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 26.9 M | Stats Pool |
Vox1_Test (EER/minDCF0.01) |
Extended (EER/minDCF0.01) |
Hard (EER/minDCF0.01) |
|
---|---|---|---|
dpn68_voxsrc2020 (cosine) | 0.9517%/0.0884 | 1.0217%/0.1125 | 1.8649%/0.1833 |
dpn68_voxsrc2020 (asnorm) 1 | 0.8347%/0.0879 | 0.9452%/0.0996 | 1.6401%/0.1527 |
res2net50_w24_s4_c64_voxsrc2020 (cosine) | 0.9942%/0.1163 | 0.9924%/0.1152 | 1.8387%/0.1857 |
res2net50_w24_s4_c64_voxsrc2020 (asnorm) 2 | 0.8400%/0.0931 | 0.9051%/0.0970 | 1.5976%/0.1471 |
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
---|---|---|---|---|---|---|
dpn68 | sc_cm_linear | ✓ | 200 | 0.2 | 13.9 M | Stats Pool |
dpn68+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 13.9 M | Stats Pool |
res2net50_w24_s4_c64 | sc_cm_linear | ✓ | 200 | 0.2 | 26.9 M | Stats Pool |
res2net50_w24_s4_c64+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 26.9 M | Stats Pool |
Vox1_Test (EER/minDCF0.01) |
Extended (EER/minDCF0.01) |
Hard (EER/minDCF0.01) |
|
---|---|---|---|
dpn68 (cosine) | 1.0952%/0.1052 | 1.1725%/0.1322 | 2.0998%/0.2053 |
dpn68 (asnorm) | 0.9783%/0.1081 | 1.0707%/0.1184 | 1.8656%/0.1742 |
dpn68+LMFT (cosine) | 0.8666%/0.0824 | 0.9634%/0.1058 | 1.6910%/0.1641 |
dpn68+LMFT (asnorm) | 0.8081%/0.0735 | 0.8934%/0.0925 | 1.5076%/0.1414 |
res2net50_w24_s4_c64 (cosine) | 1.0580%/0.1089 | 1.0728%/0.1226 | 1.9778%/0.1911 |
res2net50_w24_s4_c64 (asnorm) | 0.9038%/0.0989 | 0.9824%/0.1086 | 1.7269%/0.1621 |
res2net50_w24_s4_c64+LMFT (cosine) | 0.9198%/0.1005 | 0.9258%/0.1006 | 1.6957%/0.1650 |
res2net50_w24_s4_c64+LMFT (asnorm) | 0.7922%/0.0804 | 0.8482%/0.0903 | 1.4854%/0.1353 |
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
---|---|---|---|---|---|---|
res2net50_w24_s4_c64 | sc_cm_linear | ✓ | 200 | 0.2 | 32.2 M | Stats Pool |
res2net50_w24_s4_c64+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 32.2 M | Stats Pool |
res2net50_w24_s4_c32 | sc_cm_linear | ✓ | 200 | 0.2 | 17.7 M | Stats Pool |
res2net50_w24_s4_c32+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 17.7 M | Stats Pool |
res2net50_w8_s6_c16 | sc_cm_linear | ✓ | 200 | 0.2 | 4.8 M | Stats Pool |
res2net50_w8_s6_c16+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 4.8 M | Stats Pool |
res2net101_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 29.3 M | Att Stats Pool |
res2net101_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 29.3 M | Att Stats Pool |
res2net152_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 32.9 M | Att Stats Pool |
res2net152_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 32.9 M | Att Stats Pool |
res2net200_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 35.5 M | Att Stats Pool |
res2net200_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 35.5 M | Att Stats Pool |
Vox1_Test (EER/minDCF0.01) |
Extended (EER/minDCF0.01) |
Hard (EER/minDCF0.01) |
VoxSRC2022_dev (EER/minDCF0.05) |
|
---|---|---|---|---|
res2net50_w24_s4_c64 (cosine) | 0.9304%/0.0855 | 1.0345%/0.1109 | 1.8228%/0.1777 | |
res2net50_w24_s4_c64 (asnorm) | 0.7656%/0.0662 | 0.8989%/0.0964 | 1.5381%/0.1423 | |
res2net50_w24_s4_c64+LMFT (cosine) | 0.7762%/0.0702 | 0.8589%/0.0933 | 1.5236%/0.1506 | |
res2net50_w24_s4_c64+LMFT (asnorm) | 0.6805%/0.0579 | 0.8024%/0.0827 | 1.3703%/0.1210 | |
res2net50_w24_s4_c32 (cosine) | 0.8347%/0.0771 | 0.9893%/0.1104 | 1.8242%/0.1794 | |
res2net50_w24_s4_c32 (asnorm) | 0.7018%/0.0673 | 0.9062%/0.0986 | 1.5860%/0.1466 | |
res2net50_w24_s4_c32+LMFT (cosine) | 0.6327%/0.0622 | 0.8500%/0.0937 | 1.5352%/0.1549 | |
res2net50_w24_s4_c32+LMFT (asnorm) | 0.5529%/0.0625 | 0.8003%/0.0820 | 1.3751%/0.1271 | |
res2net50_w8_s6_c16 (cosine) | 1.0474%/0.1133 | 1.1721%/0.1288 | 2.0929%/0.2074 | |
res2net50_w8_s6_c16 (asnorm) | 0.9889%/0.1024 | 1.0659%/0.1164 | 1.8627%/0.1777 | |
res2net50_w8_s6_c16+LMFT (cosine) | 0.8666%/0.0839 | 0.9772%/0.1019 | 1.7360%/0.1678 | |
res2net50_w8_s6_c16+LMFT (asnorm) | 0.7869%/0.0821 | 0.8903%/0.0917 | 1.5370%/0.1414 | |
res2net101_w24_s4_c32_att (cosine) | 0.6539%/0.0655 | 0.7837%/0.0823 | 1.4125%/0.1395 | |
res2net101_w24_s4_c32_att (asnorm) | 0.5742%/0.0664 | 0.7230%/0.0749 | 1.2679%/0.1187 | |
res2net101_w24_s4_c32_att+LMFT (cosine) | 0.5795%/0.0512 | 0.6526%/0.0677 | 1.2099%/0.1174 | |
res2net101_w24_s4_c32_att+LMFT (asnorm) | 0.5210%/0.0534 | 0.6081%/0.0618 | 1.0940%/0.0999 | |
res2net152_w24_s4_c32_att (cosine) | 0.5476%/0.0507 | 0.7313%/0.0811 | 1.3453%/0.1331 | |
res2net152_w24_s4_c32_att (asnorm) | 0.4891%/0.0530 | 0.6754%/0.0729 | 1.1957%/0.1101 | |
res2net152_w24_s4_c32_att+LMFT (cosine) | 0.4732%/0.0425 | 0.6516%/0.0676 | 1.1957%/0.1119 | |
res2net152_w24_s4_c32_att+LMFT (asnorm) | 0.4572%/0.0463 | 0.6099%/0.0587 | 1.0737%/0.0965 | |
res2net200_w24_s4_c32_att (cosine) | 0.4944%/0.0418 | 0.7137%/0.0780 | 1.2897%/0.1267 | |
res2net200_w24_s4_c32_att (asnorm) | 0.4200%/0.0501 | 0.6658%/0.0690 | 1.1496%/0.1019 | |
res2net200_w24_s4_c32_att+LMFT (cosine) | 0.4041%/0.0390 | 0.6330%/0.0672 | 1.1641%/0.1113 | |
res2net200_w24_s4_c32_att+LMFT (asnorm) | 0.3668%/0.0388 | 0.5930%/0.0581 | 1.0330%/0.0912 | 1.5017/0.0974 |
Footnotes
-
DPN68 model with B+α+1 configuration (https://arxiv.org/pdf/2011.00200.pdf) ↩
-
Res2Net50 model with B+α+5 configuration (https://arxiv.org/pdf/2011.00200.pdf) ↩