Regarding speaker recognition in PASE
Some-random opened this issue
I'm a bit confused about the experimental setup for speaker recognition described in the original PASE paper. If my understanding is correct, only 15 s × 2484 utterances, i.e. 10.35 hours of speech from LibriSpeech, is used for pretraining, and only 11 s × 109 utterances, i.e. 0.33 hours of speech from VCTK, is used for fine-tuning. Both numbers seem awfully small...
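To double-check my own math, here is a quick sketch of the arithmetic behind those figures (the clip counts and per-clip durations are my estimates from the paper, not confirmed numbers):

```python
def total_hours(num_clips: int, seconds_per_clip: float) -> float:
    """Total speech duration in hours for a set of fixed-length clips."""
    return num_clips * seconds_per_clip / 3600

# Assumed figures: 2484 LibriSpeech clips of ~15 s for pretraining,
# 109 VCTK clips of ~11 s for fine-tuning.
pretrain_hours = total_hours(2484, 15)   # ~10.35 hours
finetune_hours = total_hours(109, 11)    # ~0.33 hours
print(f"pretraining: {pretrain_hours:.2f} h, fine-tuning: {finetune_hours:.2f} h")
```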