huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.


Why does the BNE-PPG-VC model in your demo perform better than the pre-trained model given in the original paper?

jiazj-jiazj opened this issue · comments

I tried the pre-trained model bneSeq2seqMoL-vctk-libritts460-oneshot and converted the source wavs to target wavs from the demo provided for your paper. The results in your paper performed better than those from this model trained by the original paper's authors. Why? Have you trained the HiFi-GAN model again? Thank you!

Hi, @jiazj-jiazj !

No, we didn't fine-tune the HiFi-GAN vocoder; we just took the code from this repo as is, with the vocoder checkpoint they provided and the recommended bneSeq2seqMoL-vctk-libritts460-oneshot model. I'm not sure what might have gone wrong or why your voice conversion results with this model were worse than the ones from our demo.
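
For context, here is a minimal sketch of what this inference flow looks like. All module and function names below are hypothetical placeholders for illustration, not the actual API of the mentioned repo:

```python
import torch

# Hypothetical entry points; the real ones are defined in the mentioned repo.
from ppg_vc import load_conversion_model, load_hifigan_vocoder  # hypothetical

# Load the recommended one-shot model and the provided HiFi-GAN checkpoint.
vc_model = load_conversion_model("bneSeq2seqMoL-vctk-libritts460-oneshot")
vocoder = load_hifigan_vocoder("hifigan_checkpoint.pt")

# The bottleneck extractor (BNE) produces speaker-independent phonetic
# (PPG-like) features from the source utterance; the seq2seq decoder,
# conditioned on a speaker embedding from the reference utterance, predicts
# a mel spectrogram; HiFi-GAN then turns the mel into a waveform.
with torch.no_grad():
    mel = vc_model.convert(source_wav="source.wav", reference_wav="target.wav")
    wav = vocoder(mel)  # 24 kHz output, as noted below
```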

I used this model to reproduce the experiment you described (I just took the source and reference voice samples from our demo) and got the same BNE-PPG-VC results as in our demo. See the results of my experiments here.

Perhaps the reason is that the output audios produced by the voice conversion model from the mentioned repo were loudness-normalized before being put into our demo, so in the demo the loudness may be lower and the quality may therefore seem better. Also note that our demo uses 16 kHz audio while the BNE-PPG-VC model outputs 24 kHz, so we also downsampled the audio before adding it to the demo. These are the only things I can think of.
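
For reference, a minimal sketch of that post-processing, assuming pyloudnorm for loudness normalization and librosa for resampling (the exact tools and the -23 LUFS target are illustrative, not necessarily what we used):

```python
import librosa
import pyloudnorm as pyln
import soundfile as sf

SRC_SR, DEMO_SR = 24000, 16000  # model output rate vs. demo rate
TARGET_LUFS = -23.0             # illustrative loudness target

# Load the 24 kHz output of the BNE-PPG-VC model.
wav, sr = sf.read("converted_24k.wav")
assert sr == SRC_SR

# Normalize to a fixed integrated loudness (LUFS).
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(wav)
wav = pyln.normalize.loudness(wav, loudness, TARGET_LUFS)

# Downsample to the 16 kHz used throughout the demo.
wav_16k = librosa.resample(wav, orig_sr=SRC_SR, target_sr=DEMO_SR)
sf.write("converted_16k.wav", wav_16k, DEMO_SR)
```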