May I ask how you eliminated the need for phoneme-audio alignment by predicting semantic latents?
rainbowjack opened this issue · comments
rainbowjack commented
Can you indicate in which file you implemented this feature?
Also, as you wrote in the README: \t<speaker_id>\t\t<script>\t<phonemized_transcript>. Can these fields be replaced with placeholders? If not, will their presence or absence affect the performance of the final trained model?
Songting commented
This repo will be updated soon with a new training pipeline eliminating the need for phone alignment and speaker labels.
It does not cause any performance degradation.