Personalized Speech Synthesis with a Scalable Speech Diffusion Transformer and Controllable Voice Style Transfer
Although speech synthesis systems have advanced remarkably and expanded into various applications, achieving robust voice style transfer while maintaining high quality in zero-shot scenarios remains challenging. In this paper, we propose a neural speech synthesis system with a speech diffusion transformer (SDT) that performs style transfer effectively even in low-resource and zero-shot scenarios. We introduce a diffusion-based voice conversion network with strong style adaptation performance. We explore a transformer-based diffusion backbone and a style conditioning method that simultaneously capture the spatial and temporal information of acoustic characteristics for more efficient and robust speaker adaptation, which significantly improves style adaptation over existing diffusion-based speech synthesis systems. Additionally, we propose a flexible neural pitch control method for personalized voice style transfer. In particular, the style can be flexibly interpolated between the source and target speakers in a voice conversion scenario to synthesize a personalized speech style appropriate for the application. Our experimental results demonstrate that the proposed method significantly improves style transfer performance and pronunciation intelligibility while exhibiting superior performance on low-resource and real-world data. Moreover, the proposed SDT can be easily extended to personalized speech synthesis tasks, such as voice conversion and text-to-speech, without re-training the model for each task.
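No implementation is released here, but the two core ideas in the abstract, style conditioning inside a diffusion transformer and controllable blending of source and target styles, can be sketched. The following is a minimal, hypothetical PyTorch sketch assuming an AdaLN-style conditioning scheme common to diffusion transformers; `StyleConditionedBlock`, `mix_styles`, and the `alpha` knob are illustrative names and design choices, not the authors' code.

```python
# Hypothetical sketch of a style-conditioned diffusion transformer block
# and a style-interpolation knob for controllable voice conversion.
# Module and argument names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class StyleConditionedBlock(nn.Module):
    """One transformer block whose LayerNorms are modulated by a style
    embedding (AdaLN-style conditioning, a common DiT design)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Style embedding -> per-block scale/shift/gate parameters.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) noisy mel latents; style: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(style).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

def mix_styles(src: torch.Tensor, tgt: torch.Tensor, alpha: float) -> torch.Tensor:
    """Interpolate style embeddings: alpha=0 keeps the source style,
    alpha=1 fully adopts the target speaker."""
    return (1.0 - alpha) * src + alpha * tgt

# Usage: blend styles, then run one denoising block on mel-latent frames.
block = StyleConditionedBlock(dim=256, n_heads=4)
x = torch.randn(2, 100, 256)          # noisy mel-latent frames
src_style = torch.randn(2, 256)       # source-speaker style embedding
tgt_style = torch.randn(2, 256)       # target-speaker style embedding
out = block(x, mix_styles(src_style, tgt_style, alpha=0.7))
```

In this sketch, `alpha` stands in for the flexible adjustment between source and target styles described above: intermediate values yield a personalized blend rather than a full conversion.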
- TBA
- The paper was completed in December 2023.