TTS: Excessive silence at the end of audio generated using gu-IN-DhwaniNeural voice

Question

luzhanov opened this issue 4 months ago

Describe the bug
Audio generated for the gu-IN locale with the gu-IN-DhwaniNeural voice contains about 3 seconds of silence at the end of the file. The same synthesis with the gu-IN-NiranjanNeural voice produces a normal file without the long silence (see attached samples and screenshot).

Here is the length difference between the gu-IN-NiranjanNeural voice (shorter) and the gu-IN-DhwaniNeural voice (longer) for the same text:

Audio files generated:
gu-audios.zip
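
To put a number on the trailing silence rather than reading it off a waveform, a small helper along these lines could be used. This is only a sketch under assumptions that are not part of the original report: the audio is raw mono 16-bit little-endian PCM with no WAV header (for example, synthesized with a raw PCM output format such as Raw16Khz16BitMonoPcm), and the class name, method name, and amplitude threshold are hypothetical choices for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the Speech SDK.
final class TrailingSilence {
    private TrailingSilence() {}

    /**
     * Returns the duration, in milliseconds, of the run of near-silent samples
     * at the end of a mono 16-bit little-endian PCM buffer (no WAV header).
     *
     * @param pcm          raw PCM bytes (assumption: mono, 16-bit, little-endian)
     * @param sampleRateHz sample rate of the audio, e.g. 16000
     * @param threshold    absolute amplitude at or below which a sample counts as silent
     */
    static long trailingSilenceMs(byte[] pcm, int sampleRateHz, int threshold) {
        if (sampleRateHz <= 0) {
            throw new IllegalArgumentException("sampleRateHz must be positive");
        }
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        int totalSamples = pcm.length / 2; // two bytes per sample
        int silentSamples = 0;
        // Walk backwards from the last sample until something audible is found.
        for (int i = totalSamples - 1; i >= 0; i--) {
            int sample = buf.getShort(i * 2);
            if (Math.abs(sample) > threshold) {
                break;
            }
            silentSamples++;
        }
        return silentSamples * 1000L / sampleRateHz;
    }
}
```

Running this on the audio bytes from both voices with the same sample rate and threshold would give directly comparable values. The threshold needs tuning, since synthesized "silence" is not always exactly zero.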

To Reproduce
Use the following SSML for audio generation:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts"
 version="1.0" xml:lang="gu-IN">
  <voice name="gu-IN-DhwaniNeural">
	<mstts:silence type="Leading-exact" value="0ms"/>ઉનાળો મારી પ્રિય મોસમ છે.<mstts:silence type="Tailing-exact" value="0ms"/>
  </voice>
</speak>

Expected behavior
The gu-IN-DhwaniNeural voice should generate audio without a long (~3 s) silence at the end when the SSML contains <mstts:silence type="Tailing-exact" value="0ms"/>.

Version of the Cognitive Services Speech SDK
Java SDK 1.36.0

Platform, Operating System, and Programming Language

OS: amazonlinux:2
Programming language: Java