rhasspy / larynx

End to end text to speech system using gruut and onnx

SSML file not processing under --ssml flag

PeterSprague opened this issue · comments

Testing both larynx and larynx.server, installed via pip3 in a venv. All dependencies are satisfied. Fedora 34, fully up to date.

Using the example SSML in a file TTS-SSML_test.txt:
larynx.server: paste the contents of the file into the input box and run. With the SSML checkbox either checked or unchecked, the voice recognizes the SSML commands and does not read them aloud.

Using larynx from cmd line:
$ python3 -m larynx -v southern_english_female-glow_tts < TTS-SSML_test.txt
reads whole file including all the SSML statements

$ python3 -m larynx --ssml -v southern_english_female-glow_tts < TTS-SSML_test.txt
errors:
Traceback (most recent call last):
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 479, in process
    root_element = etree.fromstring(text)
  File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 7

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__main__.py", line 720, in <module>
    main()
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__main__.py", line 294, in main
    for result_idx, result in enumerate(tts_results):
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/__init__.py", line 71, in text_to_speech
    for sentence in gruut.sentences(
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/__init__.py", line 79, in sentences
    graph, root = text_processor(text, lang=lang, ssml=ssml, **process_args)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 432, in __call__
    return self.process(*args, **kwargs)
  File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 483, in process
    root_element = etree.fromstring(f"<speak>{text}</speak>")
  File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 22

Also tried piping the file in via cat:
cat TTS-SSML_test.txt | python3 -m larynx --ssml -v southern_english_female-glow_tts
Same error
Produces audio file without the --ssml flag, but as above includes all the SSML statements

I have been through the documentation page and tried the examples to narrow this down. There is nothing specific about using an SSML file to produce the audio. The non-SSML examples all work on my workstation.

I would like to get this working for a small project that produces training audio files of Shorin-Ryu Karate Yakusokus for my black belt test practice.

Thanks,

Hi @PeterSprague, thanks for trying out Larynx 🙂

Can you post an example of your SSML? I can't seem to reproduce the issue on my machine. Maybe I have something wrong with my SSML parser.

I directly copied your SSML example from the README:
TTS-SSML_test.txt

$ python3 -m larynx --ssml -v southern_english_female-glow_tts < TTS-SSML_test.txt

OK, I see what's happening now. The command-line interface for Larynx is line-based -- it assumes each line is an individual utterance. If you remove the newline characters, it should work fine.

I may need to consider whether --ssml should imply reading the entire input as one utterance, or whether some other flag should indicate this.
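The failure described above can be reproduced outside Larynx with a minimal sketch. It uses only Python's standard xml.etree.ElementTree; the SSML lines and the try_parse helper are illustrative assumptions, not the contents of the attached file or Larynx code. Parsing the first line alone fails because <speak> is never closed, while the same lines joined into one string parse cleanly.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML, one tag per line (an assumption, not the attached file).
ssml_lines = [
    "<speak>",
    "  <s>First sentence.</s>",
    "  <s>Second sentence.</s>",
    "</speak>",
]


def try_parse(text):
    """Return None if text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)


# Line-based handling: the first line is an unclosed <speak>, so parsing fails.
first_line_error = try_parse(ssml_lines[0].strip())

# Whole input as one utterance: the document is complete, so parsing succeeds.
joined_error = try_parse(" ".join(line.strip() for line in ssml_lines))

print("first line only:", first_line_error)
print("joined input:   ", joined_error)
```

The error for the first line has the same "no element found" form as the ParseError in the traceback.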

remove the newline characters

I'm missing something here. Are you saying to create a mixed blob of text and SSML commands? How is that even decipherable by a human writer once the file gets longer than a few "sentences"?

Here is a copy of my espeak-ng SSML file, which is working well. Other than the voice name, Larynx should be able to read it as well.
Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt

$ espeak-ng -f Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt -s 150 -p 50 -l 30 -k20 -m

No, I'm suggesting something like this as a workaround:

tr < Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt '\n' ' ' | bin/larynx --ssml -v en-us

If the input all goes on one line into Larynx, it will be read correctly. This is intended to allow multiple sets of sentences to come in, like:

<speak>1st set of sentences</speak>
<speak>Next set of sentences</speak>
...

but I think with SSML, people will expect it to read the entire input at once.
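For anyone who prefers to keep the SSML file readable and flatten it in a script, here is a rough Python counterpart to the tr workaround. It is a sketch under assumptions: flatten_ssml is a hypothetical helper, not part of Larynx or gruut, and unlike tr it also drops indentation and blank lines. It checks that the flattened text is well-formed XML before returning it.

```python
import xml.etree.ElementTree as ET


def flatten_ssml(text: str) -> str:
    """Join a multi-line SSML document into a single line.

    Hypothetical helper: strips each line, skips blank ones, joins the rest
    with single spaces, and raises ET.ParseError if the result is malformed.
    """
    flat = " ".join(line.strip() for line in text.splitlines() if line.strip())
    ET.fromstring(flat)  # validation only; the parsed tree is discarded
    return flat


# Illustrative input (an assumption, not the attached espeak-ng file).
multi_line = (
    "<speak>\n"
    "  <s>First sentence.</s>\n"
    "  <s>Second sentence.</s>\n"
    "</speak>\n"
)

one_line = flatten_ssml(multi_line)
print(one_line)
```

The single-line result can then be written to standard output and piped into larynx --ssml in place of the tr step.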

OK, stripping the newlines as it reads the file:

$ tr < Yakusoku-6_attacker_detail_TTS-SSML-Espeak.txt '\n' ' ' | python3 -m larynx --ssml -v en-us

Works well, thanks

When do you think you will be extending the supported SSML set to give more control over the delivery?

What sorts of SSML tags do you think would be most useful?

TTS and SSML are very new to me; my background is more in computer vision and DL for assessing ecological impacts.

I guess it really comes back to interests and/or business case. Do you want to create a self-hosted TTS solution using ML techniques as an alternative to Azure or Google? Then I would follow their subsets of SSML. Otherwise, if it is meant for more specific use cases, honing the subset to what enhances that usage might be the preferred development direction.

For my usage, based on https://www.w3.org/TR/speech-synthesis11/#S3.2, I think having control of the voice characteristics via "3.2.4 prosody Element" would be good.

Fixed the --ssml input mode in Larynx 1.1 (it now reads the entire input).

Regarding prosody, I can control the rate and volume with GlowTTS (Larynx's TTS model), but pitch and contour aren't something that can be changed in the model.
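As an illustration of what rate and volume control looks like in SSML, the sketch below builds a small document with a prosody element as defined in section 3.2.4 of the W3C SSML 1.1 specification. Whether Larynx honors these attributes, and in which version, is not confirmed in this thread; the spoken text is a placeholder, and "slow" and "loud" are label values taken from the specification.

```python
import xml.etree.ElementTree as ET

# Build <speak><prosody rate="..." volume="...">text</prosody></speak>.
# Attribute names and label values follow the W3C SSML 1.1 prosody element;
# support for them in Larynx is an assumption, not something shown above.
speak = ET.Element("speak")
prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "volume": "loud"})
prosody.text = "Step forward with a right punch."  # placeholder text

# ET.tostring emits no newlines here, so the result is a one-line document.
ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```

Pitch and contour are left out because, as noted above, the model cannot change them.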