It was tested on Python 3.9.16.
It is better to first create a virtual environment using conda:
conda create -n tortoise python=3.9
To install all dependencies, go inside the tortoise folder and run:
pip install -r requirements.txt && pip install .
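A quick way to confirm that the active environment matches the tested interpreter is a check like the one below. This is only a minimal sketch: the helper name `is_tested_python` is made up for illustration and is not part of the repository; only the 3.9 version comes from the notes above.

```python
import sys

TESTED_VERSION = (3, 9)  # major.minor of the Python 3.9.16 mentioned above


def is_tested_python(version_info=sys.version_info):
    """Return True when the interpreter's major.minor matches TESTED_VERSION.

    Hypothetical helper for illustration; not part of the repository.
    """
    return tuple(version_info[:2]) == TESTED_VERSION


if not is_tested_python():
    # Only a warning: other versions may work, they were just not tested here.
    print("Warning: running Python %d.%d, but this was tested on 3.9"
          % tuple(sys.version_info[:2]))
```

Run it inside the activated conda environment, before installing the requirements.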
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --candidates 3 --preset high_quality --seed 101
python tortoise/read.py --textfile text.txt --voice random --candidates 3 --preset high_quality --seed 101
--output_path: Where to store outputs (default: results/longform/).
Using the | symbol in the text file will split the output. Check combine.txt for more details.
CUDA_VISIBLE_DEVICES=1 python tortoise/do_tts.py --text="whatever prompt 2" --voice="random"
This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
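The splitting step can be pictured roughly as below. This is an illustrative sketch, not the code read.py actually uses: the function name `split_for_tts` and the regular expression are assumptions, and the real script may group or split sentences differently. It only shows the idea of treating | as a forced break and then cutting each part into sentences.

```python
import re


def split_for_tts(text):
    """Split text into clips: first on '|', then on sentence-ending punctuation.

    Hypothetical helper for illustration; read.py may behave differently.
    """
    clips = []
    for section in text.split("|"):
        # Split after '.', '!' or '?' when followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", section.strip()):
            if sentence:  # skip empty pieces left by blank sections
                clips.append(sentence)
    return clips


sample = "Hello everyone. This is a test file. | The second part starts here."
for index, clip in enumerate(split_for_tts(sample)):
    print(index, clip)
```

Each returned string corresponds to one clip that would be synthesized on its own before the clips are combined.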
Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.
Below are the times required to generate 3 sentences of audio around 10 s long.
Here's the content of text.txt:
Hello everyone. This is a test file.
I have fixed the slowness by installing older packages.
The audio generation is much faster with a G P U now.
0:54 for "Generating autoregressive samples..."
0:03 for "Computing best candidates using CLVP"
0:55 for the first "Transforming autoregressive outputs into audio.."
0:41 for the second "Transforming autoregressive outputs into audio.."
0:35 for the third "Transforming autoregressive outputs into audio.."
0:49 for "Generating autoregressive samples..."
0:03 for "Computing best candidates using CLVP"
0:44 for the first "Transforming autoregressive outputs into audio.."
0:33 for the second "Transforming autoregressive outputs into audio.."
0:29 for the third "Transforming autoregressive outputs into audio.."
1:40 for "Generating autoregressive samples..."
0:09 for "Computing best candidates using CLVP"
1:23 for the first "Transforming autoregressive outputs into audio.."
1:43 for the second "Transforming autoregressive outputs into audio.."
0:54 for the third "Transforming autoregressive outputs into audio.."
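To compare the three groups of timings above at a glance, the stage times can be summed per group. The sketch below assumes that each block of five lines is one complete run and that the stages run one after another, so their times add up; the groups are not labeled here, so they are simply numbered in the order listed.

```python
def to_seconds(stamp):
    """Convert an 'M:SS' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def to_stamp(total_seconds):
    """Convert a number of seconds back to 'M:SS'."""
    return "%d:%02d" % divmod(total_seconds, 60)


# Stage timings copied from the three groups listed above, in order.
runs = [
    ["0:54", "0:03", "0:55", "0:41", "0:35"],
    ["0:49", "0:03", "0:44", "0:33", "0:29"],
    ["1:40", "0:09", "1:23", "1:43", "0:54"],
]

# Total seconds per group, assuming the stages are sequential.
totals = [sum(to_seconds(stamp) for stamp in run) for run in runs]
for number, total in enumerate(totals, start=1):
    print("group %d: %s" % (number, to_stamp(total)))
```

Under those assumptions the totals come to 3:08, 2:38 and 5:49 for the first, second and third group.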