KinWaiCheuk / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality

TorToiSe

Setup

Python version

It was tested on Python 3.9.16.

Dependencies

It is best to first create a virtual environment using conda.

conda create -n tortoise python=3.9

To install all dependencies, go inside the tortoise folder and run

pip install -r requirements.txt && pip install .

Generate from a string

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --candidates 3 --preset high_quality --seed 101

Generate from a file

python tortoise/read.py --textfile text.txt --voice random --candidates 3 --preset high_quality --seed 101
  • --output_path: where to store outputs (default: results/longform/).
  • Using the | symbol in the text file will split the output. Check combine.txt for more details.
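As a rough illustration of the | behaviour, the sketch below splits a text on | and strips whitespace from each piece. The function name split_on_pipe is hypothetical and the code is not taken from read.py; it only shows the idea of choosing by hand where the clips are divided.

```python
def split_on_pipe(text):
    """Split text into chunks at each '|' and drop empty chunks."""
    chunks = [chunk.strip() for chunk in text.split("|")]
    return [chunk for chunk in chunks if chunk]


# Example input: two chunks separated by a single '|'.
sample = "Hello everyone. This is a test file.|The audio generation is much faster now."
for index, chunk in enumerate(split_on_pipe(sample)):
    print(index, chunk)
```

Each chunk would then be treated as its own clip, instead of letting the script decide where sentences end.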

Generate on the second GPU

CUDA_VISIBLE_DEVICES=1 python tortoise/do_tts.py --text="whatever prompt 2" --voice="random"
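The same GPU selection can be done when launching the script from Python. The sketch below is only an illustration: env_for_gpu and run_on_gpu are hypothetical helper names, and the only thing assumed about the script is the command line shown above. CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, so setting it to 1 hides every GPU except the second one from the child process.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(command, gpu_index):
    """Run a command (list of arguments) with a single visible GPU."""
    return subprocess.run(command, env=env_for_gpu(gpu_index), check=True)
```

For example, run_on_gpu(["python", "tortoise/do_tts.py", "--text", "whatever prompt 2", "--voice", "random"], 1) would mirror the shell command above.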

When generating from a file, read.py will break up the text file into sentences and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
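The split, generate, and combine flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in read.py: the sentence splitter is a naive regular expression, and fake_synthesize is a hypothetical stand-in for the model that just returns silent samples.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def fake_synthesize(sentence, samples_per_char=10):
    """Hypothetical stand-in for the TTS model: returns silent samples."""
    return [0.0] * (len(sentence) * samples_per_char)


def read_text(text):
    """Generate one clip per sentence, then concatenate all clips."""
    clips = [fake_synthesize(sentence) for sentence in split_sentences(text)]
    combined = [sample for clip in clips for sample in clip]
    return clips, combined
```

Keeping the individual clips as well as the combined result is what makes it possible to redo a single bad clip without regenerating everything.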

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.

Computer setup and speed

Below are the times required to generate 3 sentences of audio around 10s long. Here is the content of text.txt:

Hello everyone. This is a test file.

I have fixed the slowness by installing older packages.

The audio generation is much faster with a G P U now.

Intel i9-13900 + RTX 4070

0:54 for "Generating autoregressive samples..."

0:03 for "Computing best candidates using CLVP"

0:55 for the first "Transforming autoregressive outputs into audio.."

0:41 for the second "Transforming autoregressive outputs into audio.."

0:35 for the third "Transforming autoregressive outputs into audio.."

Intel i7-12700KF + RTX 3080 Ti

0:49 for "Generating autoregressive samples..."

0:03 for "Computing best candidates using CLVP"

0:44 for the first "Transforming autoregressive outputs into audio.."

0:33 for the second "Transforming autoregressive outputs into audio.."

0:29 for the third "Transforming autoregressive outputs into audio.."

AMD Ryzen Threadripper 3970XF + RTX Titan XP

1:40 for "Generating autoregressive samples..."

0:09 for "Computing best candidates using CLVP"

1:23 for the first "Transforming autoregressive outputs into audio.."

1:43 for the second "Transforming autoregressive outputs into audio.."

0:54 for the third "Transforming autoregressive outputs into audio.."
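To compare the three machines at a glance, the listed stage timings can be added up. The sketch below only sums the numbers reported above; it ignores model loading and any stage that is not listed, so the totals are approximate. The names to_seconds, to_stamp and TIMINGS are introduced here for illustration.

```python
def to_seconds(stamp):
    """Convert an 'M:SS' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def to_stamp(total_seconds):
    """Convert a number of seconds back to an 'M:SS' string."""
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


# Stage timings copied from the lists above, in the order they appear.
TIMINGS = {
    "Intel i9-13900 + RTX 4070": ["0:54", "0:03", "0:55", "0:41", "0:35"],
    "Intel i7-12700KF + RTX 3080 Ti": ["0:49", "0:03", "0:44", "0:33", "0:29"],
    "AMD Ryzen Threadripper 3970XF + RTX Titan XP": ["1:40", "0:09", "1:23", "1:43", "0:54"],
}

totals = {
    machine: sum(to_seconds(stamp) for stamp in stages)
    for machine, stages in TIMINGS.items()
}

for machine, total in totals.items():
    print(f"{machine}: {to_stamp(total)}")
# Intel i9-13900 + RTX 4070: 3:08
# Intel i7-12700KF + RTX 3080 Ti: 2:38
# AMD Ryzen Threadripper 3970XF + RTX Titan XP: 5:49
```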

About

License: Apache License 2.0


Languages

Python 74.7%, HTML 23.5%, Jupyter Notebook 1.8%