large-language-models speech-processing speech-synthesis text-to-speech transformers

SoundStorm: Efficient Parallel Audio Generation

Work In Progress ...

SoundStorm is a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec.
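The confidence-based parallel decoding mentioned above can be illustrated with a MaskGIT-style toy loop. This is only a sketch under stated assumptions, not the code in this repository: `MASK`, `cosine_schedule`, `toy_predict`, and `parallel_decode` are hypothetical names, the model is replaced by a random stand-in, and the sketch handles a single token sequence without conditioning on semantic tokens.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def cosine_schedule(step, total_steps, length):
    """Number of positions left masked after `step` (1-based) of `total_steps`."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return int(math.floor(length * ratio))


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the bidirectional model: one (token, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, vocab_size, total_steps=4, seed=0, predict=toy_predict):
    """Start fully masked; each step commits the most confident predictions."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, total_steps + 1):
        preds = predict(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = cosine_schedule(step, total_steps, length)
        # Commit at least one position per step so decoding always progresses.
        n_commit = min(len(masked), max(1, len(masked) - keep_masked))
        # Rank the masked positions by confidence, highest first.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens
```

Calling `parallel_decode(16, 1024)` fills all 16 positions in four passes instead of 16 sequential steps. The cosine schedule commits few tokens early, when predictions are least reliable, and more in later passes.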

Pre-processing and Training Scripts:

Dataset:

Pre-processing and data format follow the WhisperSpeech dataset: https://huggingface.co/datasets/collabora/whisperspeech

Start Training:

python train.py

Semantic token path: ./data/whisperspeech/whisperspeech/librilight/stoks/

Acoustic token path: ./data/whisperspeech/whisperspeech/librilight/encodec-6kbps/
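Before launching training, it may help to confirm that both token directories listed above exist and contain files. The helper below is a hypothetical sketch, not part of this repository; `check_token_dirs` is an illustrative name, and it makes no assumption about the file format inside the directories.

```python
from pathlib import Path

# Paths as listed in this README.
SEMANTIC_DIR = Path("./data/whisperspeech/whisperspeech/librilight/stoks/")
ACOUSTIC_DIR = Path("./data/whisperspeech/whisperspeech/librilight/encodec-6kbps/")


def check_token_dirs(*dirs):
    """Return a list of problems; an empty list means every directory exists and has entries."""
    problems = []
    for d in dirs:
        d = Path(d)
        if not d.is_dir():
            problems.append(f"missing directory: {d}")
        elif not any(d.iterdir()):
            problems.append(f"empty directory: {d}")
    return problems
```

For example, `check_token_dirs(SEMANTIC_DIR, ACOUSTIC_DIR)` returns an empty list when both directories are populated, and one message per problem otherwise.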

References:

MaskGIT code : https://github.com/dome272/MaskGIT-pytorch
SoundStorm : https://github.com/feng-yufei/shared_debugging_code

About

Google's SoundStorm: Efficient Parallel Audio Generation

MIT License