Personal tool kit to build a breton speech-to-text model with Kaldi.
Split files .split should be in the same folder as its related audio file and text file.
Install required python modules with
pip install -r requirements.txt
Numerals should be written in all letters
(E.g. "1942" --> "mil nav c'hant daou ha daou ugent")
Acronyms should be written in capital letters without dots or dashes
(E.g. UNESCO, COP21)
Regional forms should be kept if their pronouciation isn't too far from the "standard" form (according to personnal taste or political views...)
(E.g. "meump bet/on eus bet", "nege/an hini eo", "lâret/lavaret")
Each piece of data is made of 3 files : a wave file (16b, 16 KHz sample rate), a split file and a text file containing the transcription.
In the transcription files, empty lines and lines starting with the character #
should be ignored. Line comments can thus be embedded in the transcription files by using the #
character.
The first line of split files can include a header with the arguments given to wavsplit
. The header line must start with the character #
.
Inline metadata can also by found anywhere between the characters {
and }
and should be removed before the training process.
The *
character can be added if front of words you want remove from the corpus and the lexicon (foreign words, repetitions...). Marked words will still be used for training the acoustic model.
Run script unpack.py
found in each dataset folder.
Run script wavesplit.py audiofile.wav
on each audiofile in datasets to generate .split
file and manually align text (in .txt
file) with audio utterances.
Run script verify_text_files.py dataset/
on each dataset folder to quickly check spelling mistakes and register acronyms.
Those three first step need to be done only once per audio file. The following steps can be done every time you train a new model.
Create a corpus
folder and inside it, divide datasets in train
and test
sets in their respective folders.
Run script build_kaldi_files.py folder/
on train
and test
folders.
Copy generated data
folder in your Kaldi recipe folder.
Run run.sh
script in kaldi recipe folder.
Cry, wishing you had a GPU, while waiting for the model to finish its training (more than a hundred hours in my case).
When training is finally done, run script copy_final_result.sh
to copy necessary files in model
folder.
Rename model
folder to your liking and copy it in vosk model
subfolder.
Enjoy !
Tool used to extract corpus from wikipedia: https://github.com/attardi/wikiextractor
Wikipedia dumps: https://dumps.wikimedia.org/brwiki/ https://dumps.wikimedia.org/brwikiquote/
python3 WikiExtractor.py -o dumps --json brwiki-20220920-pages-articles-multistream.xml.bz2
corrected.txt
List of key -> values separated by a tabulation.
Two types of corrections, single token substitution or raw text substitution (when a space is present in key).
Keys of single token are case insensitive but corrections must be capitalized when appropriate.capitalized.txt
List of single words followed by phonemesacronyms.txt
List of acronyms (in capital letters) followed by phonemeslexicon_add.txt
Store additional regional pronouciations for breton wordslexicon_replace.txt
Replace phonetics for words in lexicon (foreign words for instance)
- Build a better tokenizer
- Text normalization
- Try reducing the size of the model (number of hidden layer) as well as the number of epochs
- Phonemes "YE N", "YOU" for plural markers rather than "I E N", "I OU" ?
- Phoneme "GW" rather than "G OU" ?
- Add breton sentences to https://commonvoice.mozilla.org/sentence-collector/
br_FR.dic
and br_FR.aff
in hunspell-dictionary
folder, from An Drouizig hunspell dictionary.