Awesome in english but no support for other languages - please add an example for another language (german, italian, french etc)

Question

cmp-nct opened this issue a year ago · comments

The readme makes it sound very simple: "Replace bert with xphonebert"
Looking a bit closer, it looks like it's quite a feat to make StyleTTS2 talk in non-English languages (#28)

StyleTTS2 looks like the best approach we have right now, but being English-only is a dealbreaker for many, as it means any app will be limited to English with no prospect for other users in sight.

Some help to get this going in foreign languages would be awesome.

It appears we need to change inference code and re-train text and phonetics. Any demo/guide would be great

Alternatively, the current PL-BERT could be re-trained for other languages, though that needs a corpus and I have no idea of the cost.
(https://github.com/yl4579/PL-BERT)

Aaron (Yinghao) Li · Answer 1 · Mon Nov 20 2023 09:50:43 GMT+0800 (China Standard Time)

The repo so far is a research project, and its main purpose is to serve as a proof of concept for the paper rather than a full-fledged open source project. I agree that PL-BERT is the major obstacle to generalizing to other languages, but training large-scale language models, particularly on multiple languages, can be very challenging. With the resources I have at school, training PL-BERT on an English-only corpus with 3 A40s took me a month. With all the ablation studies and experiments, I spent an entire summer on this project for a single language.

I'm not affiliated with any company and I'm only a PhD student, and the GPU resources in our lab need to be prioritized for new research projects. I don't think I will have the resources to train a multilingual PL-BERT model for the time being, so PL-BERT is probably not the best approach to multilingual models for StyleTTS 2.

I have never tried XPhoneBERT myself, but it seems to be a promising alternative to PL-BERT. The only problem is that it uses a different phonemizer, which may also be related to #40. The current phonemizer was taken from VITS, which also incurs license issues (MIT vs. GPL). It would be great if someone could help switch the phonemizer and BERT model to something like XPhoneBERT that is compatible with the MIT license and also supports multiple languages.

The basic idea is to re-train the ASR model (https://github.com/yl4579/AuxiliaryASR) using the phonemizer of XPhoneBERT, replace PL-BERT with XPhoneBERT, and re-train the model from scratch. Since the models, especially the LibriTTS model, took about 2 weeks to train on 4 A100s, I do not think I have enough GPU resources to work on this for the time being. If anyone is willing to sponsor GPUs and datasets for either a multilingual PL-BERT or an XPhoneBERT StyleTTS 2, I'm happy to extend this project in the multilingual direction.

John · Answer 2 · Mon Nov 20 2023 10:02:18 GMT+0800 (China Standard Time)

I think it would be doable to get the GPU time, 1 week of 8xA100 maybe in exchange of naming the resulting model after the sponsor. One of the cloud providers might be interested, or some guys from the ML discords who train a lot might have it spare.
I was offered GPU time once, could ask the guy.
But without datasets that wouldn't help
That said: If you need GPU time let me know, I'll ask

Datasets:
German: TTS dataset from a university (high quality, 6 main speakers, I think 40-50 hours of studio quality recordings)
https://opendata.iisys.de/dataset/hui-audio-corpus-german/ (https://github.com/iisys-hof/HUI-Audio-Corpus-German)
https://github.com/thorstenMueller/Thorsten-Voice (11 hours, one person)

Italian: TTS dataset, LJSpeech affiliated ?
https://huggingface.co/datasets/z-uo/female-LJSpeech-italian
https://huggingface.co/datasets/z-uo/male-LJSpeech-italian

Multilingual:
https://www.openslr.org/94/ (audiobook based libritts)
https://github.com/freds0/CML-TTS-Dataset (more than 3000 hours, CC licensed)

Sidenote: For detecting unclean audio, possibly "CLAP" from LAION could be used.

Aaron (Yinghao) Li · Answer 3 · Mon Nov 20 2023 10:09:58 GMT+0800 (China Standard Time)

Multilingual speech datasets are more difficult to get than language datasets. XPhoneBERT, for example, was trained entirely on Wikipedia in 100+ languages, but getting 100+ languages of speech data with transcriptions is more difficult. XTTS has multilingual support but the data used seems to be private. I believe the creator @erogol was once interested in StyleTTS but did not proceed to integrate it into the Coqui API for some reason. It would be great if he could help with multilingual support. I will ping him to see if he is still interested.

John · Answer 4 · Mon Nov 20 2023 11:46:29 GMT+0800 (China Standard Time)

I found quite good datasets for Italian and German, will take another look for more. Will update the previous comment.
About how much data (length, # of speakers) is needed when training ?

Aaron (Yinghao) Li · Answer 5 · Mon Nov 20 2023 13:55:10 GMT+0800 (China Standard Time)

If you want cross-lingual generalization, I think each language should be at least 100 hours. The data you provide probably is good for a single speaker model, but not enough for zero-shot models like XTTS. It is not feasible to get a model like that with publicly available data. We probably have to rely on something like multilingual librispeech (https://www.openslr.org/94/) and use some speech restoration models to remove bad samples. This is not a single person's effort, so everyone else is welcome to contribute.

SmileSky · Answer 6 · Tue Nov 21 2023 17:56:42 GMT+0800 (China Standard Time)

It's a pity not supporting Chinese.

hobodrifterdavid · Answer 7 · Tue Nov 21 2023 21:27:59 GMT+0800 (China Standard Time)

I can make a 8x 3090 (24GB) machine available, if it's of use. 2x Xeon E5-2698 v3 cpus, 128GB ram. Alternatively: a 4x 3090 box with nvlinks, Epyc 7443p, 256GB, pcie 4.0. Send a mail to dioco@dioco.io

tosunozgun · Answer 8 · Tue Nov 21 2023 23:22:15 GMT+0800 (China Standard Time)

I can help with training a Turkish model; I just need some help training PL-BERT on the Turkish Wikipedia dataset.

Aaron (Yinghao) Li · Answer 9 · Wed Nov 22 2023 04:20:32 GMT+0800 (China Standard Time)

@hobodrifterdavid Thanks so much for your help. What you have now is probably good for multilingual PL-BERT training as long as you can keep this machine running for at least a couple of months or so. Just sent you an email for multilingual PL-BERT training.

Aaron (Yinghao) Li · Answer 10 · Wed Nov 22 2023 05:34:16 GMT+0800 (China Standard Time)

I think the GPUs provided by @hobodrifterdavid would be a great start for multilingual PL-BERT training. Before proceeding though, I need some people who speak as many languages as possible (hopefully also have some knowledge in IPA) to help with the data preparation. I only speak English, Chinese and Japanese, so I can only help with these 3 languages.

My plan is to use this multilingual BERT tokenizer: https://huggingface.co/bert-base-multilingual-cased. We tokenize the text, get the corresponding tokens, use phonemizer to get the corresponding phonemes, and align the phonemes with the tokens. Since this tokenizer is subword-based, we cannot predict the subword grapheme tokens: a subword is not a full written word anyway, and we cannot really align part of a word to some of its phonemes. For example, in English "phonemes" can be tokenized into phone#, #me#, #s, but its actual phonemes are /ˈfəʊniːmz/, which cannot be aligned perfectly with phone#, #me# or #s. So my idea is, instead of predicting the grapheme tokens, to predict the contextualized embeddings from a pre-trained BERT model.

For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding phonemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each of the phoneme representations in ˈsɛnʔn̩ts to the average contextualized BERT embedding of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).
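A minimal sketch of how the targets described above could be built, assuming WordPiece-style tokens where a "##" prefix marks a continuation piece (the sen#/#tence notation in the comment is informal). All function names and the toy 2-dimensional embeddings are made up for illustration; in a real setup the token embeddings would come from the teacher model, and this is not taken from the PL-BERT code.

```python
import math

def group_subwords(tokens):
    """Group token indices into words; a '##' token continues the previous word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def word_targets(tokens, token_embeddings):
    """One target per word: the average embedding of its subword tokens."""
    return [mean_vector([token_embeddings[i] for i in group])
            for group in group_subwords(tokens)]

def phoneme_targets(targets, phonemes_per_word):
    """Repeat each word target once for every phoneme of that word."""
    out = []
    for target, n_phonemes in zip(targets, phonemes_per_word):
        out.extend([target] * n_phonemes)
    return out

def cosine_loss(pred, target):
    """1 - cosine similarity; minimizing it maximizes the similarity."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

# Toy example: 6 subword tokens, 5 words, 2-dimensional "teacher" embeddings.
tokens = ["this", "is", "a", "test", "sen", "##tence"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
targets = word_targets(tokens, toy_embeddings)
```

Here `targets[4]` is the average of the embeddings for "sen" and "##tence", and `phoneme_targets` would assign that same vector to every phoneme of the word.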

Now the biggest challenge is aligning the tokenizer output to the graphemes, which may require some expertise in the specific languages. There could be potential quirks, inaccuracies or traps for certain languages. For example, phonemizer doesn't work with Japanese and Chinese directly; you have to first convert the graphemes into an alphabetic form and then use phonemizer. The characters in these languages do not always have the same pronunciation, as it depends on the context, so expertise in these languages is needed when doing NLP with them. To make sure the data preprocessing goes as smoothly and accurately as possible, any help from those who speak any language in this list (or know some linguistics about these languages) is greatly appreciated.

Soshyant · Answer 11 · Wed Nov 22 2023 05:45:48 GMT+0800 (China Standard Time)

I can speak Persian, Japanese and a little bit of Arabic (and I have a friend who is fluent in it as well). I would very much like to help you with this.
I'm also gathering labeled speech data for these languages right now (I have a little less than 100 hours for Persian and a bit for the other two). So count me in, please.

Aaron (Yinghao) Li · Answer 12 · Wed Nov 22 2023 05:50:09 GMT+0800 (China Standard Time)

@SoshyHayami Thanks for your willingness to help.

Fortunately, I think most other languages that have whitespace between words can be handled with the same logic. The only supported languages that do not have spaces between words are Chinese, Japanese (including Korean Hanja, rarely), and Burmese. These are probably languages that need to be handled with their own logic. I can handle the first two languages, and we just need someone to handle the other two (Korean Hanja and Burmese).

SmileSky · Answer 13 · Wed Nov 22 2023 06:56:59 GMT+0800 (China Standard Time)

It would be great if it could support Chinese language! I am a native Chinese, and I don't know what help I can provide?

Aaron (Yinghao) Li · Answer 14 · Wed Nov 22 2023 07:15:53 GMT+0800 (China Standard Time)

Maybe I'll create a new branch in the PL-BERT repo for multilingual processing scripts. Chinese and Japanese definitely need to be processed separately with their own logic. @mzdk100 If you have a good Chinese phonemizer (Chinese characters to pinyin), you are welcome to contribute.

Soshyant · Answer 15 · Wed Nov 22 2023 07:55:24 GMT+0800 (China Standard Time)

In the case of Japanese, since it already has kana, which is basically an alphabet, can't we simply restrict it to just that for now? (Kana and romaji should be easier to phonemize, if I'm not mistaken.)
Sorry, it might be a stupid idea, but I was wondering: if we had another language model that recognizes the correct pronunciations from the context and converts the text (with the converted text then handed over to the phonemizer), maybe it could make things a bit easier here.

Though it'll probably make inference a torture on low-performance devices as well.

SmileSky · Answer 16 · Wed Nov 22 2023 08:11:39 GMT+0800 (China Standard Time)

@yl4579
There are two main libraries for handling Chinese tokens, jieba and pypinyin.
jieba does Chinese word segmentation, while pypinyin converts Chinese characters to pinyin.

pip3 install jieba pypinyin

from pypinyin import lazy_pinyin, pinyin, Style
print(pinyin('朝阳')) # [['zhāo'], ['yáng']]
print(pinyin('朝阳', heteronym=True)) # [['zhāo', 'cháo'], ['yáng']]
print(lazy_pinyin('聪明的小兔子')) # ['cong', 'ming', 'de', 'xiao', 'tu', 'zi']
print(lazy_pinyin('聪明的小兔子', style=Style.TONE3)) # ['cong1', 'ming2', 'de', 'xiao3', 'tu4', 'zi']

There are many Chinese characters, and using pinyin can greatly reduce the vocabulary size and potentially make the model smaller.

import jieba
print(list(jieba.cut('你好，我是**人'))) # ['你好', '，', '我', '是', '**', '人']
print(list(jieba.cut_for_search('你好，我是**人'))) # ['你好', '，', '我', '是', '**', '人']

If using word segmentation mode, the model can learn more natural language features, but the Chinese vocabulary is very large, so the model may become very large and the compute requirements enormous.
Pinyin mode is highly recommended, as the converted text looks more like English and does not require many changes to the training code.

print(' '.join(lazy_pinyin('聪明的小兔子', style=Style.TONE3))) # 'cong1 ming2 de xiao3 tu4 zi'

John · Answer 17 · Thu Nov 23 2023 06:42:23 GMT+0800 (China Standard Time)

If german ears are needed, I'd be happy to lend

nicognaw · Answer 18 · Thu Nov 23 2023 11:00:44 GMT+0800 (China Standard Time)

https://github.com/rime/rime-terra-pinyin/blob/master/terra_pinyin.dict.yaml

From the industrial world, this is the characters-to-pinyin solution that the well-known input method editor Rime uses.

dsplog · Answer 19 · Thu Nov 23 2023 11:08:00 GMT+0800 (China Standard Time)

any help from those who speaks any language in this list (or knows some linguistics about these languages) is greatly appreciated

keen to extend this to malayalam, dravidian language spoken in south india. will help for that.

rjrobben · Answer 20 · Fri Nov 24 2023 20:59:12 GMT+0800 (China Standard Time)

I hope Cantonese or Traditional Chinese is also considered when training the multilingual system, I can definitely help regarding this language. Is there any cooperation channel for this task?

mrfakename · Answer 21 · Sat Nov 25 2023 06:19:03 GMT+0800 (China Standard Time)

Personally, I do not support Coqui TTS. XTTS is not open-sourced according to OSI because of its ultra-restrictive license. I believe that the future of TTS lies in open-source models such as StyleTTS.

Aaron (Yinghao) Li · Answer 22 · Sat Nov 25 2023 06:30:44 GMT+0800 (China Standard Time)

@rjrobben I have created a slack channel for this multilingual PL-BERT: https://join.slack.com/t/multilingualstyletts2/shared_invite/zt-2805io6cg-0ROMhjfW9Gd_ix_FJqjGmQ

Aaron (Yinghao) Li · Answer 23 · Sat Nov 25 2023 06:31:16 GMT+0800 (China Standard Time)

Also yl4579/PL-BERT#22 this maybe helpful, if anyone could try it out.

mrfakename · Answer 24 · Sat Nov 25 2023 06:36:47 GMT+0800 (China Standard Time)

@yl4579 Thanks for making the slack channel! Are you planning to make a slack channel for general StyleTTS 2-related discussions as well? Just because GH Discussions isn't realtime?

Aaron (Yinghao) Li · Answer 25 · Sat Nov 25 2023 06:38:59 GMT+0800 (China Standard Time)

@fakerybakery I can make this channel generally StyleTTS2-related if it is better. I can change the title to StyleTTS 2 instead.

mrfakename · Answer 26 · Sat Nov 25 2023 06:39:45 GMT+0800 (China Standard Time)

Great, thanks! Maybe make one chatroom just about BERT instead?

Aaron (Yinghao) Li · Answer 27 · Sat Nov 25 2023 06:42:47 GMT+0800 (China Standard Time)

Yeah I've already done that. There's a channel about multilingual PLBERT.

mrfakename · Answer 28 · Sat Nov 25 2023 06:44:31 GMT+0800 (China Standard Time)

Great! Are you planning to add the link to the README?

Aaron (Yinghao) Li · Answer 29 · Sat Nov 25 2023 06:49:07 GMT+0800 (China Standard Time)

It expires every 30 days. I don't know if there's a better way to get a permanent link.

mrfakename · Answer 30 · Sat Nov 25 2023 06:53:30 GMT+0800 (China Standard Time)

I think there's a way to set it to never expire, right?

Aaron (Yinghao) Li · Answer 31 · Sat Nov 25 2023 07:41:15 GMT+0800 (China Standard Time)

Yes I did that. Added to README.

Aaron (Yinghao) Li · Answer 32 · Sat Nov 25 2023 12:49:05 GMT+0800 (China Standard Time)

It seems I couldn't get any data that was not already processed by Huggingface:

Using custom data configuration 20230701.bn-date=20230701,language=bn
Old caching folder /root/.cache/huggingface/datasets/wikipedia/20230701.bn-date=20230701,language=bn/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559 for dataset wikipedia exists but not data were found. Removing it. 
Downloading and preparing dataset wikipedia/20230701.bn to file:///root/.cache/huggingface/datasets/wikipedia/20230701.bn-date=20230701,language=bn/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 5152.71it/s]
Extracting data files: 100%|████████████████████| 1/1 [00:00<00:00, 2211.02it/s]
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 7667.83it/s]
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/root/.local/share/jupyter/runtime/kernel-5364407a-2c52-4d34-99f7-2eb08d56bdd7.json']
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/root/.local/share/jupyter/runtime/kernel-5364407a-2c52-4d34-99f7-2eb08d56bdd7.json']
ERROR:apache_beam.runners.common:Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: file:///root/.cache/huggingface/datasets/wikipedia/20230701.bn-date=20230701,language=bn/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/wikipedia-train [while running 'train/Save to parquet/Write/WriteImpl/InitializeWrite']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 851, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 997, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1961, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/io/iobase.py", line 1140, in <lambda>
    lambda _, sink: sink.initialize_write(), self.sink)
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/options/value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/io/filebasedsink.py", line 173, in initialize_write
    tmp_dir = self._create_temp_dir(file_path_prefix)
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/io/filebasedsink.py", line 178, in _create_temp_dir
    base_path, last_component = FileSystems.split(file_path_prefix)
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/io/filesystems.py", line 151, in split
    filesystem = FileSystems.get_filesystem(path)
  File "/root/anaconda3/envs/BERT/lib/python3.8/site-packages/apache_beam/io/filesystems.py", line 103, in get_filesystem
    raise ValueError(
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: file:///root/.cache/huggingface/datasets/wikipedia/20230701.bn-date=20230701,language=bn/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/wikipedia-train

If anyone knows how to deal with this problem please let me know. I have searched online and couldn't find any solution so far. Closest issue I found so far with no solution: huggingface/datasets#6147

The code I used:

from datasets import load_dataset
dataset = load_dataset('wikipedia', date="20230701", language="bn", split='train', beam_runner='DirectRunner')

Aaron (Yinghao) Li · Answer 33 · Sat Nov 25 2023 12:58:49 GMT+0800 (China Standard Time)

Solved by using dataset = load_dataset('wikimedia/wikipedia', "20230701.bn", split='train'). This is a preprocessed dataset: https://huggingface.co/datasets/wikimedia/wikipedia

Aaron (Yinghao) Li · Answer 34 · Sat Nov 25 2023 13:27:19 GMT+0800 (China Standard Time)

UPDATE: Ended up git cloning the subfolder and loading the files locally.

Can anyone download the dataset though? It keeps downloading the entire dataset which ends in failure (connection issue) and if re-run it will start from the beginning so the process will never finish.

Does anyone know how to load a subset of a single language? dataset = load_dataset('wikimedia/wikipedia', "20230701.bn", split='train') doesn't work.

Soshyant · Answer 35 · Sat Nov 25 2023 15:39:59 GMT+0800 (China Standard Time)

Oh yeah, that dataset is a nightmare to load. I don't know why, but I could only load it with Google Colab instead of my own PC last time I tried. As you mentioned, git cloning and loading the files locally should work.

Aaron (Yinghao) Li · Answer 36 · Sun Nov 26 2023 04:43:00 GMT+0800 (China Standard Time)

Unfortunately the machine sponsored by @hobodrifterdavid is down. I managed to write the data preprocessing script for most languages. My lab is currently short of GPUs as we are working on some projects using LLMs. The CPUs can still be used, so I'm now running the preprocessing on my lab's machines because it does not use any GPU resources. Once it is done I can upload it to more stable GPU machines that someone can sponsor (if any).

mrfakename · Answer 37 · Sun Nov 26 2023 08:29:39 GMT+0800 (China Standard Time)

Colab is probably too weak, right? I think Paperspace charges around $0.50/hr for A100, not sure if that's too much

Aaron (Yinghao) Li · Answer 38 · Sun Nov 26 2023 08:39:36 GMT+0800 (China Standard Time)

@fakerybakery It is back online now but it was rebooted. I think it's quite unstable given how quickly this happened (within a day of my starting to work on it). Colab is too expensive and also has no multi-GPU support. I may just stick to this one and monitor the process when I get to training. People who have extra time can also help with it and ask @hobodrifterdavid for access.

Aaron (Yinghao) Li · Answer 39 · Wed Nov 29 2023 16:13:34 GMT+0800 (China Standard Time)

I have preprocessed 70 languages so far, and most look good upon manual inspections (validated using wiktionary). The only ones left are zh, zh-yue, ja, my (Chinese, Cantonese, Japanese and Burmese).

There are a few languages that are broken. If any of you speak any of the following languages, please join the Slack space and help in the multilingual-PL-BERT channel if possible.

bn: Bengali (phonemizer seems less accurate than charsiuG2P)
cs: Czech (same as above)
hak: Hakka (tones are phonemized and has "-", need fix)
ko: Korean (has "-" for some reason for words)
ms: Malay (has "-" for some reason)
ru: Russian (phonemizer is inaccurate for some phonemes, like tʃ/ʒ should be t͡ɕ/ʐ)
th: Thai (phonemizer totally broken)
uk: Ukrainian (phonemizer is worse than charsiuG2P)
vi: Vietnamese (has tones)

Hakka and Vietnamese seem like an easy fix: just stripping all the numbers from the phonemized results should be fine. Korean and Malay also seem like an easy fix, but I don't know if "-" means anything for these languages and whether removing it is okay. Thai seems totally broken, so it has to be handled separately just like the remaining four languages.
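The two "easy fixes" could look roughly like the sketch below. The function names and the sample strings are invented for illustration and are not real phonemizer output. Since it is still an open question whether dropping "-" is safe for Korean and Malay, hyphen removal is opt-in here.

```python
import re

# Matches any run of digits, e.g. tone numbers attached to syllables.
_DIGITS = re.compile(r"\d+")

def strip_tone_numbers(phonemes: str) -> str:
    """Remove every run of digits from a phonemized string."""
    return _DIGITS.sub("", phonemes)

def strip_hyphens(phonemes: str) -> str:
    """Remove '-' characters; only use once confirmed safe for the language."""
    return phonemes.replace("-", "")

def clean_phonemes(phonemes: str, remove_hyphens: bool = False) -> str:
    """Strip tone numbers and, optionally, hyphens."""
    cleaned = strip_tone_numbers(phonemes)
    if remove_hyphens:
        cleaned = strip_hyphens(cleaned)
    return cleaned

# Placeholder strings, not actual phonemizer output.
print(clean_phonemes("ab1 cd23"))                    # ab cd
print(clean_phonemes("ab-cd", remove_hyphens=True))  # abcd
```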

The rest may be fixed by charsiuG2P, but charsiuG2P can't handle numbers or dates etc., which can be problematic.

ismail-yussuf · Answer 40 · Wed Nov 29 2023 19:40:52 GMT+0800 (China Standard Time)

hey guys i'm working on a project for a TTS model that is good with Somali. I don't see that many TTS models that support Somali at all. I'm collecting high quality data for it as we speak.

@yl4579 is it fine by you guys if we also add Somali into the mix? Based on @yl4579's description of what makes a language a good fit, I believe Somali would work well, as it's written in Latin letters just like Spanish.

Also, if we need more GPUs, would renting some cloud GPUs on RunPod be beneficial? I'm willing to help out on that end as well.

mrfakename · Answer 41 · Thu Nov 30 2023 08:13:55 GMT+0800 (China Standard Time)

Hi @ismail-yussuf, we're working on adding more languages. If you're interested in this, please join the Slack channel!

Aaron (Yinghao) Li · Answer 42 · Thu Nov 30 2023 16:45:07 GMT+0800 (China Standard Time)

I found the quality is not very good, for example:

pinyin("他把这个还我了")

The output is:

[['tā'], ['bǎ'], ['zhè'], ['gè'], ['hái'], ['wǒ'], ['le']]

In this case "还" is a verb and should be read "huán" instead of "hái". Another case is

pinyin("不得了了")

The output is:

[['bù'], ['dé'], ['le'], ['le']]

The first "了" is part of the word "得了", which is an adverb and should be read as "de liao", while the second "了" is a particle that marks the tense. The library clearly can't tell the difference.

SmileSky · Answer 43 · Thu Nov 30 2023 21:18:02 GMT+0800 (China Standard Time)

Indeed, the output result is incorrect.

duchengxian · Answer 44 · Fri Dec 01 2023 09:32:06 GMT+0800 (China Standard Time)

from g2pw import G2PWConverter
from multiprocessing import Process, freeze_support

if __name__ == '__main__':
    freeze_support()

    conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
    print(conv('他还把这个还我了。不得了了。'))

got better result:
[['ta1', 'hai2', 'ba3', 'zhe4', 'ge5', 'huan2', 'wo3', 'le5', None, 'bu4', 'de2', 'liao3', 'le5', None]]
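
To feed a result like the one above into a text pipeline, the `None` entries need to be filled back in. The sketch below assumes, as the output above suggests, that there is exactly one entry per input character and that `None` marks characters with no pinyin (here the punctuation); `merge_g2pw_output` is a hypothetical helper, not part of g2pW.

```python
def merge_g2pw_output(sentence, syllables):
    """Join pinyin syllables, keeping the original character wherever the
    converter returned None.

    Assumption: len(syllables) == len(sentence), one entry per character.
    """
    if len(sentence) != len(syllables):
        raise ValueError("expected one syllable entry per input character")
    return " ".join(
        syl if syl is not None else ch for ch, syl in zip(sentence, syllables)
    )


sentence = "他还把这个还我了。不得了了。"
# Copied from the converter output shown above (first and only sentence).
syllables = ["ta1", "hai2", "ba3", "zhe4", "ge5", "huan2", "wo3", "le5", None,
             "bu4", "de2", "liao3", "le5", None]
print(merge_g2pw_output(sentence, syllables))
```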

duchengxian · Answer 45 · Fri Dec 01 2023 09:55:20 GMT+0800 (China Standard Time)

This library discriminates quite well; it can also tell apart another, less common usage of 了了:
小时了了，大未必佳。[['xiao3', 'shi2', 'liao3', 'liao3', None, 'da4', 'wei4', 'bi4', 'jia1', None]]

SmileSky · Answer 46 · Fri Dec 01 2023 10:01:37 GMT+0800 (China Standard Time)

Very good.

Aaron (Yinghao) Li · Answer 47 · Fri Dec 01 2023 11:07:15 GMT+0800 (China Standard Time)

@duchengxian This looks very good. I think the dataset preparation is almost done. I will upload all the data to huggingface and wait for @hobodrifterdavid to respond and set up the 8 GPU machine for training.

dsplog · Answer 48 · Sat Dec 02 2023 11:33:37 GMT+0800 (China Standard Time)

@yl4579: can you please take a look at yl4579/PL-BERT#27? It adds the code changes needed to support Malayalam, based on bert-base-multilingual-cased.

Ardha · Answer 49 · Mon Dec 04 2023 11:12:19 GMT+0800 (China Standard Time)

I have preprocessed 70 languages so far, and most look good upon manual inspections (validated using wiktionary). The only ones left are zh, zh-yue, ja, my (Chinese, Cantonese, Japanese and Burmese).

There are a few languages that are broken. If any of you speak any of the following languages, please join the Slack space and help in the multilingual-PL-BERT channel if possible.

bn: Bengali (phonemizer seems less accurate than charsiuG2P)

cs: Czech (same as above)

hak: Hakka (tones are phonemized and has "-", need fix)

ko: Korean (has "-" for some reason for words)

ms: Malay (has "-" for some reason)

ru: Russian (phonemizer is inaccurate for some phonemes, like tʃ/ʒ should be t͡ɕ/ʐ)

th: Thai (phonemizer totally broken)

uk: Ukrainian (phonemizer is worse than charsiuG2P)

vi: Vietnamese (has tones)

Hakka and Vietnamese seem like an easy fix: stripping all the numbers from the phonemized results should be fine. Korean and Malay also seem like an easy fix, but I don't know whether "-" means anything in these languages or whether removing it is okay. Thai seems totally broken, so it has to be handled separately, just like the remaining four languages.

The rest may be fixed by charsiuG2P, but charsiuG2P can't handle numbers or dates etc., which can be problematic.

Is there anything I can do to help add Indonesian?

mrfakename · Answer 50 · Tue Dec 05 2023 06:30:39 GMT+0800 (China Standard Time)

People are working on a phonemizer replacement. Do you want Indonesian?

Ardha · Answer 51 · Tue Dec 05 2023 09:50:41 GMT+0800 (China Standard Time)

Yes, I do.

Aaron (Yinghao) Li · Answer 52 · Tue Dec 05 2023 15:26:20 GMT+0800 (China Standard Time)

@ardha27 I think it was already included in the processed dataset, and the espeak IPA results are good enough.

Ardha · Answer 53 · Tue Dec 05 2023 15:37:15 GMT+0800 (China Standard Time)

Is it already pushed to the current branch? Sorry, but how can I use it?

Aaron (Yinghao) Li · Answer 54 · Tue Dec 05 2023 15:43:05 GMT+0800 (China Standard Time)

@ardha27 No, it is included in the training data for multilingual PL-BERT model. The training hasn't started yet. I'm still waiting for the 8 GPU machine from @hobodrifterdavid

dsplog · Answer 55 · Mon Dec 11 2023 09:37:55 GMT+0800 (China Standard Time)

For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding graphemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each grapheme representation in ˈsɛnʔn̩ts to the average contextualized BERT embeddings of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).

@yl4579 : are the changes for the subword tokenizations available?

Aaron (Yinghao) Li · Answer 56 · Tue Dec 12 2023 11:33:30 GMT+0800 (China Standard Time)

@dsplog I haven't implemented them yet. I'm done with most of the data preprocessing and just need people to fix the following languages. If there is no response for these languages before I come back from NeurIPS (Dec 18), I will proceed to training the multilingual PL-BERT. I will have to remove Thai and use phonemizer results for the following languages.

bn: Bengali (phonemizer seems less accurate than charsiuG2P)
cs: Czech (same as above)
ru: Russian (phonemizer is inaccurate for some phonemes, like tʃ/ʒ should be t͡ɕ/ʐ)
th: Thai (phonemizer totally broken)
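
For the Russian item, a post-processing pass over the phonemizer output might look like the sketch below. It assumes a blanket string substitution of the two symbols named above is acceptable, which a Russian speaker would need to confirm (for example, it would also rewrite the ʒ inside a dʒ sequence); `fix_russian_ipa` and the sample strings are hypothetical.

```python
# Substitutions taken from the list above: tʃ -> t͡ɕ and ʒ -> ʐ.
# Escapes are used so the combining tie bar (U+0361) is unambiguous.
RU_FIXES = [
    ("t\u0283", "t\u0361\u0255"),  # tʃ -> t͡ɕ
    ("\u0292", "\u0290"),          # ʒ  -> ʐ
]


def fix_russian_ipa(phonemes: str) -> str:
    """Apply each substitution in order to a phonemized string."""
    for wrong, right in RU_FIXES:
        phonemes = phonemes.replace(wrong, right)
    return phonemes


# Hypothetical phoneme strings, not real phonemizer output.
print(fix_russian_ipa("t\u0283as"))
print(fix_russian_ipa("\u0292ena"))
```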

Gayatri Vadaparty · Answer 57 · Wed Dec 13 2023 22:58:03 GMT+0800 (China Standard Time)

I think the GPUs provided by @hobodrifterdavid would be a great start for multilingual PL-BERT training. Before proceeding though, I need some people who speak as many languages as possible (hopefully also have some knowledge in IPA) to help with the data preparation. I only speak English, Chinese and Japanese, so I can only help with these 3 languages.

My plan is to use this multilingual BERT tokenizer: https://huggingface.co/bert-base-multilingual-cased, tokenize the text, get the corresponding tokens, use phonemizer to get the corresponding phonemes, and align the phonemes with tokens. Since this tokenizer is subword, we cannot predict the subword grapheme tokens. So my idea is instead of predicting the grapheme tokens (which is not a full grapheme anyway, and we cannot really align half of a grapheme to some of its phonemes, like in English "phonemes" can be tokenized into phone#, #me#, #s, but the actual phonemes of it is /ˈfəʊniːmz/, which cannot be aligned perfectly with either phone# or #me# or #s) we predict the contextualized embeddings from a pre-trained BERT model.
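
As a rough illustration of the alignment step in this plan, the sketch below collects subword pieces back into words so that each word can later be paired with its phonemes. It assumes the WordPiece convention in which continuation pieces carry a leading "##" (the post writes them as sen#, #tence); the token list is typed in by hand rather than produced by the real tokenizer, and `group_wordpieces` is a hypothetical helper.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into words.

    Assumption: a token starting with "##" continues the previous word.
    """
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    return words


# Hand-written tokens standing in for real tokenizer output.
tokens = ["this", "is", "a", "test", "sen", "##tence"]
for pieces in group_wordpieces(tokens):
    print(pieces)
```

Each inner list would then be matched with the phonemes of one word, which is where the language-specific quirks mentioned in the post come in.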

For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding graphemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each grapheme representation in ˈsɛnʔn̩ts to the average contextualized BERT embeddings of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).
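
The objective described here can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the vectors are toy two-dimensional lists rather than real BERT embeddings, "maximize the cosine similarity" is written as minimizing 1 minus the similarity, and `mean_pool`, `cosine_similarity` and `word_loss` are hypothetical names that do not come from the PL-BERT code.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_loss(predicted, teacher_subword_embeddings):
    """Mean of (1 - cosine similarity) over the predicted vectors of one word.

    The target for every position of the word is the average of the
    teacher embeddings of its subword tokens.
    """
    target = mean_pool(teacher_subword_embeddings)
    losses = [1.0 - cosine_similarity(p, target) for p in predicted]
    return sum(losses) / len(losses)


# Toy teacher embeddings for the two subword tokens of one word.
teacher = [[1.0, 0.0], [0.0, 1.0]]
# Toy predictions for two positions of the phonemized word.
predicted = [[2.0, 2.0], [1.0, 1.0]]
print(word_loss(predicted, teacher))
```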

Now the biggest challenge is aligning the tokenizer output to the graphemes, which may require some expertise in the specific languages. There could be quirks, inaccuracies or traps for certain languages. For example, phonemizer doesn't work with Japanese and Chinese directly; you have to first convert the graphemes into alphabetic form and then use phonemizer. The characters in these languages do not always have the same pronunciation, depending on the context, so expertise in these languages is needed when doing NLP with them. To make sure the data preprocessing goes as smoothly and accurately as possible, any help from those who speak any language in this list (or know some linguistics about these languages) is greatly appreciated.

Hey, I would love to work on this. I really like the model you've created. I'm using it in my work, trying different TTS models and comparing voice-overs. I just learned that StyleTTS needs multilingual support. I can help with Telugu training, and I know people who speak Hindi as well. I'm from India.

somerandomguyontheweb · Answer 58 · Thu Dec 14 2023 20:55:14 GMT+0800 (China Standard Time)

Hi @yl4579, thank you for this awesome project. Just wanted to clarify if there are any plans to add support for Belarusian, my native tongue. Apparently espeak-ng supports it, but when I attempted to process Belarusian Wikipedia with preprocess.ipynb, I saw that the phonemization quality is rather poor: in particular, word stress is often wrong, and numbers are not expanded properly into numerals, even though the numerals are listed in be_list. Could you please let me know if there is anything I could help with, in order to add Belarusian to multilingual PL-BERT? (E.g. providing a dictionary of stress patterns for espeak-ng, improving numeral conversion rules, etc.)

iamjamilkhan · Answer 59 · Sun Dec 17 2023 02:19:19 GMT+0800 (China Standard Time)

Please add Hindi support as well

Aaron (Yinghao) Li · Answer 60 · Sun Dec 17 2023 12:38:15 GMT+0800 (China Standard Time)

@somerandomguyontheweb You can join the slack channel and make the dataset yourself if you believe the espeak is bad. I will upload all the dataset I have soon.

Aaron (Yinghao) Li · Answer 61 · Sun Dec 17 2023 12:39:24 GMT+0800 (China Standard Time)

@iamjamilkhan @GayatriVadaparty Hindi and Telugu are already added in multilingual PL-BERT training. I will upload the dataset soon. You can check the quality and let me know if something needs to be fixed.

Gayatri Vadaparty · Answer 62 · Sun Dec 17 2023 16:00:55 GMT+0800 (China Standard Time)

@yl4579 Sure, I’ll do that.

Aaron (Yinghao) Li · Answer 63 · Tue Dec 19 2023 16:04:46 GMT+0800 (China Standard Time)

I have uploaded most of the data I have: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert
Please check if there's anything missing or not ideal. To check whether the IPA is phonemized correctly for your language, you will need to decode the tokens using the https://huggingface.co/bert-base-multilingual-cased tokenizer.
If something is wrong, please let me know. I will probably start multilingual PL-BERT training early next month (Jan 2024). The list of corresponding language codes can be found here: https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md

Sanket Dhuri · Answer 64 · Wed Jan 03 2024 22:18:43 GMT+0800 (China Standard Time)

Please add Marathi support as well

Aaron (Yinghao) Li · Answer 65 · Mon Jan 08 2024 12:45:48 GMT+0800 (China Standard Time)

@SanketDhuri It is already included: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert/tree/main/mr
You may want to check the quality of this data yourself because I don't speak this language.

Antonio Calatrava · Answer 66 · Tue Jan 23 2024 22:18:17 GMT+0800 (China Standard Time)

@yl4579 Did you start the training? I may be able to help with Spanish (Spain) if needed.

Mohamed Khennoussi · Answer 67 · Wed Jan 24 2024 20:47:33 GMT+0800 (China Standard Time)

I am here to help with French if needed !

John · Answer 68 · Wed Jan 24 2024 22:43:13 GMT+0800 (China Standard Time)

My last status: Training of ML-PL-Bert is planned to start during January (did not start yet)
Once that is working the model itself can be trained

paulovasconcellos-hotmart · Answer 69 · Thu Jan 25 2024 21:16:52 GMT+0800 (China Standard Time)

Hello. I'm interested in helping train a PT-BR model. I have corporate resources to do so. Let me know how I can help.

philpav · Answer 70 · Mon Feb 12 2024 22:21:44 GMT+0800 (China Standard Time)

I'd love to see support for German accents like Austrian but I guess there's no dataset available.

Ander González Docasal · Answer 71 · Thu Feb 15 2024 16:37:23 GMT+0800 (China Standard Time)

I could give linguistic support in most Iberian languages: Castilian Spanish, Basque, Catalan, Asturian and Galician.
However, due to the orthographic nature of their respective scripts, using a BERT model based on text could also be enough for synthesising these languages

Abduselam Shaltu · Answer 72 · Sat Feb 17 2024 00:34:25 GMT+0800 (China Standard Time)

Hello! I'm also interested in adding support for the Oromo (orm) language. espeak-ng has a phonemizer for it, although it could be improved.

MIchael Foreston · Answer 73 · Tue Feb 20 2024 09:15:24 GMT+0800 (China Standard Time)

Any chance of including Bulgarian?

Raphael Lenain · Answer 74 · Wed Feb 28 2024 17:47:17 GMT+0800 (China Standard Time)

Hi everyone -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert

Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!

Best of luck, and let me know what you make with this!

Markus Toman · Answer 75 · Tue Mar 05 2024 15:11:38 GMT+0800 (China Standard Time)

This is awesome.
Going to try it.

Unfortunately, it seems we have no language embeddings at the moment, so we can't really train a multilingual model with cross-lingual capabilities?

Raphael Lenain · Answer 76 · Tue Mar 05 2024 17:51:29 GMT+0800 (China Standard Time)

I have actually trained a model which can speak multiple languages, without the need of a language embedding. I guess the model learns implicitly, either based on the phonemisation, or based on the references, to speak with a specific accent

Markus Toman · Answer 77 · Tue Mar 05 2024 19:09:23 GMT+0800 (China Standard Time)

@rlenain interesting, yeah I assume this would work, just a little bit uncomfortable to rely on it doing the right thing when you want one voice in multiple languages.

I thought maybe I could additively augment the style embedding with some language information.
A bit like some early adapter models: keep English at +0 for the existing model, and for the new training data in other languages add the output of a linear layer applied to a one-hot encoding. Just a rough idea without much more thought yet ;)
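
One way to read this rough idea is sketched below. Nothing here exists in StyleTTS2: the language list, the weight values and the function names are all made up, and the "linear layer" is a bias-free matrix whose English column is held at zero so that English style vectors pass through unchanged.

```python
LANGUAGES = ["en", "de", "fr"]  # hypothetical language inventory


def one_hot(lang):
    return [1.0 if lang == name else 0.0 for name in LANGUAGES]


def language_offset(lang, weight):
    """Bias-free linear layer applied to the one-hot language vector.

    weight[i][j] is the contribution of language j to style dimension i.
    """
    vec = one_hot(lang)
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def augment_style(style, lang, weight):
    """Add the language offset to a style embedding."""
    return [s + o for s, o in zip(style, language_offset(lang, weight))]


# Toy 2-dimensional style space; the first column (English) stays at zero.
weight = [
    [0.0, 0.1, -0.2],
    [0.0, 0.3, 0.4],
]
style = [1.0, 2.0]
print(augment_style(style, "en", weight))
print(augment_style(style, "de", weight))
```

In a trained model the non-English columns would be learned parameters; here they are fixed numbers only to show the arithmetic.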

Smithangshu · Answer 78 · Thu Apr 18 2024 02:13:03 GMT+0800 (China Standard Time)

I am a native Bengali speaker from India. Please let me know what kind of help I can offer.

Dmytro-Shvetsov · Answer 79 · Mon May 06 2024 05:28:38 GMT+0800 (China Standard Time)

@rlenain, thank you for your awesome work!
Do I understand correctly that the multilingual PL-BERT is just a starting point to building StyleTTS2 models other than English? Or should it work with other languages out of the box? If yes, could you share insights which parts of the code should be modified for inference pipeline (e.g I assume the phonemizer for the target language, maybe the style audio to be with the speaker of target language)?

Raphael Lenain · Answer 80 · Tue May 07 2024 17:10:46 GMT+0800 (China Standard Time)

You need to further finetune or train from scratch with PL-BERT. It won't work in inference mode only. That's because if you change it, then the outputs of the PL-BERT module will not be "aligned" with other modules that expect the PL-BERT outputs as inputs.

This is generally true with any ML model -- if you change a module, then you need to further train / finetune to be able to get the model to work.

Daniel Kleissl · Answer 81 · Wed Jul 31 2024 19:14:04 GMT+0800 (China Standard Time)

I tried finetuning in German using around 1h of data and using the multilingual BERT, but even training for 50 epochs did not yield a model that could generate coherent text.

The only parameters I changed in the config_ft.yaml were:
batch_size: 2
max_len: 600

diff_epoch and joint_epoch: tried different values, but also used the standard 10 and 30.

What I find curious is that the generated speech regarding tonality and inflection sounds close to the reference, but the content is just gibberish. I thought it might be the data, but maybe someone with a little more experience in fine-tuning can tell me if this might be an issue that isn't data-related?

Also, general question: I am unsure if I need the original LibriTTS Dataset in the data folder for fine-tuning? Because the OOD_texts .txt points to nonexisting files and the way the fine-tune tutorial is written it is not clear if we just need the OOD_texts file or the files it points to as well.

Edit: So after playing around some more I decided to make my own OOD_texts file, and now at least the sentences the model generates are understandable as German. Still, the generation quality is not very high, even using 50 epochs to train. I have around 1h of audio, is this still too little?

mikhail2013ru · Answer 82 · Sat Aug 24 2024 18:25:37 GMT+0800 (China Standard Time)

Hello) Let's make it easier:

Could someone write detailed instructions in English, with an example, on how to prepare a dataset?
How should the audio be cleaned of noise, and which VST plug-ins are needed for a balanced sound?
What duration of audio is needed?
How many epochs are needed?
Does BERT need to be trained separately for the target language?
Could those who have already trained a model in another language show the result and share how they did it?
Did you train on a consumer graphics card or an industrial server?
I want to train a Russian model.
Is it possible to build a dataset from similar-sounding languages?
For example, 25 hours of Russian, 25 hours of Bulgarian, and so on.
I now have 25 hours of audiobook recordings of a pleasant announcer's voice. Can I make a voice model from just this one voice and get high quality?

SmileSky · Answer 83 · Sat Aug 24 2024 19:54:38 GMT+0800 (China Standard Time)

I suggest looking at gpt-so-vits. This open-source multilingual TTS implementation is really great; it is arguably the best Chinese TTS, and it also covers Japanese, English, and Korean.