r9y9 / nnmnkwii

Library to build speech synthesis systems designed for easy and fast prototyping.

Home Page: https://r9y9.github.io/nnmnkwii/latest/

Improved support for labels

karandwivedi42 opened this issue · comments

Hi

Thanks for writing this useful library! I have been trying it for a few days and have felt the need for better support for non-HTS labels.

It would be good to have something like this: https://github.com/facebookresearch/loop/blob/master/utils.py#L143, which does not depend on label files and uses nltk's cmudict to generate phonemes.

I can contribute if you guide me.

My current workaround is to use merlin's scripts to generate train and test labels for use with your code.

Hi, thank you very much for your feedback. Yes, support for non-HTS labels would be great to have.

https://github.com/facebookresearch/loop/blob/master/utils.py#L143, which does not depend on label files and uses nltk's cmudict to generate phonemes.

It seems there's no nltk in utils.py? Could you elaborate on what you want?

FWIW, the reason I started writing support for HTS labels is that merlin's frontend assumes the input is HTS-style labels.

Thanks!

In facebookresearch/loop's generate.py, we can give any user sentence as input. It then uses nltk to generate the phonemes.

I am also curious that loop does not involve durations at any point, yet is able to generate good output.

Okay, I see. In that case, isn't nltk enough? It would be just 10 lines of code. Also, I tend to think the library should be language-independent, though text2phone is highly language- (i.e. phoneme dictionary-) dependent. What do you think?
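Something along these lines, perhaps (a rough sketch only: the function name, punctuation handling, and out-of-vocabulary fallback are my own illustrative choices, not code from loop or nnmnkwii):

# Illustrative sketch: convert a sentence to CMUdict phonemes with nltk.
# Requires the cmudict corpus, e.g. via nltk.download("cmudict").
from nltk.corpus import cmudict

_pronunciations = cmudict.dict()

def text_to_phonemes(text):
    """Look up each word in CMUdict and return a flat phoneme list."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?;:'\"")
        if word in _pronunciations:
            # Take the first pronunciation variant and drop the stress digits.
            phonemes.extend(p.rstrip("012") for p in _pronunciations[word][0])
        else:
            # Out of vocabulary: keep the raw word as a placeholder.
            phonemes.append(word)
    return phonemes

print(text_to_phonemes("Hello world"))
# e.g. ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']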

Yes, it is. I don't have much experience in speech, so I don't know whether nltk (or a similar library) supports other languages.

I am still trying to understand how facebook's loop uses text and audio features. I think the attention mechanism allows it to work without forced alignment, which is why the dataset gives phonemes as input and audio_features as the target even though they have different shapes:

 ('phonemes', (21,)),
 ('audio_features', (279, 63)),

This type of processing completely removes the need to include merlin/hts/htk/sptk: we can use pyworld for audio feature extraction and synthesis, and nltk for converting text to phonemes.
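For reference, a minimal pyworld analysis/resynthesis sketch of what I mean (the file names and the soundfile import for I/O are just illustrative assumptions):

# Minimal WORLD analysis/resynthesis round trip with pyworld (illustrative).
import numpy as np
import pyworld
import soundfile as sf  # assumed here only for reading/writing wav files

x, fs = sf.read("example.wav")                # mono waveform
x = np.ascontiguousarray(x, dtype=np.float64)

f0, sp, ap = pyworld.wav2world(x, fs)         # F0, spectral envelope, aperiodicity
y = pyworld.synthesize(f0, sp, ap, fs)        # resynthesize from WORLD features

sf.write("resynthesized.wav", y, fs)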

This sort of pipeline serves a somewhat different purpose (seq-to-seq models) from the ones in your notebooks/merlin (which have a one-to-one mapping between input and output frames), but I am sure it would be a good addition to your library, as both loop and parrot use a somewhat similar approach.
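To make the contrast concrete, a toy sketch (the shapes not printed above are made up for illustration; this is not loop's or nnmnkwii's actual API):

import numpy as np

# Frame-aligned (merlin-style) pair: one linguistic feature vector per acoustic
# frame, so input and target share the time axis (dimensions are illustrative).
aligned_input = np.zeros((279, 425))    # 279 frames x 425 linguistic features
aligned_target = np.zeros((279, 63))    # 279 frames x 63 acoustic features

# Seq-to-seq (loop-style) pair: lengths are unrelated, as in the shapes above;
# the attention mechanism learns the mapping between the two time axes.
phonemes = np.zeros((21,), dtype=np.int64)  # 21 phoneme IDs
audio_features = np.zeros((279, 63))        # 279 frames x 63 acoustic features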

What do you think?

As far as I understand, loop uses raw text features similarly to Tacotron. The attention mechanism learns the alignment between the text and the audio features.
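As a rough illustration of that idea, here is generic dot-product attention in plain numpy (loop's actual attention mechanism differs, so treat this only as a sketch of how an alignment can be learned without durations):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy encoder outputs for 21 phonemes and decoder states for 279 output frames.
rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((21, 64))      # (text steps, hidden dim)
decoder_states = rng.standard_normal((279, 64))  # (audio frames, hidden dim)

# Each output frame gets a distribution over the phonemes, so no precomputed
# forced alignment (durations) is required.
scores = decoder_states @ encoder_out.T  # (279, 21)
alignment = softmax(scores, axis=-1)     # rows sum to 1
context = alignment @ encoder_out        # (279, 64) per-frame context vectors
print(alignment.shape, context.shape)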

This type of processing completely removes the need to include merlin/hts/htk/sptk: we can use pyworld for audio feature extraction and synthesis, and nltk for converting text to phonemes.

You are right. I completely agree.

This sort of pipeline serves a somewhat different purpose from the ones in your notebooks/merlin, but I am sure it would be a good addition to your library, as both loop and parrot use a somewhat similar approach.

I plan to consider the end-to-end speech synthesis paradigm in the design (see #9 and #3 for reference), so contributions toward it are very welcome! Personally, from my experience working on Tacotron (https://github.com/r9y9/tacotron_pytorch), I didn't think there was any must-have functionality we should add, but I should probably think about it again more carefully and also look at existing code bases, as you pointed out. Thank you!

I agree that nltk's cmudict can only convert words that are in its dictionary, which is very limiting. However, it removes the festival dependency, which is a big plus. Is there any other way to convert text to phonemes without needing festival?
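A quick illustration of that limitation (the character fallback is just one possible workaround, not something loop or nnmnkwii does):

from nltk.corpus import cmudict

pron = cmudict.dict()

for word in ["hello", "nnmnkwii"]:
    if word in pron:
        print(word, "->", pron[word][0])  # first pronunciation variant
    else:
        # Out of vocabulary: cmudict has no entry, so a grapheme-to-phoneme
        # model or a character-level fallback would be needed instead.
        print(word, "-> not in cmudict; falling back to characters:", list(word))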

If you just need a character-level numeric representation of text, rather than the structural information that festival can annotate, maybe https://github.com/keithito/tacotron/tree/master/text would be enough?

In [1]: from text import sequence_to_text, text_to_sequence

In [2]: sequence = text_to_sequence("Hello world", ["english_cleaners"])

In [3]: print(sequence)
[35, 32, 39, 39, 42, 64, 50, 42, 45, 39, 31, 1]

In [4]: print(sequence_to_text(sequence))
hello world~

EDIT: oops, sorry, I may have misunderstood your question. Tacotron uses a char-level representation, but loop uses a phoneme-level representation. The attached code won't work for phonemes.