Does this phonemizer support mixed-language input?

Question

JohnHerry opened this issue 9 months ago · comments

Is your feature request related to a problem? Please describe.
Does this phonemizer support mixed-language input? E.g. "我想买一部iphone。"

Describe the solution you'd like
The IPA phonemes of this sentence as output, with a guarantee that there are no syllable conflicts.

Describe alternatives you've considered

Additional context
We would also like a mapping between each input token and its IPA transcription.
E.g. {"我": [IPA list of 我], "iphone": [IPA list of iphone]}

Mathieu Bernard · Answer 1 · Wed Oct 11 2023 16:51:31 GMT+0800 (China Standard Time)

Hi, phonemizer (with the espeak backend) can detect language switches, mostly to English. But this is quite limited: you cannot specify which languages are involved, or which part of the text is in which language. See the language_switch option at https://bootphon.github.io/phonemizer/api_reference.html.

$ echo '我想买一部iphone。' | phonemize -l cmn -b espeak -w '; '
[WARNING] 1 utterances containing language switches on lines 1
[WARNING] extra phones may appear in the "cmn" phoneset
[WARNING] language switch flags have been kept (applying "keep-flags" policy)
[WARNING] words count mismatch on 100.0% of the lines (1/1)
wo2; ɕiɑ2ŋ; mai2; ji5; pu5; (en)aɪfəʊn(zh);

For the mapping word -> IPA, this is not implemented but already a feature request, see #96.
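
Until such a mapping exists, the language-switch flags in the output above can be post-processed by hand. The following is only a rough sketch, not part of phonemizer: the helper name tag_words and the regular expression for flags such as (en) and (zh) are assumptions based solely on the sample output shown here, and it assumes the word separator was set to "; " as in the command above.

```python
import re

# Assumed flag format, inferred from the sample output: a short lowercase
# language code in parentheses, e.g. "(en)" or "(zh)".
FLAG = re.compile(r"\(([a-z]{2,3}(?:-[a-z0-9]+)*)\)")


def tag_words(phonemized, default_lang, word_sep=";"):
    """Return (language, word) pairs from a phonemized line with kept flags.

    Text before the first flag is attributed to default_lang; each flag
    switches the language for everything that follows it.
    """
    tagged = []

    def emit(chunk, lang):
        # Split the chunk on the word separator and drop empty pieces.
        for word in chunk.split(word_sep):
            word = word.strip()
            if word:
                tagged.append((lang, word))

    lang = default_lang
    pos = 0
    for match in FLAG.finditer(phonemized):
        emit(phonemized[pos:match.start()], lang)
        lang = match.group(1)
        pos = match.end()
    emit(phonemized[pos:], lang)
    return tagged


if __name__ == "__main__":
    line = "wo2; ɕiɑ2ŋ; mai2; ji5; pu5; (en)aɪfəʊn(zh); "
    for lang, word in tag_words(line, default_lang="cmn"):
        print(lang, word)
```

This only labels the phonemized words by language. It does not align them with the input characters, which is exactly where the "words count mismatch" warning above shows the difficulty.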

JohnHerry · Answer 2 · Thu Oct 12 2023 11:47:55 GMT+0800 (China Standard Time)

Thanks for the help. By the way, I guess the IPA output in the example contains tone symbols, but it looks strange: the two characters 一部 (ji5; pu5;) get the same tone "5", whereas as a native Mandarin speaker I think they should differ. Is there a bug in the relevant module?

I have another question: is there any way to get the full IPA alphabet? We would like an IPA alphabet design that supports multilingual expression.

Third question: how does phonemizer handle polyphones? Many Mandarin characters have multiple Pinyin readings, and the correct one is determined by the context in which the character appears. E.g. the character "着" is read "zhe" in "走着" but "zhao" in "着火". I think the IPA transcriptions should differ as well. How does phonemizer handle this problem? With LM-based prediction?

Mathieu Bernard · Answer 3 · Thu Oct 12 2023 16:30:10 GMT+0800 (China Standard Time)

Your questions are all related to the espeak-ng backend, not phonemizer itself, which is a "simple" wrapper. Please go there to look for answers. For example https://github.com/espeak-ng/espeak-ng/issues?q=mandarin and https://github.com/espeak-ng/espeak-ng/blob/master/dictsource/cmn_list.
Best.

JohnHerry · Answer 4 · Thu Oct 12 2023 17:52:22 GMT+0800 (China Standard Time)

Thank you very much