Mandarin Chinese garbled

Question

wingdi opened this issue 3 years ago · comments

I am using the pre-trained model deepspeech-0.9.3-models-zh-CN.pbmm, and its output is garbled whether I treat it as GBK or UTF-8.
Is there any way to fix this?

This is the generated JSON:

When I print the word value:
word = '\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94'

print(word)

I get:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-15: surrogates not allowed
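
For context, here is a minimal sketch of why the print call fails, assuming the string is exactly the one shown above. Every code point falls in the range U+DC80 to U+DCFF. These are lone surrogates, which Python uses to carry undecodable bytes inside a str (the `surrogateescape` error handler), and a strict UTF-8 encoder refuses them. The helper name `is_escaped_byte` is hypothetical.

```python
# The value reported in this issue, split only for line length.
word = ('\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba'
        '\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94')


def is_escaped_byte(ch: str) -> bool:
    """Return True if ch is a lone surrogate in U+DC80..U+DCFF.

    That range is what the surrogateescape handler produces for
    bytes 0x80..0xFF that could not be decoded.
    """
    return 0xDC80 <= ord(ch) <= 0xDCFF


# True when the whole string is made of escaped bytes, not real characters.
all_escaped = all(is_escaped_byte(ch) for ch in word)

# A strict encode (what print ends up doing) rejects lone surrogates.
try:
    word.encode('utf-8')
    strict_encode_failed = False
    error_reason = None
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    error_reason = exc.reason
```

If `all_escaped` is true, the string holds raw bytes in disguise rather than Mandarin characters, which would explain why neither GBK nor UTF-8 output looks right.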

OS: Windows 10; Python Server; PyCharm (UTF-8)

Why was this issue closed? I couldn't find a solution on Discourse, and similar questions there and here never got one. I think this is a bug and it needs to be fixed ...

lissyx · Answer 1 · Mon Mar 15 2021 15:33:24 GMT+0800 (China Standard Time)

Why was this issue closed? I couldn't find a solution on Discourse, and similar questions there and here never got one. I think this is a bug and it needs to be fixed ...

Please follow the guidelines for requesting support and use Discourse.

lissyx · Answer 2 · Mon Mar 15 2021 16:10:40 GMT+0800 (China Standard Time)

word = '\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94'

I'm no Mandarin speaker, but this looks like valid UTF-8 for that language.

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-15: surrogates not allowed

This sounds more like a Python / UTF-8 / Windows issue.
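
One possible workaround on the Python side, sketched under the assumption that the surrogates are bytes escaped by Python's `surrogateescape` handler: encode with that same handler to get the raw bytes back, then decode them as UTF-8. Some byte sequences in this particular string are incomplete, so the decode uses a tolerant error handler instead of the strict default.

```python
# The value reported in this issue, split only for line length.
word = ('\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba'
        '\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94')

# Map each U+DCxx surrogate back to the single byte 0xxx it stands for.
raw = word.encode('utf-8', errors='surrogateescape')

# Decode the bytes as UTF-8; incomplete sequences become U+FFFD
# instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', errors='replace')

# Same decode, but silently dropping the bytes that cannot be decoded.
text_clean = raw.decode('utf-8', errors='ignore')
```

For this string the recovered bytes contain a few complete characters (for example 平 and 汉) mixed with truncated sequences. The thread does not show why the sequences are truncated, so this only makes the value printable; it does not restore the missing characters.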

lissyx · Answer 3 · Mon Mar 15 2021 16:13:27 GMT+0800 (China Standard Time)

Since you don't care about sharing the requested information, I can only speculate about the code you wrote, but this seems to be your issue and a valid solution: https://stackoverflow.com/a/62935174

wingdi · Answer 4 · Tue Mar 16 2021 17:51:57 GMT+0800 (China Standard Time)

Since you don't care about sharing the requested information, I can only speculate about the code you wrote, but this seems to be your issue and a valid solution: https://stackoverflow.com/a/62935174

Thanks for your reply.
word = '\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94'
This cannot be transcoded to Mandarin.

https://stackoverflow.com/a/62935174: this link doesn't give a solution.

lissyx · Answer 5 · Tue Mar 16 2021 18:05:45 GMT+0800 (China Standard Time)

Since you don't care about sharing the requested information, I can only speculate about the code you wrote, but this seems to be your issue and a valid solution: https://stackoverflow.com/a/62935174

Thanks for your reply.
word = '\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94'
This cannot be transcoded to Mandarin.

As I said, I'm not a speaker, but this looks like a valid Python string.

https://stackoverflow.com/a/62935174: this link doesn't give a solution.

Again, please use Discourse.

wingdi · Answer 6 · Tue Mar 16 2021 23:23:39 GMT+0800 (China Standard Time)

I am almost certain that '\udce5\udcb8\udce5\udce5\udcb9\udcb3\udce6\udcba\udca6\udce5\udc8f\udce6\udcb1\udc89\udce5\udc94' is not valid.
This is the Unicode range used to encode Mandarin:

Thanks.

lissyx · Answer 7 · Tue Mar 16 2021 23:32:40 GMT+0800 (China Standard Time)

Maybe. But until you follow the Discourse steps for requesting support (https://discourse.mozilla.org/t/what-and-how-to-report-if-you-need-support/62071):

we don't know what you did,
we don't know how you did it,
other people have reported successfully using the Mandarin model,
this model is provided as experimental.

There's basically nothing I can do to help. Please follow the steps on Discourse.