SabakiHQ / sgf

A library for parsing SGF files.

Issue with Displaying Chinese Characters

MacErlang opened this issue

Hello,

I am using an iMac, and Sabaki seems to have difficulty displaying Chinese characters properly.
This often happens with SGF files that I downloaded from the internet. Following comments in
another thread, I tried adding CA[GB2312] to the file, but it did not work.
A sample file is given below. Can someone enlighten me with a solution?

Many thanks in advance,
Shun

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[£®“ª£©∆À°¢µπ∆À”Γ™µπ∆À
1£Æ∆À
     ∞◊∆‘⁄∫⁄∆Âœ»œ¬µƒ ±∫Ú£¨…˙À¿»Á∫Œƒÿ£ø
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[µƒ∫⁄1µ„£¨∞◊2∂• ±£¨∫⁄3º¥ «∆À£¨];W[rc]C[∞◊4÷ªµ√÷£¨∫⁄5≥‘£¨∞◊∆±ª…±°£]
;B[se]C[[“™µ„\]£∫À¿ªÓŒ Ã‚÷Æ÷–”√µΩ°∞∆À°±µƒµÿ∑Ω∫‹∂‡£¨∆À « π∂‘∑Ωµƒ—€±‰≥…ºŸ—€µƒ ÷∂Œ°£]
N[“™µ„])

The actual file is attached below, with added .txt file extension:

__Vs__9.sgf.txt

I don't think your file is encoded in GB2312. I tried opening your file with that encoding in a text editor and got this:

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[拢庐鈥溌?拢漏鈭喢€掳垄碌蟺鈭喢€鈥澝庘€溾劉碌蟺鈭喢€
1拢脝鈭喢€
     鈭炩棅鈭喢傗€樷亜鈭?鈦勨垎脗艙禄艙卢碌茠聽卤鈭?脷拢篓鈥λ櫭€驴禄脕鈭?艗茠每拢酶
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[碌茠鈭?鈦?1碌鈥灺Bㄢ垶鈼?2鈭傗€⒙犅甭Bㄢ埆鈦?3潞楼聽芦鈭喢€拢篓];W[rc]C[鈭炩棅4梅陋碌鈭毭兟仿Bㄢ埆鈦?5鈮モ€樎Bㄢ垶鈼娾垎脗卤陋鈥β甭奥?]
;B[se]C[[鈥溾劉碌鈥瀄]拢鈭?脌驴陋脫艗聽脙鈥毭访喢封€撯€濃垰碌惟掳鈭炩垎脌掳卤碌茠碌每鈭懳┾埆鈥光垈鈥÷Bㄢ垎脌聽芦聽蟺鈭傗€樷垜惟碌茠鈥斺偓卤鈥扳墺鈥β号糕€斺偓碌茠聽梅鈭偱捖奥?]
N[鈥溾劉碌鈥瀅)

Not only is this complete gibberish, it's not even valid SGF (the last line is missing a ]).

I couldn't find a valid encoding for this; the file may have been corrupted somehow before you got it.
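
For anyone who wants to try candidate encodings themselves, here is a minimal sketch using Node and the iconv-lite package (the candidate list is just a guess, not an exhaustive set):

const fs = require('fs')
const iconv = require('iconv-lite')

// Print the start of the file decoded under several candidate encodings
const buffer = fs.readFileSync('__Vs__9.sgf.txt')
for (const enc of ['GB2312', 'GB18030', 'Big5', 'Shift_JIS', 'UTF-8']) {
  console.log(`--- ${enc} ---`)
  console.log(iconv.decode(buffer, enc).slice(0, 200))
}

If none of these produce readable Chinese, the file was most likely corrupted before it was saved.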

Folks,

I am grateful for your prompt reply and help. I was a bit hasty in the previous post.
I have now found out that the character encoding is GB18030, not GB2312 (which I took
from an earlier related thread). I have the same issue with numerous files. Attached is a
new zip file containing three files: Test-Original.sgf, Test-GB18030.sgf, and Test-Unix.sgf.
As the names suggest, the first file is the original, which does not display properly in
Sabaki (v0.43.3); the second is a revision of the first, obtained by inserting
CA[GB18030] in the first line; and the third is in Unix format, as explained below.

As you will see, the Chinese characters in the first file are scrambled.

The second file does display Chinese characters correctly in Sabaki. This is good news.
However, it is tedious to make such a revision for a ton of files. So it appears that Sabaki
does not "automatically" recognize GB18030 characters. I am a novice, so I am wondering
whether a "simple" remedy exists for this.

The third file was produced by the following process. First, I used BBEdit to create a new, empty
text file, which by default uses the UTF-8 character set and Unix line endings. (Note that
the first two files use the ISO-8859-1 character set and Windows line endings.) Then, I dragged the
original file into a Microsoft Edge or Google Chrome browser window. It turns out that, in both
browsers, the Chinese characters in the original file DO get displayed properly! (This does NOT
work in Safari.) So it appears that these two browsers are able to automatically detect the
GB18030 encoding and hence display the characters properly. Finally, I just copied and pasted the
(legible) browser content into the empty file created by BBEdit and saved it as Test-Unix.sgf.
The resulting file also opens properly in Sabaki without any added character declaration, as
Sabaki apparently detects the UTF-8 encoding automatically. Thus, the third approach also works,
but it is even more tedious.
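
For what it's worth, that whole conversion could be scripted. Below is a minimal sketch in Node, assuming the iconv-lite package and that the files really are GB18030; it writes converted copies instead of overwriting the originals:

const fs = require('fs')
const iconv = require('iconv-lite')

// Convert every .sgf file in the current directory from GB18030 to UTF-8,
// writing each result to a new file with a .utf8.sgf suffix
for (const name of fs.readdirSync('.').filter(f => f.endsWith('.sgf'))) {
  const text = iconv.decode(fs.readFileSync(name), 'GB18030')
  fs.writeFileSync(name.replace(/\.sgf$/, '.utf8.sgf'), text)
}

Since Sabaki detects UTF-8 automatically, the converted copies should then open without any CA declaration.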

So the question is: what might be a "painless" solution? For example, can Sabaki be made to
recognize and properly display GB18030 characters? This would be highly desirable, because
I have found that numerous SGF files on the net have this issue (perhaps because they were
produced by old Windows programs).

Your comments and help are again greatly appreciated.

Best,
Shun

Test.zip

This probably stems from the fact that we only consider the first 100 bytes for character encoding detection, which in this case do not contain enough Chinese characters. When applying jschardet to the entire buffer, it correctly detects the encoding as GB2312.
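
For illustration, a minimal sketch of that difference using jschardet directly (Test-Original.sgf is the file from the attached zip; this is not Sabaki's actual code):

const fs = require('fs')
const jschardet = require('jschardet')

const buffer = fs.readFileSync('Test-Original.sgf')

// Only ASCII SGF syntax falls within the first 100 bytes, so the guess misses
console.log(jschardet.detect(buffer.slice(0, 100)))

// The full buffer includes the Chinese comments, so the guess is correct
console.log(jschardet.detect(buffer))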

@fohristiwhirl I believe you introduced the buffer limit. Can you explain your rationale behind it?

I forget. I think the point might have been that SGF naturally contains a bunch of UTF-8-looking stuff like B[cc]; W[dd] and so on, but the start of the file is more likely to contain names and such.

I seem to recall this was more of an issue for other file formats. e.g. NGF.

If possible, maybe detect the charset using aggregated comments and metadata (e.g. the C, PW, and PB properties), joined together into a single string?
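
Something along those lines, as a rough sketch (the property list, the regex, and the function name are illustrative, not actual Sabaki code):

const jschardet = require('jschardet')

// Splice together the values of text-bearing SGF properties and run
// detection on that, instead of on the raw file prefix
function detectFromTextProperties(buffer) {
  // 'latin1' keeps one byte per character, so the raw bytes survive splicing
  const raw = buffer.toString('latin1')
  const values = raw.match(/\b(?:C|PB|PW|GN|EV|N)\[(?:\\.|[^\\\]])*\]/gs) || []
  const spliced = values.map(m => m.slice(m.indexOf('[') + 1, -1)).join('\n')
  return jschardet.detect(spliced)
}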

@yishn I have checked a few other files, and your assessment seems valid.

Hello,

I just downloaded and installed the new version, and this problem has not been resolved.
In fact, the new version won't even properly display a file that has been explicitly declared to
be GB2312 encoded. I don't know what is going on. Please help!

I have attached two files. One is the original, which won't display properly in either v0.43.3 or
v0.50.1; the other has an added GB2312 declaration, which loads properly in v0.43.3 but
NOT in v0.50.1.

Thanks,
Shun-Chen

1020 Test.zip

Weird, the file with the added GB2312 declaration loads fine for me.

@yishn I have tested several other files with added CA[GB2312], and none of them display
properly. I have no idea why my installation behaves differently, as my Mac binary was
downloaded from the release link.

Also, how do I test your new commit with an increased buffer size? Do I need to compile
Sabaki myself? Thanks for your help.

@yishn I have compiled Sabaki myself, and the issue persists.
I used the commands:

git clone https://github.com/SabakiHQ/Sabaki
cd Sabaki
npm install
npm run build

The compilation seems to have worked fine. The executable and a screenshot are here:

https://www.dropbox.com/s/uqtyg9p0uwmos4i/Sabaki%20Compile.zip?dl=0

Thanks for your help,
Shun-Chen

After investigation, it seems we're accidentally excluding the decoding library from our bundle. This should be fixed on Sabaki master now. Can you pull, rebuild, and see if the problem is now fixed?

@yishn Sorry to bother you again, but the new version still seems to have issues.
I am attaching two files, one named Original.sgf and the other GB2312.sgf. Original.sgf
does not have any character declaration, and the other one does. What appears to be an
anomaly is that Original.sgf loads fine in v0.43.3 but does NOT display properly in v0.50.1.
GB2312.sgf loads fine in both versions of Sabaki.

So, there appears to be a discrepancy between the two Sabaki versions. The attached file
is fairly simple, so this is rather strange. Any ideas?

Best,
Shun

Sample.zip

Hmm... it seems like detecting encoding on spliced test buffers didn't really work. Now we're just falling back to detecting encoding on the first 1000 bytes of the buffer.

In trying to figure out what might have gone wrong, I have inspected lots of files using the
newly compiled version. What is really weird is that the detection does not seem to be consistent.
I have attached two files, one labeled Good and the other Bad. As far as I can tell, the two
files are essentially identical in form, and yet one displays properly and the other does not.
I don't know whether this might help you pin down the issue.

Shun

Samples.zip

Let me add that both files in Samples.zip load properly in v0.43.3.

In v0.43.3, we're guessing encoding based on the first 100 bytes of the files. After extending the encoding guessing to the first 1000 bytes of the file, it doesn't guess GB2312 anymore because, as @fohristiwhirl pointed out, "SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd]".

If we restrict ourselves to the first 100 bytes again, your original file would have issues, because its first 100 bytes don't contain any Chinese characters.

In the short term, we can probably just pick something between 100 and 1000 bytes and guess the encoding based on that. In the long term, we should let the user pick their own encoding.
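
A rough sketch of what that short-term compromise might look like, assuming jschardet and iconv-lite; the 300-byte figure and the function name are illustrative, not Sabaki's actual code:

const jschardet = require('jschardet')
const iconv = require('iconv-lite')

function decodeSgf(buffer) {
  // 'latin1' keeps one byte per character, so raw bytes survive the regex
  const raw = buffer.toString('latin1')

  // An explicit CA[...] declaration wins over any detection
  const ca = raw.match(/CA\[([^\]]+)\]/)
  let encoding = ca != null && iconv.encodingExists(ca[1]) ? ca[1] : null

  // Otherwise, guess on a mid-sized prefix and fall back to UTF-8
  if (encoding == null) {
    const guess = jschardet.detect(buffer.slice(0, 300))
    encoding =
      guess.encoding != null && iconv.encodingExists(guess.encoding)
        ? guess.encoding
        : 'utf8'
  }

  return iconv.decode(buffer, encoding)
}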

Does it make sense to let the detection scheme focus on the C[], PB[], and PW[] fields? These are areas where a different encoding might make a difference (especially C[]).

Yes, that is what we were doing before, using spliced test buffers. But it doesn't work, as evidenced by your previous samples: the encodings detected on the spliced test buffers were completely wrong.

@yishn I have tested some more files, and I am attaching four of them. These are all original
files I downloaded from the web; the only exception is that I added CA[GB2312] to the file
"倒垂莲(共二变)- GB2312.sgf". All of these files have trouble with either v0.50.1 or v0.43.3, in that
most of them hang Sabaki. However, I have found that they ALL load just fine in v0.35.1.
So I am wondering whether the "older" scheme used in v0.35.1 might be more robust than what
has been implemented in the recent versions?

BTW, I compiled the latest version, but noticed that the new option for user encoding selection
does not seem to have made it in.

Best,
Shun

Test Files.zip

This has nothing to do with encoding, so please open a new issue on Sabaki's repository about the hanging. FYI, the new option for user encoding selection is not implemented yet; it's an open issue, so please subscribe to it for updates.

Thanks, just posted there.