SabakiHQ / sgf

A library for parsing SGF files.

Issue with Displaying Chinese Characters

MacErlang opened this issue

Hello,

I am using an iMac, and Sabaki seems to have difficulty displaying Chinese characters properly.
This often happens with SGF files that I downloaded from the internet. Following comments in
another thread, I tried adding CA[GB2312] to the file, but it did not work.
A sample file is given below. Can someone enlighten me with a solution?

Many thanks in advance,
Shun

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[£®“ª£©∆À°¢µπ∆À”Γ™µπ∆À
1£Æ∆À
     ∞◊∆‘⁄∫⁄∆Âœ»œ¬µƒ ±∫Ú£¨…˙À¿»Á∫Œƒÿ£ø
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[µƒ∫⁄1µ„£¨∞◊2∂• ±£¨∫⁄3º¥ «∆À£¨];W[rc]C[∞◊4÷ªµ√÷£¨∫⁄5≥‘£¨∞◊∆±ª…±°£]
;B[se]C[[“™µ„\]£∫À¿ªÓŒ Ã‚÷Æ÷–”√µΩ°∞∆À°±µƒµÿ∑Ω∫‹∂‡£¨∆À « π∂‘∑Ωµƒ—€±‰≥…ºŸ—€µƒ ÷∂Œ°£]
N[“™µ„])

The actual file is attached below, with added .txt file extension:

__Vs__9.sgf.txt

I don't think your file is encoded in GB2312. I tried opening your file with that encoding in a text editor and got this:

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[拢庐鈥溌?拢漏鈭喢€掳垄碌蟺鈭喢€鈥澝庘€溾劉碌蟺鈭喢€
1拢脝鈭喢€
     鈭炩棅鈭喢傗€樷亜鈭?鈦勨垎脗艙禄艙卢碌茠聽卤鈭?脷拢篓鈥λ櫭€驴禄脕鈭?艗茠每拢酶
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[碌茠鈭?鈦?1碌鈥灺Bㄢ垶鈼?2鈭傗€⒙犅甭Bㄢ埆鈦?3潞楼聽芦鈭喢€拢篓];W[rc]C[鈭炩棅4梅陋碌鈭毭兟仿Bㄢ埆鈦?5鈮モ€樎Bㄢ垶鈼娾垎脗卤陋鈥β甭奥?]
;B[se]C[[鈥溾劉碌鈥瀄]拢鈭?脌驴陋脫艗聽脙鈥毭访喢封€撯€濃垰碌惟掳鈭炩垎脌掳卤碌茠碌每鈭懳┾埆鈥光垈鈥÷Bㄢ垎脌聽芦聽蟺鈭傗€樷垜惟碌茠鈥斺偓卤鈥扳墺鈥β号糕€斺偓碌茠聽梅鈭偱捖奥?]
N[鈥溾劉碌鈥瀅)

Not only is this complete gibberish, it's not even valid SGF (the last line is missing a ]).

I couldn't find a valid encoding for this; the file may have been corrupted somehow before you got it.
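
For anyone who wants to try candidate encodings themselves, here is a minimal sketch using Node and the iconv-lite package (the candidate list is just a guess, not an exhaustive set):

const fs = require('fs')
const iconv = require('iconv-lite')

// Print the start of the file decoded under several candidate encodings
const buffer = fs.readFileSync('__Vs__9.sgf.txt')
for (const enc of ['GB2312', 'GB18030', 'Big5', 'Shift_JIS', 'UTF-8']) {
  console.log(`--- ${enc} ---`)
  console.log(iconv.decode(buffer, enc).slice(0, 200))
}

If none of these produce readable Chinese, the file was most likely corrupted before it was saved.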

Folks,

I am grateful for your prompt reply and help. I was a bit hasty in the previous post.
I have now found out that the character encoding is GB18030, not GB2312 (which I took
from an earlier related thread). I have the same issue with numerous files. Attached is a
new zip file containing three files: Test-Original.sgf, Test-GB18030.sgf, and Test-Unix.sgf.
As the names suggest, the first file is the original, which does not display properly in
Sabaki (v0.43.3); the second is a revision of the first, obtained by inserting
CA[GB18030] in the first line; and the third is in Unix format, as explained below.

As you will see, the Chinese characters in the first file are scrambled.

The second file does display Chinese characters correctly in Sabaki. This is good news.
However, it is tedious to make such a revision for a ton of files. So it appears that Sabaki
does not "automatically" recognize GB18030 characters. I am a novice, so I am wondering
whether a "simple" remedy exists for this.

The third file was produced by the following process. First, I used BBEdit to create a new, empty
text file, which by default uses the UTF-8 character set and Unix line endings. (Note that
the first two files use the ISO-8859-1 character set and Windows line endings.) Then, I dragged the
original file into a Microsoft Edge or Google Chrome browser window. It turns out that, in both
browsers, the Chinese characters in the original file DO get displayed properly! (This does NOT
work in Safari.) So it appears that these two browsers are able to automatically detect the
GB18030 encoding and hence display the characters properly. Finally, I just copied and pasted the
(legible) browser content into the empty file created by BBEdit and saved it as Test-Unix.sgf.
The resulting file also opens properly in Sabaki without any added character declaration, as
Sabaki apparently detects the UTF-8 encoding automatically. Thus, the third approach also works,
but it is even more tedious.
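
For what it's worth, that whole conversion could be scripted. Below is a minimal sketch in Node, assuming the iconv-lite package and that the files really are GB18030; it writes converted copies instead of overwriting the originals:

const fs = require('fs')
const iconv = require('iconv-lite')

// Convert every .sgf file in the current directory from GB18030 to UTF-8,
// writing each result to a new file with a .utf8.sgf suffix
for (const name of fs.readdirSync('.').filter(f => f.endsWith('.sgf'))) {
  const text = iconv.decode(fs.readFileSync(name), 'GB18030')
  fs.writeFileSync(name.replace(/\.sgf$/, '.utf8.sgf'), text)
}

Since Sabaki detects UTF-8 automatically, the converted copies should then open without any CA declaration.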

So the question is: what might be a "painless" solution? For example, can Sabaki be made to
recognize and properly display GB18030 characters? This would be highly desirable, because
I have found that numerous SGF files on the net have this issue (perhaps because they were
produced by old Windows programs).

Your comments and help are again greatly appreciated.

Best,
Shun

Test.zip

This probably stems from the fact that we only consider the first 100 bytes for character encoding detection, which in this case do not contain enough Chinese characters. When applying jschardet to the entire buffer, it correctly detects the encoding as GB2312.
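
For illustration, a minimal sketch of that difference using jschardet directly (Test-Original.sgf is the file from the attached zip; this is not Sabaki's actual code):

const fs = require('fs')
const jschardet = require('jschardet')

const buffer = fs.readFileSync('Test-Original.sgf')

// Only ASCII SGF syntax falls within the first 100 bytes, so the guess misses
console.log(jschardet.detect(buffer.slice(0, 100)))

// The full buffer includes the Chinese comments, so the guess is correct
console.log(jschardet.detect(buffer))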

@fohristiwhirl I believe you introduced the buffer limit. Can you explain your rationale behind it?

I forget. I think the point might have been that SGF naturally contains a bunch of UTF-8-looking stuff like B[cc]; W[dd] and so on, but the start of the file is more likely to contain names and such.

I seem to recall this was more of an issue for other file formats. e.g. NGF.

If possible, maybe detect the charset using aggregated comments and metadata (e.g. the C, PW, and PB properties), joined together into a single string?
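
Something along those lines, as a rough sketch (the property list, the regex, and the function name are illustrative, not actual Sabaki code):

const jschardet = require('jschardet')

// Splice together the values of text-bearing SGF properties and run
// detection on that, instead of on the raw file prefix
function detectFromTextProperties(buffer) {
  // 'latin1' keeps one byte per character, so the raw bytes survive splicing
  const raw = buffer.toString('latin1')
  const values = raw.match(/\b(?:C|PB|PW|GN|EV|N)\[(?:\\.|[^\\\]])*\]/gs) || []
  const spliced = values.map(m => m.slice(m.indexOf('[') + 1, -1)).join('\n')
  return jschardet.detect(spliced)
}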

@yishn I have checked a few other files, and your assessment seems valid.

Hello,

I just downloaded and installed the new version, and this problem has not been resolved.
In fact, the new version won't even properly display a file that has been explicitly declared to
be GB2312 encoded. I don't know what is going on. Please help!

I have attached two files. One is the original, which won't display properly in either v0.43.3 or
v0.50.1; the other has an added GB2312 declaration, which loads properly in v0.43.3 but
NOT in v0.50.1.

Thanks,
Shun-Chen

1020 Test.zip

Weird, the file with the added GB2312 declaration loads fine for me.

@yishn I have tested several other files with added CA[GB2312], and none of them display
properly. I have no idea why my installation behaves differently, as my Mac binary was
downloaded from the release link.

Also, how do I test your new commit with an increased buffer size? Do I need to compile
Sabaki myself? Thanks for your help.

@yishn I have compiled Sabaki myself, and the issue persists.
I used the commands:

git clone https://github.com/SabakiHQ/Sabaki
cd Sabaki
npm install
npm run build

The compilation seems to have worked fine. The executable and a screenshot are here:

https://www.dropbox.com/s/uqtyg9p0uwmos4i/Sabaki%20Compile.zip?dl=0

Thanks for your help,
Shun-Chen

After investigation, it seems we're accidentally excluding the decoding library from our bundle. This should be fixed on Sabaki master now. Can you pull, rebuild, and see if the problem is now fixed?

@yishn Sorry to bother you again, but the new version still seems to have issues.
I am attaching two files, one named Original.sgf and the other GB2312.sgf. Original.sgf
does not have any character declaration, and the other one does. What appears to be an
anomaly is that Original.sgf loads fine in v0.43.3 but does NOT display properly in v0.50.1.
GB2312.sgf loads fine in both versions of Sabaki.

So, there appears to be a discrepancy between the two Sabaki versions. The attached file
is fairly simple, so this is rather strange. Any ideas?

Best,
Shun

Sample.zip

Hmm... it seems like detecting encoding on spliced test buffers didn't really work. Now we're just falling back to detecting encoding on the first 1000 bytes of the buffer.

In trying to figure out what might have gone wrong, I have inspected lots of files using the
newly compiled version. What is really weird is that the detection does not seem to be consistent.
I have attached two files, one labeled Good and the other Bad. As far as I can tell, the two
files are essentially identical in form, and yet one displays properly and the other does not.
I don't know whether this might help you pin down the issue.

Shun

Samples.zip

Let me add that both files in Samples.zip load properly in v0.43.3.

In v0.43.3, we're guessing encoding based on the first 100 bytes of the files. After extending the encoding guessing to the first 1000 bytes of the file, it doesn't guess GB2312 anymore because, as @fohristiwhirl pointed out, "SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd]".

If we restrict ourselves to the first 100 bytes again, your original file would have issues, because its first 100 bytes don't contain any Chinese characters.

In the short term, we can probably just pick something between 100 and 1000 bytes and guess the encoding based on that. In the long term, we should let the user pick their own encoding.
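
A rough sketch of what that short-term compromise might look like, assuming jschardet and iconv-lite; the 300-byte figure and the function name are illustrative, not Sabaki's actual code:

const jschardet = require('jschardet')
const iconv = require('iconv-lite')

function decodeSgf(buffer) {
  // 'latin1' keeps one byte per character, so raw bytes survive the regex
  const raw = buffer.toString('latin1')

  // An explicit CA[...] declaration wins over any detection
  const ca = raw.match(/CA\[([^\]]+)\]/)
  let encoding = ca != null && iconv.encodingExists(ca[1]) ? ca[1] : null

  // Otherwise, guess on a mid-sized prefix and fall back to UTF-8
  if (encoding == null) {
    const guess = jschardet.detect(buffer.slice(0, 300))
    encoding =
      guess.encoding != null && iconv.encodingExists(guess.encoding)
        ? guess.encoding
        : 'utf8'
  }

  return iconv.decode(buffer, encoding)
}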

Does it make sense to let the detection scheme focus on the C[], PB[], and PW[] fields? These are areas where a different encoding might make a difference (especially C[]).

Yes, that is what we were doing before, using spliced test buffers. But it doesn't work, as evidenced by your previous samples: the encodings detected on the spliced test buffers were completely wrong.

@yishn I have tested some more files, and I am attaching four of them. These are all original
files I downloaded from the web; the only exception is that I added CA[GB2312] to the file
"倒垂莲(共二变)- GB2312.sgf". All of these files have trouble with either v0.50.1 or v0.43.3, in that
most of them hang Sabaki. However, I have found that they ALL load just fine in v0.35.1.
So I am wondering whether the "older" scheme used in v0.35.1 might be more robust than what
has been implemented in the recent versions?

BTW, I compiled the latest version, but noticed that the new option for user encoding selection
does not seem to have made it in.

Best,
Shun

Test Files.zip

This has nothing to do with encoding, so please open a new issue on Sabaki's repository about the hanging. FYI, the new option for user encoding selection is not implemented yet; it's an open issue, so please subscribe to it for updates.

Thanks, just posted there.