Support GB18030

Question

puzzlet opened this issue 11 years ago

정 경훈 (Kyung-hown Chung) commented 11 years ago

GB18030 is a superset of Chinese encoding GB2312, which charade already supports.

Like #10, we can support this by:

- renaming GB2312-related classes and constants to GB18030, and
- patching the byte-sequence state machine in mbcssm.py.
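
The superset relationship can be checked with Python's built-in codecs. This is a minimal sketch, assuming Python 3 and the standard `gb2312` and `gb18030` codecs; the sample text is arbitrary and only uses characters that GB2312 covers.

```python
# Text that uses only characters available in GB2312.
sample = "中文编码"

gb2312_bytes = sample.encode("gb2312")
gb18030_bytes = sample.encode("gb18030")

# GB18030 keeps the GB2312 two-byte codes, so the byte strings match.
same_bytes = gb2312_bytes == gb18030_bytes

# GB2312 output can therefore be read by a GB18030 decoder unchanged.
round_trip = gb2312_bytes.decode("gb18030") == sample
```

Because the bytes are identical for GB2312 text, a detector that reports GB18030 in place of GB2312 should not break decoding of that text.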

Ian Stapleton Cordasco · Answer 1 · Thu Jan 24 2013 22:32:17 GMT+0800 (China Standard Time)

This, however, I will gladly rename and make note that GB2312 is superseded by GB18030 since it's actually an official standard.

Ian Stapleton Cordasco · Answer 2 · Fri Jan 25 2013 13:17:10 GMT+0800 (China Standard Time)

We will probably also need to add to this to make sure GB18030 is covered as well, since it is an expansion of GB2312, right? I don't mind replacing GB2312 in this case, but I want to make sure we cover GB18030 properly.

정 경훈 (Kyung-hown Chung) · Answer 3 · Thu Jan 31 2013 23:32:03 GMT+0800 (China Standard Time)

The module's current GB2312 detection seems flawed: it identifies strings containing non-GB2312 characters as GB2312:

# from http://lifesinger.github.com/lab/2009/loadtime/test_hubble.html
>>> x = u'(人工の測量,1～2センチメートルの誤差があるかもしれない 尺码标注仅供参考，不能作为退货理由）'
>>> x.encode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u6e2c' in position 4: illegal multibyte sequence
>>> import charade; charade.detect(x.encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}

But it's not a correct GB18030 detector either, as it rules out some unusual but valid byte sequences:

# from http://zh.wikipedia.org/zh-cn/%C4%90
>>> x = u'Đ, đ（d-stroke）是标准越南语跟标准克罗地亚语和波斯尼亚语所使用的字母，但两者所表示的音位完全不同。它是越南语字母表的第 7 个字母、克罗地亚语字母表的第 8 个字母。'
>>> charade.detect(x.encode('GB18030'))
{'confidence': 0.22369399198761675, 'encoding': 'ISO-8859-2'}
>>> charade.detect(x[4:].encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
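
The two characters behind these reports encode very differently, which may explain the results. The sketch below assumes Python 3 and the standard codecs; `gb18030_length` and `in_gb2312` are hypothetical helpers written for this illustration, not part of charade.

```python
def gb18030_length(char):
    """Return how many bytes one character occupies in GB18030."""
    return len(char.encode("gb18030"))


def in_gb2312(char):
    """Return True if the character can be encoded as GB2312."""
    try:
        char.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


# U+6E2C (測) from the first report: outside GB2312, two bytes in GB18030.
ce_in_gb2312 = in_gb2312("\u6e2c")
ce_length = gb18030_length("\u6e2c")

# U+0110 (Đ) from the second report: a four-byte GB18030 sequence.
d_stroke = "\u0110".encode("gb18030")
d_stroke_length = len(d_stroke)

# The second byte of a four-byte sequence lies in 0x30-0x39 (ASCII digits),
# a range that a two-byte GB2312 trail byte never uses.
second_byte_is_digit = 0x30 <= d_stroke[1] <= 0x39
```

A state machine that only knows two-byte sequences would presumably treat the digit-range second byte as an error, which is consistent with the detector falling back to ISO-8859-2 once Đ is included.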

Ian Stapleton Cordasco · Answer 4 · Thu Jan 31 2013 23:42:41 GMT+0800 (China Standard Time)

Thanks for the extra information @puzzlet. I'm hoping to have some time for charade this weekend or next.

Ian Stapleton Cordasco · Answer 5 · Tue Feb 26 2013 11:53:48 GMT+0800 (China Standard Time)

@puzzlet there's no chance you have an accurate frequency table for this, do you?

정 경훈 (Kyung-hown Chung) · Answer 6 · Tue Feb 26 2013 12:09:49 GMT+0800 (China Standard Time)

@sigmavirus24 No, but I'm pretty sure it's not very different from the GB2312 table, which would be a good starting point. (And I wonder how accurate the original chardet's tables were.)
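
For illustration only, the ranking idea behind a frequency table can be sketched as follows. This assumes Python 3; `frequency_order` is a hypothetical helper and the tiny `corpus` is a stand-in for a large, representative text. charade's real tables use their own layout, which this sketch does not reproduce.

```python
from collections import Counter


def frequency_order(text):
    """Map each non-ASCII character to its frequency rank (0 = most common)."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    # Sort by descending count, then by code point so ties are deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ord(ch)))
    return {ch: rank for rank, ch in enumerate(ranked)}


corpus = "的的的是是一"  # stand-in for a large, representative corpus
order = frequency_order(corpus)
```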

Ian Stapleton Cordasco · Answer 7 · Tue Feb 26 2013 12:13:07 GMT+0800 (China Standard Time)

(I wonder the same thing, but only because I don't implicitly trust many things.) How do you propose adding support for the valid characters that you mentioned were causing issues with the GB2312 encoding?

(Also, sorry for taking so long to get around to this. I'm still really busy, but I've worked through other projects and want to make this better.)

정 경훈 (Kyung-hown Chung) · Answer 8 · Tue Feb 26 2013 12:22:55 GMT+0800 (China Standard Time)

For the behaviour of the detector, we can add GB18030 as a new, separate encoding, as we did in #13.

The issue described above is a bug in the byte-sequence state machine (mbcssm.py): we need to examine the table for GB2312 and add a new one for GB18030.
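
The byte ranges such a state machine would have to accept can be sketched as a plain validator. This is not charade's packed table format from mbcssm.py, only an illustration of the structure, assuming Python 3. `is_structurally_valid_gb18030` is a hypothetical name, and it checks byte ranges only, not whether each sequence maps to an assigned character.

```python
def is_structurally_valid_gb18030(data):
    """Check GB18030 byte-sequence structure (hypothetical helper)."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            i += 1  # single byte (ASCII range)
            continue
        if not 0x81 <= lead <= 0xFE:
            return False  # 0x80 and 0xFF are not valid lead bytes
        if i + 1 >= n:
            return False  # truncated multi-byte sequence
        second = data[i + 1]
        if 0x40 <= second <= 0xFE and second != 0x7F:
            i += 2  # two-byte sequence
            continue
        if 0x30 <= second <= 0x39:
            # Four-byte sequence: bytes 3 and 4 repeat the ranges of 1 and 2.
            if i + 3 >= n:
                return False
            third, fourth = data[i + 2], data[i + 3]
            if 0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39:
                i += 4
                continue
        return False
    return True


# Text containing Đ, which needs a four-byte sequence, passes the check.
ok = is_structurally_valid_gb18030("Đ, đ 是越南语字母".encode("gb18030"))
```

The four-byte branch is the part a GB2312-only table lacks, so it is the likely place where a new GB18030 table would differ.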

Ian Stapleton Cordasco · Answer 9 · Tue Feb 26 2013 12:41:34 GMT+0800 (China Standard Time)

Sounds like a good start.

Ian Stapleton Cordasco · Answer 10 · Fri Mar 22 2013 00:12:52 GMT+0800 (China Standard Time)

The fact that, as late as September of last year, Mozilla didn't have a table for GB18030 is depressing. Here's everything they have: https://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/src/base/

I might write something to quickly strip the C++ and replace it with Python, as a simple way of converting their *.tab files to our familiar *.py files.