sv24-archive / charade

NO LONGER MAINTAINED. USE chardet/chardet. Fork of chardet to support Python 2 and 3 in one code base.

Home Page: https://github.com/kennethreitz/requests/issues/951

Support GB18030

puzzlet opened this issue

GB18030 is a superset of Chinese encoding GB2312, which charade already supports.

Like #10, we can support this by:

  • renaming GB2312-related classes and constants to GB18030
  • and patching the byte-sequence state machine in mbcssm.py

In this case, however, I will gladly rename it and make a note that GB2312 is superseded by GB18030, since it's actually an official standard.

We will probably also need to add to this to make sure GB18030 is covered as well, since it is an expansion of GB2312, right? I don't mind replacing GB2312 in this case, but I want to make sure we cover GB18030 properly.

The module currently seems to have flaws with the encoding, as it incorrectly identifies strings with non-GB2312 characters as GB2312:

# from http://lifesinger.github.com/lab/2009/loadtime/test_hubble.html
>>> x = u'(人工の測量,1~2センチメートルの誤差があるかもしれない 尺码标注仅供参考,不能作为退货理由)'
>>> x.encode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u6e2c' in position 4: illegal multibyte sequence
>>> import charade; charade.detect(x.encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
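
The same point can be checked without the detector. The following is a minimal sketch that uses only Python's built-in codecs; the helper name `can_encode` is hypothetical and not part of charade. It shows that a character such as 測 has no GB2312 encoding but does have a GB18030 one, so bytes containing it cannot really be GB2312.

```python
# -*- coding: utf-8 -*-
def can_encode(text, codec):
    """Return True if every character of text is encodable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample: one non-GB2312 character plus GB2312 characters.
sample = u'測量 尺码'
for codec in ('gb2312', 'gb18030'):
    print(codec, can_encode(sample, codec))
```

A detector that answers GB2312 for the GB18030 bytes of such a string is therefore naming an encoding that could not have produced them.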

But it's not a correct GB18030 detector either, as it rules out some unusual but valid byte sequences:

# from http://zh.wikipedia.org/zh-cn/%C4%90
>>> x = u'Đ, đ(d-stroke)是标准越南语跟标准克罗地亚语和波斯尼亚语所使用的字母,但两者所表示的音位完全不同。它是越南语字母表的第 7 个字 母、克罗地亚语字母表的第 8 个字母。'
>>> charade.detect(x.encode('GB18030'))
{'confidence': 0.22369399198761675, 'encoding': 'ISO-8859-2'}
>>> charade.detect(x[4:].encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
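
One likely reason for the difference between the two calls is that Đ and đ have no two-byte form in GB18030 and are encoded as four-byte sequences, whose second and fourth bytes fall in 0x30-0x39. That range is not a valid trail byte in GB2312, so a GB2312-only state machine would reject the input. The sketch below, written for Python 3 and using a hypothetical helper name, prints the encoded length of each character so this can be inspected directly.

```python
def gb18030_units(text):
    """Return (character, encoded bytes) pairs for each character in text."""
    return [(ch, ch.encode('gb18030')) for ch in text]


for ch, raw in gb18030_units('Đđ是'):
    # Length 4 marks the four-byte form; length 2 is the GB2312-style form.
    print(repr(ch), len(raw), raw.hex())
```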

Thanks for the extra information @puzzlet. I'm hoping to have some time for charade this weekend or next.

@puzzlet there's no chance you have an accurate frequency table for this, do you?

@sigmavirus24 no; but I'm pretty sure it's not very different from the GB2312 table, which would be a good starting point. (and I wonder how accurate original chardet's tables used to be?)

(I wonder the same thing. But just because I don't implicitly trust many things.) How do you propose adding support for the valid characters you mentioned were causing issues with the GB2312 encoding?

(Also, sorry for taking so long to get around to this. I'm still really busy, but I've worked through other projects and want to make this better.)

For the behaviour of the detector, we can add GB18030 as a new, separate encoding as we did in #13.

The issue described above is a bug in the byte-sequence state machine (mbcssm.py): we need to examine the table for GB2312 and add a new one for GB18030.
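
To make the target concrete, here is a rough sketch of the byte structure a GB18030 state machine has to accept: one byte (ASCII), two bytes (lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE), or four bytes (0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39). It is a plain loop, not the packed table format used in mbcssm.py, and the function name is hypothetical. It checks structure only, not whether a code point is actually assigned, and it treats the lone byte 0x80 as invalid, which is an assumption.

```python
def gb18030_is_well_formed(data):
    """Return True if data follows GB18030's 1-, 2- or 4-byte structure."""
    data = bytearray(data)  # gives integer bytes on Python 2 and 3
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:  # single byte (ASCII)
            i += 1
            continue
        if not 0x81 <= lead <= 0xFE or i + 1 >= n:
            return False  # bad lead byte or truncated sequence
        second = data[i + 1]
        if 0x40 <= second <= 0xFE and second != 0x7F:
            i += 2  # two-byte sequence
        elif 0x30 <= second <= 0x39:
            if i + 3 >= n:
                return False  # truncated four-byte sequence
            third, fourth = data[i + 2], data[i + 3]
            if not (0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39):
                return False
            i += 4  # four-byte sequence
        else:
            return False
    return True


print(gb18030_is_well_formed(u'Đ, đ 是'.encode('gb18030')))
```

A real table for mbcssm.py would encode the same transitions as byte classes and states, alongside the existing GB2312 model.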

Sounds like a good start.

The fact that, as late as September of last year, Mozilla didn't have a table for GB18030 is depressing. Here's everything they have: https://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/src/base/

I might write something to quickly strip the C++ and replace everything with Python, so we have a simple way of converting their *.tab files to our familiar *.py files.
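
Such a converter might look roughly like the sketch below. It assumes a table file is essentially a comma-separated list of decimal integers with C-style comments; the real Mozilla files may contain macros or other syntax that this does not handle. The function name `tab_to_py` and the sample input are hypothetical and not copied from Mozilla's sources.

```python
import re


def tab_to_py(source, name):
    """Turn C-style table text into a Python tuple assignment (sketch)."""
    # Drop /* ... */ block comments first, then // line comments.
    text = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", " ", text)
    # Keep only standalone decimal integers, in their original order.
    values = [int(tok) for tok in re.findall(r"\b\d+\b", text)]
    lines = ["%s = (" % name]
    for start in range(0, len(values), 8):
        chunk = values[start:start + 8]
        lines.append("    " + ", ".join(str(v) for v in chunk) + ",")
    lines.append(")")
    return "\n".join(lines) + "\n"


# Hypothetical input for illustration only.
sample = """
/* example table */
1, 2, 3, // first row
40, 5,   /* second row */
"""
print(tab_to_py(sample, "EXAMPLE_TABLE"))
```

The output is plain Python source, so it can be written straight into a *.py module and reviewed by hand before use.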