Inconsistent results between python2 and python3 versions for UTF-XX BOM

Question

byroot opened this issue 12 years ago · comments

Hi,

For all the bom-utf-*.srt files that you can find in this directory: https://github.com/byroot/pysrt/tree/master/tests/static

Charade under Python 2.7 returns the correct result, but under Python 3.2 it returns a bunch of distinct Latin-like encodings: iso-8859-2, windows-1252, IBM855, etc.

Note that this bug was already present in chardet2.

Ian Stapleton Cordasco · Answer 1 · Fri Dec 14 2012 10:10:36 GMT+0800 (China Standard Time)

Thanks for the report (love your gravatar by the way). Do you know if chardet2 ever solved this issue? In all candor, I have a serious inkling as to what this is, but I'm probably wrong (I am actually under-educated about character encodings).

I'm going to branch off develop and I'll ping you here when I think I've covered what might be the problem.

Also, I just did some research; I hadn't realized you were the owner of chardet2. Cheers.

Jean Boussier · Answer 2 · Fri Dec 14 2012 10:21:20 GMT+0800 (China Standard Time)

As far as I know, chardet2 did not solve the issue. And by the way, I'm the owner, but I only spent a few hours searching the Debian and Arch repositories to rescue the code after Mark Pilgrim's infosuicide.
I don't know the code well; I just needed it to be available on PyPI :/

But since these files have a proper BOM (a few bytes at the start of the file that declare the file encoding), and since the original chardet was able to detect them, I suppose that the BOM detection feature was lost during the rewrite or simply broke.
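For readers unfamiliar with BOMs, here is a minimal sketch of what BOM-based detection can look like. It is not charade's actual code: the `BOM_TABLE` list, the `detect_bom` function, and the returned encoding names are hypothetical, and only the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Hypothetical lookup table. The UTF-32 entries come first because the
# UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOM_TABLE:
        # Both sides are byte strings, so this works on Python 2 and 3.
        if data.startswith(bom):
            return name
    return None
```

The key point is that both the input and the constants are byte strings, so the comparison does not depend on the Python version.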

I'll take a look at it soon to see if I'm able to fix that.

Ian Stapleton Cordasco · Answer 3 · Fri Dec 14 2012 11:01:02 GMT+0800 (China Standard Time)

I have a very strong feeling this is related to #3. My instinct is that it comes down to bytes and the character comparisons. The problem is that there's already an issue with bytes in #2. I'm not sure how familiar you are with Mark's move to make chardet2 (the Python 3 only version), but the strings received are now byte strings (at least the ones in the tests). As such, if a is a byte string, then a[index] is actually an int, not a string. Based on that, the comparisons all had to be changed to integers (and Python 2 wrapped to behave likewise).
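A small illustration of the indexing behavior described above, assuming Python 3. The variable names are only for illustration and do not come from charade.

```python
data = b"\xef\xbb\xbfabc"  # UTF-8 BOM followed by ASCII text

# Indexing a byte string gives an int on Python 3
# (on Python 2 it would give a one-character str).
first = data[0]

# Slicing gives a byte string on both versions, so it compares
# cleanly against a bytes literal.
starts_with_ef = data[:1] == b"\xef"

# On Python 3, a bytes value never equals a text (str) value,
# even when the code points look identical.
text_vs_bytes = data[:3] == "\xef\xbb\xbf"
```

Under these assumptions `first` is the integer 239 and `text_vs_bytes` is `False`, which is the kind of silent mismatch that would make a BOM check fail only on Python 3.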

One problem I found with the transition is that Mark (in his documentation) converted "\202" to 202. This is disconcerting to me, since ord("\202") gives 130. This has always bothered me, but the same number of tests fail in either version of Python. The first thing I'm going to play with is that value. Beyond that, I'm going to do a lot more research into how to properly handle the bytes/unicode part of this.
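The discrepancy comes from "\202" being an octal escape, so it does not denote the decimal number 202. A quick check (variable names are illustrative):

```python
# "\202" is an octal escape: 2*64 + 0*8 + 2 = 130 (0x82).
octal_value = ord("\202")

# Parsing the digits "202" as base 8 gives the same number.
parsed = int("202", 8)

# The same escape in a bytes literal yields the same integer on Python 3.
byte_value = b"\202"[0]
```

All three values are 130, so a literal translation of "\202" to the integer 202 changes the byte being compared.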

The big problem with bytes/unicode/etc is that requests (@kennethreitz) passes unicode objects to charade (as he should expect to be able to). This causes TypeErrors (if I remember correctly) with the re module since we use byte literals and we're being passed unicode literals.

One might think: well, just convert all input to byte strings. The issue with this is that when you call bytes() you're expected to pass an encoding. Any encoding we use will bias charade and, in essence, make the library useless.
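To make the encoding problem concrete, here is a short sketch assuming Python 3. The sample text and variable names are made up for illustration.

```python
text = "caf\u00e9"  # "café" as a text (unicode) string

# On Python 3, bytes() refuses a str argument without an encoding.
try:
    bytes(text)
except TypeError:
    needs_encoding = True
else:
    needs_encoding = False

# The resulting bytes depend entirely on the encoding that was chosen,
# so a detector run on them would mostly detect that choice.
as_utf8 = text.encode("utf-8")      # b'caf\xc3\xa9'
as_latin1 = text.encode("latin-1")  # b'caf\xe9'
```

Because `as_utf8` and `as_latin1` differ, whichever encoding the library picked for the conversion would predetermine what it then "detects".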

If anyone has any ideas about this, I'd love to hear them.

Jean Boussier · Answer 4 · Fri Dec 14 2012 11:05:14 GMT+0800 (China Standard Time)

I also figured out a few minutes ago that the issue was basically a comparison between bytes and str.
I'm making the change right now. I'll send it to you in a few minutes. Hold on! :)

Ian Stapleton Cordasco · Answer 5 · Fri Dec 14 2012 11:19:56 GMT+0800 (China Standard Time)

I eagerly await it.

Jean Boussier · Answer 6 · Fri Dec 14 2012 11:24:00 GMT+0800 (China Standard Time)

It's pretty simple: byroot@e5e8add

I just converted the str constants to bytes before comparing.

But maybe I'm missing something. If the str comparison was not a mistake, then we can do both, depending on the input type.
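As a rough idea of what "doing both" could mean, here is a hypothetical helper that branches on the input type. It is not the content of the commit linked above; the function name is invented, and the text branch assumes the BOM was decoded to the code point U+FEFF, which may not match how callers actually produce their text.

```python
import codecs


def starts_with_utf8_bom(data):
    """Hypothetical check that accepts either byte strings or text."""
    if isinstance(data, bytes):
        # Byte input: compare against the raw three-byte UTF-8 BOM.
        return data.startswith(codecs.BOM_UTF8)
    # Text input (assumption): a decoded BOM shows up as U+FEFF.
    return data.startswith("\ufeff")
```

Whether the text branch is useful at all depends on the open question in this thread about how unicode input should be handled.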

Ian Stapleton Cordasco · Answer 7 · Fri Dec 14 2012 11:28:55 GMT+0800 (China Standard Time)

How do the tests perform? (Also thanks for adding your tests)

Jean Boussier · Answer 8 · Fri Dec 14 2012 11:35:17 GMT+0800 (China Standard Time)

Before and after my change there are 25 failed tests, but note that I only ran them with Python 3.2.

I'll run them under 2.7 right now.

Jean Boussier · Answer 9 · Fri Dec 14 2012 11:38:33 GMT+0800 (China Standard Time)

OK, under Python 2.7:

before:
Ran 377 tests in 4.284s

FAILED (failures=1, errors=337)

after:
Ran 382 tests in 4.256s

FAILED (failures=1, errors=337)

So I don't like to say it given these results, but it looks good... :)

Ian Stapleton Cordasco · Answer 10 · Fri Dec 14 2012 11:40:12 GMT+0800 (China Standard Time)

Awesome. If they do well, send along the pull request. I'm going to have to eventually fix those failed tests. I'm kind of swamped right now, so it's contending for the highest priority with a few other projects.

Jean Boussier · Answer 11 · Fri Dec 14 2012 11:42:02 GMT+0800 (China Standard Time)

No problem I totally understand that.

Pull request sent.

I'll also backport that to chardet2 and update it on PyPI.