sv24-archive / charade

NO LONGER MAINTAINED. USE chardet/chardet. Fork of chardet to support Python 2 and 3 in one code base.

Home Page:https://github.com/kennethreitz/requests/issues/951

Implement language constraints

honzajavorek opened this issue · comments

Merge with the chared project and support more reliable detection when the language of the text is known.

(crossposting)

@honzajavorek it isn't possible to merge with chared for a few reasons:

  1. Their detection algorithms and ours are wholly different. Charade uses a two-method detection scheme developed by Mozilla.
  2. Best I can tell, they're storing data in .edm files which (if I remember correctly) are Adobe Dreamweaver files.
  3. They seem to be receiving funding and to be working from an academic lab. They probably had to write a proposal for any funding they receive, and were they to adopt the charade/chardet algorithm, this might invalidate their proposal.
  4. Their algorithm is probably faster than ours, but ours was designed to be incredibly accurate because of the combination of the two methods being used.

I appreciate the suggestion though.

Thank you for explanation.

Just for completeness' sake, guys from the project replied too and I think it is interesting reading.

Hm that's very interesting.

I think they misunderstood me. I was saying that, were the projects to merge, we'd either have to use both algorithms (from chared and charade) or choose one. They'd naturally choose theirs, and that's fine, but they might invalidate their funding proposals if they chose charade. Of course, they seem to indicate that they're no longer being funded to work on chared, so that's not an issue.

While they're not continuing chared's development, I wholeheartedly intend to keep developing charade (including adding the languages planned in the issue tracker) and to fix the existing issues. So there's that. The main reason charade exists is kennethreitz/requests, so if you can convince him to use chared instead, I'll be happy to switch gears and contribute to their efforts. I suspect, however, that Kenneth is satisfied with charade and not looking to switch.

The main reason I thought it would be nice to merge these two projects was the idea that it could improve charade's detection. I thought charade tests the input text against many indicators, each with some priority (weight), and picks the detected encoding by weighted majority of their results. I have no idea if it really works this way. I thought the algorithms in chared could be merged in and used as one of the many indicators, probably with a higher priority if the input language is explicitly set by the user, or something like that.
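To make that idea concrete, here is a minimal Python sketch of weighted-majority voting with an optional language hint. It is only an illustration of the scheme described above, not how charade or chared actually work. Every name in it (`Indicator`, `LANGUAGE_BOOST`, `detect`, and the two toy indicator functions) is hypothetical, and the weights are arbitrary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Indicator:
    """One hypothetical detection heuristic and its voting weight."""
    name: str
    weight: float
    guess: Callable[[bytes, Optional[str]], Optional[str]]
    language_aware: bool = False


# Arbitrary multiplier applied to language-aware indicators
# when the caller supplies a language hint (an assumption).
LANGUAGE_BOOST = 3.0


def detect(data, indicators, language=None):
    """Return the encoding with the highest total weight, or None."""
    scores = defaultdict(float)
    for indicator in indicators:
        encoding = indicator.guess(data, language)
        if encoding is None:
            continue  # this indicator abstains
        weight = indicator.weight
        if indicator.language_aware and language is not None:
            weight *= LANGUAGE_BOOST
        scores[encoding] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


def generic_indicator(data, language):
    # Toy heuristic: any high-bit byte is assumed to mean Latin-1.
    return "iso-8859-1" if any(b >= 0x80 for b in data) else "ascii"


def czech_indicator(data, language):
    # Toy stand-in for a language-specific model; abstains without a hint.
    if language != "cs":
        return None
    # These bytes encode Š, Ž, š, ž in Windows-1250 but are
    # control codes in ISO-8859-2.
    if any(b in (0x8A, 0x8E, 0x9A, 0x9E) for b in data):
        return "windows-1250"
    return "iso-8859-2"


indicators = [
    Indicator("generic", 2.0, generic_indicator),
    Indicator("czech-model", 1.0, czech_indicator, language_aware=True),
]

data = "žluťoučký".encode("cp1250")
without_hint = detect(data, indicators)
with_hint = detect(data, indicators, language="cs")
```

With these toy weights, the generic indicator wins when no language is given, while the language-aware indicator outvotes it once the caller passes `language="cs"`.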

I understand from your comments that this is not how it could work, and that chared and charade are separate projects with different algorithms, architecture and aims.

So I never thought of replacing charade by chared - I thought charade could rather exploit chared and become better :-)


And the original reason I came to this was that charade was not able to properly detect Czech encodings (Windows-1250, Latin-2) when used through Kenneth's requests, whilst chared was (unsurprisingly, since it came from a Czech university environment, so Czech was probably the very first language they tried).
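As a small illustration of why these two encodings are easy to confuse, the Python sketch below encodes one Czech phrase both ways and lists the characters whose bytes differ. It uses only Python's built-in `cp1250` and `iso-8859-2` codecs; the sample phrase and variable names are chosen just for this example.

```python
text = "Příliš žluťoučký kůň"

cp1250_bytes = text.encode("cp1250")
latin2_bytes = text.encode("iso-8859-2")

# Both encodings are single-byte, so characters and bytes line up one to one.
differing = sorted(
    {ch for ch, a, b in zip(text, cp1250_bytes, latin2_bytes) if a != b}
)

# Decoding with the wrong codec does not raise an error;
# it silently yields different characters.
misdecoded = latin2_bytes.decode("cp1250")

print(differing)
print(misdecoded)
```

Most accented Czech letters share the same byte in both encodings, so a detector has only a few distinguishing bytes to work with, and a short text may contain none of them.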