The Real First Universal Charset Detector
A library that helps you read text from an unknown charset encoding.
Motivated by chardet, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.
>>>>> ❤️ Try Me Online Now, Then Adopt Me ❤️ <<<<<
This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.
Feature | Chardet | Charset Normalizer | cChardet |
---|---|---|---|
Fast | ❌ | ❌ | ✅ |
Universal** | ❌ | ✅ | ❌ |
Reliable without distinguishable standards | ❌ | ✅ | ✅ |
Reliable with distinguishable standards | ✅ | ✅ | ✅ |
Free & Open | ✅ | ✅ | ✅ |
License | LGPL-2.1 | MIT | MPL-1.1 |
Native Python | ✅ | ✅ | ❌ |
Detect spoken language | ❌ | ✅ | N/A |
Supported Encoding | 30 | 92 | 40 |
Package | Accuracy | Mean per file (ns) | File per sec (est) |
---|---|---|---|
chardet | 93.5 % | 126 081 168 ns | 7.931 file/sec |
cchardet | 97.0 % | 1 668 145 ns | 599.468 file/sec |
charset-normalizer | 97.25 % | 209 503 253 ns | 4.773 file/sec |
** : Chardet and cChardet rely on encoding-specific code paths, even though they cover most commonly used encodings.
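The "File per sec (est)" column in the benchmark above is simply the reciprocal of the mean time per file; a quick sanity check in Python:

```python
# Files per second = one second (in nanoseconds) divided by mean time per file.
NS_PER_SECOND = 1_000_000_000

mean_ns = {
    "chardet": 126_081_168,
    "cchardet": 1_668_145,
    "charset-normalizer": 209_503_253,
}

for package, ns in mean_ns.items():
    print(f"{package}: {NS_PER_SECOND / ns:.3f} file/sec")
# chardet: 7.931 file/sec
# cchardet: 599.468 file/sec
# charset-normalizer: 4.773 file/sec
```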
Please ⭐ this repository if this project helped you!
Using PyPI for the latest stable release
pip install charset-normalizer
Or directly from dev-master for the latest preview
pip install git+https://github.com/Ousret/charset_normalizer.git
This package comes with a CLI.
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
file [file ...]
The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.
positional arguments:
file Filename
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display complementary information about file if any.
Stdout will contain logs about the detection process.
-a, --with-alternative
Output complementary possibilities if any. Top-level
JSON WILL be a list.
-n, --normalize Permit to normalize input file. If not set, program
does not write anything.
-m, --minimal Only output the charset detected to STDOUT. Disabling
JSON output.
-r, --replace Replace file when trying to normalize it instead of
creating a new one.
-f, --force Replace file without asking if you are sure, use this
flag with caution.
-t THRESHOLD, --threshold THRESHOLD
Define a custom maximum amount of chaos allowed in
decoded content. 0. <= chaos <= 1.
normalizer ./data/sample.1.fr.srt
Since version 1.4.0, the CLI produces an easily usable stdout result in JSON format.
{
    "path": "./data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
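Because the result is plain JSON on stdout, it is easy to consume from another program. A minimal sketch that parses a captured result like the one above (truncated here to a few fields) with the standard library:

```python
import json

# A captured CLI result, truncated to a few fields for brevity.
raw = '''
{
    "path": "./data/sample.1.fr.srt",
    "encoding": "cp1252",
    "language": "French",
    "chaos": 0.149,
    "coherence": 97.152,
    "is_preferred": true
}
'''

result = json.loads(raw)
print(result["encoding"], result["language"])  # cp1252 French
```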
Just print out normalized text
from charset_normalizer import CharsetNormalizerMatches as CnM
print(CnM.from_path('./my_subtitle.srt').best().first())
Normalize any text file
from charset_normalizer import CharsetNormalizerMatches as CnM
try:
CnM.normalize('./my_subtitle.srt') # should write to disk my_subtitle-***.srt
except IOError as e:
print('Sadly, we are unable to perform charset normalization.', str(e))
Upgrade your code without effort
from charset_normalizer import detect
The above import gives you a detect function that behaves the same as chardet's, so the rest of your code does not need to change.
See the docs for advanced usage: readthedocs.io
When I started using Chardet, I noticed that it did not meet my expectations, so I wanted to propose a reliable alternative using a completely different method. Also, I never back down from a good challenge!
I don't care about the originating charset encoding, because two different tables can produce two identical files. What I want is to get readable text, the best I can.
In a way, I'm brute-forcing text decoding. How cool is that?
Don't confuse the ftfy package with charset-normalizer or chardet: ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.
- Discard all charset encoding tables that cannot fit the binary content.
- Measure the chaos, or mess, once the content is opened with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Finally, if too many matches are left, measure coherence.
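The steps above can be sketched with the standard library alone. This is a toy illustration of the idea, not the library's actual implementation; the `mess_ratio` heuristic here simply counts control and unassigned characters:

```python
import unicodedata

# Illustrative candidate list; the real library walks every codec Python ships.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1"]

def mess_ratio(text: str) -> float:
    """Toy chaos metric: fraction of control or unassigned characters."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text
              if unicodedata.category(ch) in ("Cc", "Cn") and ch not in "\r\n\t")
    return bad / len(text)

def guess(payload: bytes) -> list[tuple[str, float]]:
    matches = []
    for encoding in CANDIDATES:
        try:
            text = payload.decode(encoding)           # 1. discard non-fitting tables
        except UnicodeDecodeError:
            continue
        matches.append((encoding, mess_ratio(text)))  # 2. measure the mess
    matches.sort(key=lambda match: match[1])          # 3. lowest mess first
    return matches                                    # 4. ties go to a coherence check

print(guess("héllo wörld".encode("cp1252")))
# [('cp1252', 0.0), ('latin_1', 0.0)] -- ambiguity left for the coherence step
```

ASCII and UTF-8 are rejected outright here because the bytes cannot be decoded with them, while cp1252 and latin_1 both fit cleanly, which is exactly why a second, coherence-based criterion is needed.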
Wait a minute: what are chaos/mess and coherence according to YOU?
Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established, some ground rules about what is obviously wrong when text looks like a mess. I know my interpretation of what is chaotic is very subjective; feel free to contribute in order to improve or rewrite it.
Coherence: For each language on Earth, we have computed ranked letter-frequency records (as best we can). I figured that intel is worth something here, so I use those records against the decoded text to check whether I can detect intelligent design.
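A toy version of that coherence check. The rankings below are illustrative stand-ins, not the frequency tables the project actually computed:

```python
from collections import Counter

# Hypothetical most-frequent-letter rankings; the real project uses
# precomputed per-language letter-frequency records.
LANGUAGE_RANKS = {
    "English": list("etaoinshrdlu"),
    "French": list("esaitnrulodc"),
}

def coherence(text: str, language: str) -> float:
    """Share of a language's expected top letters found among the text's top letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    top = [ch for ch, _ in Counter(letters).most_common(12)]
    expected = LANGUAGE_RANKS[language]
    return sum(1 for ch in expected if ch in top) / len(expected)

print(coherence("the rain in spain stays mainly on the plain", "English"))
print(coherence("zzzz qqqq xxxx", "English"))  # 0.0 -- no intelligent design here
```

Real text scores high because its letter distribution matches the language's ranking, while gibberish (or a wrong decoding) scores near zero.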
- Not intended to work on text that is not a (human) spoken language, e.g. encrypted content.
- Language detection is unreliable when text contains two or more languages sharing identical letters.
- Not well tested with tiny content.
Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.
Copyright © 2019 Ahmed TAHRI @Ousret.
This project is MIT licensed.
Letter appearance frequencies used in this project © 2012 Denny Vrandečić