potiuk / test-charset-normalizer

Context: Why compare chardet and charset-normalizer?

This is an experiment comparing the results of chardet and charset-normalizer. The chardet package is currently (20th of May 2021) a mandatory dependency of requests, and since chardet is LGPL-licensed, requests cannot be used as a mandatory dependency in Apache Software Foundation projects.

Since requests is currently the 3rd most popular package on PyPI, this excludes a lot of packages from being used in Apache Software Foundation projects. A group of Apache Airflow PMC members attempted to convince the requests maintainers to switch to charset-normalizer, which is MIT-licensed and seems to provide very similar functionality. The charset-normalizer author @Ousret is helping to get it there.

Hopefully, the results of the tests performed and the resulting fixes to charset-normalizer will make it appealing enough for the requests maintainers to switch to it.

Comparing encoding detection vs. chardet

The test compares detection results for the main pages of ~33,000 sites (the top 1000 sites from 80 countries in the world, according to Data for SEO).
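
For each page, the comparison boils down to running both detectors on the same raw bytes and checking whether the detected encodings agree. Below is a minimal sketch of that per-page check; it is illustrative only, not the repository's actual script, it assumes the charset-normalizer 2.x from_bytes API, and the helper names are made up:

    import codecs

    import chardet
    from charset_normalizer import from_bytes  # charset-normalizer 2.x API

    def canonical(name):
        """Map an encoding label to Python's canonical codec name so that
        e.g. 'UTF-8' and 'utf_8' compare equal; pass unknown labels through."""
        try:
            return codecs.lookup(name).name if name else None
        except LookupError:
            return name

    def compare_detection(raw):
        """Return (chardet guess, charset-normalizer guess, agreement flag)."""
        chardet_guess = chardet.detect(raw)["encoding"]
        best = from_bytes(raw).best()
        normalizer_guess = best.encoding if best is not None else None
        return (canonical(chardet_guess),
                canonical(normalizer_guess),
                canonical(chardet_guess) == canonical(normalizer_guess))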

Prerequisites

  • venv with dependencies from requirements.txt
  • GNU Parallel

Running encoding comparisons

In order to run the comparison you need to:

  • create and switch to a virtualenv using requirements.txt
  • have GNU Parallel installed
  • have a decently powerful machine to run it on (by default the run spawns 34 processes reading and running detection on the content from ~30K sites, and they keep a 16-core CPU busy for more than 1 hour)
  • run ./run_site_comparision.sh (a rough Python illustration of the per-URL work follows this list)
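
The actual parallelism is driven by GNU Parallel over the URL chunks; purely for illustration, a rough Python equivalent that fans the per-URL work out over 34 worker processes could look like this (the function names and timeout are made up; URLS.csv is the real input file):

    from concurrent.futures import ProcessPoolExecutor

    import chardet
    import requests
    from charset_normalizer import from_bytes  # charset-normalizer 2.x API

    def detect_for_url(url):
        """Fetch one page and return both detectors' guesses for its raw bytes."""
        raw = requests.get(url, timeout=30).content  # assumes the CSV rows are full URLs
        best = from_bytes(raw).best()
        return (url,
                chardet.detect(raw)["encoding"],
                best.encoding if best is not None else None)

    def run_all(urls, workers=34):
        """Fan the per-URL detection out over worker processes."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(detect_for_url, urls))

    if __name__ == "__main__":
        with open("URLS.csv") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url, chardet_enc, normalizer_enc in run_all(urls):
            print(url, chardet_enc, normalizer_enc)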

Comparing performance vs. chardet

The tests here generate different kinds of big files and compare both elapsed time and memory usage of processing for chardet and charset-normalizer.
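
The measurement itself is done by the repository's shell scripts; purely as an illustration, a minimal Python sketch of timing each detector on one generated file could look like the following (the file name is a placeholder, and tracemalloc only tracks Python-level allocations):

    import time
    import tracemalloc

    import chardet
    from charset_normalizer import from_bytes  # charset-normalizer 2.x API

    def measure(label, detector, data):
        """Report wall-clock time and peak Python-level memory for one detector."""
        tracemalloc.start()
        start = time.monotonic()
        detector(data)
        elapsed = time.monotonic() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{label}: {elapsed:.2f}s, peak ~{peak / 2**20:.1f} MiB")

    with open("big_files/example.txt", "rb") as f:  # placeholder file name
        data = f.read()

    measure("chardet", chardet.detect, data)
    measure("charset-normalizer", from_bytes, data)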

Prerequisites

  • venv with dependencies from requirements.txt
  • numfmt

Running performance tests

  • Run ./generate_all_files.sh -> generates all files in the "big_files" folder (a rough Python sketch of generating one such file follows this list)
  • Run the comparison for a selected file: ./run_file_system_comparision.sh big_files/<FILE_NAME>
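
For illustration only, a file of a given size could be generated by repeating a small text sample until the target size is reached; the sample text, size and output path below are assumptions, and generate_all_files.sh is the authoritative generator:

    def write_repeated(sample, target_bytes, path, encoding="utf-8"):
        """Repeat a text sample until the output file reaches roughly target_bytes."""
        chunk = sample.encode(encoding)
        written = 0
        with open(path, "wb") as out:
            while written < target_bytes:
                out.write(chunk)
                written += len(chunk)

    # e.g. a ~16 MB file of Polish text for the non-ASCII benchmarks (sample text made up)
    write_repeated("Zażółć gęślą jaźń. ", 16 * 2**20, "big_files/polish_16MB.txt")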

More information

Credits

  • The PR to implement Charset was done by @ashb.
  • The original version of the test was written by @da191 and tested on the top 500 Alexa sites.
  • @sigmavirus24 for understanding the needs of the users and looking into it despite the feature freeze of requests
  • @ntaeprewitt for caring about performance, large files and fallback behaviour
  • Special thanks to @Ousret for super-speedy diagnosis and fixes in charset-normalizer with < 1 day turnaround.

Encoding test files

  • URLS.csv - list of 33138 top sites from 80 countries in the world
  • URLS-split/x* - URLS.csv split into smaller chunks to parallelize the work
  • res/ - results of the encoding comparison tests (per chunk and combined)
    • all - all URLs tested
    • different - different encodings returned by chardet/charset-normalizer
    • same - same encodings returned by chardet/charset-normalizer
    • exceptions.txt - exceptions caught during charset-normalizer processing
    • summary.txt - summary counts (processed urls, skipped urls, urls without encoding, exceptions)

Performance test files

  • big_files - the generated big files are stored here

Compatibility with chardet when it comes to encoding

Thanks to the tests performed, we have already identified several bugs in charset-normalizer, and the super-responsive @Ousret has fixed them.

Performance test results compared to chardet

Preliminary results:

19.05.2021 (charset-normalizer 1.3.9)

Processing big files seems to be the weak point of charset-normalizer. For both chardet and charset-normalizer, processing time is proportional to the amount of data in the content, but charset-normalizer is roughly 20x slower than chardet.

| Size  | Reading file | Chardet detection | Charset-normalizer detection |
|-------|--------------|-------------------|------------------------------|
| 16MB  | 5.5ms        | 0.17s             | 2.92s                        |
| 32MB  | 11ms         | 0.32s             | 6.46s                        |
| 64MB  | 22ms         | 0.64s             | 12.9s                        |
| 128MB | 44ms         | 1.3s              | 25.6s                        |
| 256MB | 88ms         | 2.6s              | 50.8s                        |

Changing the parameters passed to CnM.from_bytes did not seem to have a significant impact on the timing:

    from charset_normalizer import CharsetNormalizerMatches as CnM  # charset-normalizer 1.x API

    def detect(data_to_detect):  # illustrative wrapper; the test script's own function may be named differently
        return CnM.from_bytes(
            data_to_detect,
            steps=10,  # Number of steps/blocks to extract from my_byte_str
            chunk_size=512,  # Set block size of each extraction
            threshold=0.2,  # Maximum amount of chaos allowed on first pass
            preemptive_behaviour=False,  # Determine if we should look into my_byte_str (ASCII-Mode) for pre-defined encoding
            explain=False  # Print on screen what is happening when searching for a match
        )
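
With the 1.x API, the detected encoding can then be read from the returned matches roughly as follows (a sketch; detect is the illustrative wrapper above, and the exact method chain should be checked against the 1.x docs):

    matches = detect(data_to_detect)
    best = matches.best().first()  # best() narrows the matches, first() picks one (or None)
    encoding = best.encoding if best is not None else None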

21.05.2021 (charset-normalizer 1.4.0 candidate)

  • Encoding comparison: no exceptions, and an 83% match between chardet and charset-normalizer on sites without a declared encoding
  • Performance comparison

It seems that chardet does very well on simple content. After some improvements in the 1.4.0 candidate, the difference between chardet and charset-normalizer for ASCII-only files decreased to ~10x from ~20x (tests performed on a different machine):

| Size  | Reading file | Chardet detection | Charset-normalizer detection |
|-------|--------------|-------------------|------------------------------|
| 16MB  | 4.5ms        | 0.20s             | 1.9s                         |
| 32MB  | 8ms          | 0.45s             | 3.17s                        |
| 64MB  | 14ms         | 0.91s             | 7.32s                        |
| 128MB | 27ms         | 1.8s              | 14.7s                        |
| 256MB | 51ms         | 3.6s              | 28.6s                        |

However, things get more interesting when the files contain encoded characters beyond plain ASCII. It seems that charset-normalizer is able to detect the encoding of even big files, while chardet performs very poorly on big files containing non-ASCII characters.

In this case chardet's processing time appears to be proportional to the size of the data, whereas charset-normalizer's depends much less than linearly on the size. In both cases below, the point at which charset-normalizer becomes faster than chardet lies between 32K and 64K of data.
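
One way to locate this crossover empirically is to time both detectors on doubling slices of the same non-ASCII sample; this is an illustrative sketch, and the sample file and the 1MB cap are assumptions:

    import time

    import chardet
    from charset_normalizer import from_bytes  # charset-normalizer 2.x API

    def elapsed(detector, payload):
        start = time.monotonic()
        detector(payload)
        return time.monotonic() - start

    with open("big_files/polish_16MB.txt", "rb") as f:  # placeholder non-ASCII sample
        sample = f.read()

    size = 4 * 1024
    while size <= min(len(sample), 2**20):  # cap at 1MB; chardet gets very slow beyond that
        payload = sample[:size]
        print(f"{size // 1024}K: chardet={elapsed(chardet.detect, payload):.2f}s, "
              f"charset-normalizer={elapsed(from_bytes, payload):.2f}s")
        size *= 2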

Detecting Polish characters:

| Size  | Reading file | Chardet detection | Charset-normalizer detection |
|-------|--------------|-------------------|------------------------------|
| 4K    | 0.027ms      | 0.05s             | 0.28s                        |
| 8K    | 0.036ms      | 0.17s             | 0.52s                        |
| 16K   | 0.032ms      | 0.18s             | 0.54s                        |
| 32K   | 0.039ms      | 0.35s             | 0.29s                        |
| 64K   | 0.075ms      | 0.7s              | 0.68s                        |
| 128K  | 0.087ms      | 1.4s              | 0.68s                        |
| 256K  | 0.153ms      | 2.78s             | 0.57s                        |
| 512K  | 0.242ms      | 5.63s             | 0.34s                        |
| 1MB   | 0.470ms      | 11.2s             | 0.77s                        |
| 2MB   | 0.942ms      | 22.2s             | 1.38s                        |
| 4MB   | 1.45ms       | 45s               | 0.93s                        |
| 8MB   | 2.39ms       | 90s               | 1.08s                        |
| 16MB  | 4ms          | 180s              | 0.85s                        |
| 32MB  | 8ms          | >> 3m             | 3.17s                        |
| 64MB  | 14ms         | >> 3m             | 7.32s                        |
| 128MB | 27ms         | >> 3m             | 14.7s                        |
| 256MB | 51ms         | >> 3m             | 28.6s                        |

Detecting Japanese characters:

| Size  | Reading file | Chardet detection | Charset-normalizer detection |
|-------|--------------|-------------------|------------------------------|
| 4K    | 0.03ms       | 0.06s             | 0.32s                        |
| 8K    | 0.03ms       | 0.23s             | 0.63s                        |
| 16K   | 0.04ms       | 0.25s             | 0.63s                        |
| 32K   | 0.04ms       | 0.49s             | 0.35s                        |
| 64K   | 0.07ms       | 0.98s             | 0.78s                        |
| 128K  | 0.09ms       | 1.97s             | 0.77s                        |
| 256K  | 0.15ms       | 3.97s             | 0.65s                        |
| 512K  | 0.27ms       | 7.8s              | 0.67s                        |
| 1MB   | 0.59ms       | 15.5s             | 0.89s                        |
| 2MB   | 0.95ms       | 31.3s             | 1.01s                        |
| 4MB   | 1.53ms       | 63s               | 1.10s                        |
| 8MB   | 2.3ms        | 240s              | 1.27s                        |
| 16MB  | 4ms          | >> 3m             | 1.26s                        |
| 32MB  | 8ms          | >> 3m             | 1.75s                        |
| 64MB  | 14ms         | >> 3m             | 5.46s                        |
| 128MB | 25ms         | >> 3m             | 4.10s                        |
| 256MB | 49ms         | >> 3m             | 8.2s                         |

05.07.2021 (charset-normalizer 2.0.0)

Summary: charset-normalizer became really fast

ASCII characters:

| Size  | Chardet | Charset-normalizer |
|-------|---------|--------------------|
| 4K    | 0.007s  | 0.018s             |
| 16MB  | 0.165s  | 0.019s             |
| 32MB  | 0.332s  | 0.019s             |
| 64MB  | 0.648s  | 0.010s             |
| 128MB | 1.293s  | 0.005s             |
| 256MB | 2.579s  | 0.018s             |

Polish characters:

| Size | Chardet  | Charset-normalizer |
|------|----------|--------------------|
| 4K   | 0.027s   | 0.053s             |
| 8K   | 0.054s   | 0.069s             |
| 16K  | 0.106s   | 0.074s             |
| 32K  | 0.213s   | 0.057s             |
| 64K  | 0.453s   | 0.086s             |
| 128K | 0.847s   | 0.089s             |
| 256K | 1.697s   | 0.089s             |
| 512K | 3.416s   | 0.094s             |
| 16MB | 110s (!) | 0.272s             |

Japanese characters:

| Size | Chardet  | Charset-normalizer |
|------|----------|--------------------|
| 4K   | 0.037s   | 0.172s             |
| 8K   | 0.074s   | 0.317s             |
| 16K  | 0.147s   | 0.314s             |
| 32K  | 0.290s   | 0.179s             |
| 64K  | 0.577s   | 0.399s             |
| 128K | 1.178s   | 0.448s             |
| 256K | 2.359s   | 0.389s             |
| 512K | 4.699s   | 0.324s             |
| 16MB | 150s (!) | 0.848s             |
