CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

SingleByteCharSetProber.Reset() does not correctly reset

adimosh opened this issue

Background

After forking this repository to customize the implementation for my own needs, I tried to cache the prober instances, relying on the Reset method to make sure they're "clean" between runs. I found the bug while running tests for this scenario.

What happens

When the same prober instances are run multiple times, charsets recognized in earlier runs tend to report a higher confidence in later runs and can therefore overtake the confidence of the correct encoding's prober.

This is a result of the Reset() method of the SingleByteCharSetProber class, which resets state, lastOrder, seqCounters, totalSeqs, totalChar and freqChar back to their default values.

It does not, however, also reset ctrlChar to its default value. Confidence, therefore, grows slowly with each use of the prober.
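The drift can be reproduced in isolation. The following is a minimal sketch, not the library's code: `ToyProber`, its `Feed` method and its confidence formula are invented for illustration. Only the shape of the bug (a `Reset()` that clears every counter except `ctrlChar`) follows the description above.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a single-byte prober; NOT the library's class.
public sealed class ToyProber
{
    private int totalChar;
    private int freqChar;
    private int ctrlChar;

    // Counts every byte, control bytes (< 0x20), and lowercase ASCII
    // letters, which this toy treats as "frequent" characters.
    public void Feed(byte[] buffer)
    {
        foreach (byte b in buffer)
        {
            totalChar++;
            if (b < 0x20) ctrlChar++;
            else if (b >= (byte)'a' && b <= (byte)'z') freqChar++;
        }
    }

    // Invented formula: share of frequent letters among non-control bytes.
    public float GetConfidence()
    {
        int printable = Math.Max(1, totalChar - ctrlChar);
        return Math.Min(1.0f, (float)freqChar / printable);
    }

    // Mirrors the reported bug: ctrlChar is left untouched.
    public void Reset()
    {
        totalChar = 0;
        freqChar = 0;
    }
}

public static class DriftDemo
{
    // Returns the confidence of a first run and of a second run after Reset().
    public static (float First, float Second) Run()
    {
        byte[] input = Encoding.ASCII.GetBytes("ab 1\n\n");
        var prober = new ToyProber();

        prober.Feed(input);
        float first = prober.GetConfidence();

        prober.Reset();
        prober.Feed(input);
        float second = prober.GetConfidence();

        return (first, second);
    }
}
```

With this toy formula, `DriftDemo.Run()` yields 0.5 for the first run and 1.0 for the second, although both runs see identical input: the two control bytes counted in the first run are still included in `ctrlChar` during the second.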

Proposed solution

Add the line:

ctrlChar = 0;

...anywhere in the Reset method (possibly on line 204 of the /src/Core/Probers/SingleByteCharSetProber.cs file, for example).
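As a sketch of where the line would go: the class below is a hypothetical, cut-down stand-in that declares only the fields named in this issue. The field types, the `ProbingState` enum, the array size and the default values are assumptions and are not copied from `SingleByteCharSetProber.cs`.

```csharp
using System;

// Assumed state enum; the real one may have different members.
public enum ProbingState { Detecting, FoundIt, NotMe }

// Hypothetical cut-down prober holding only the fields named in the issue.
public sealed class ProberStateSketch
{
    private const int NumberOfSeqCategories = 4; // assumed size

    public ProbingState state = ProbingState.Detecting;
    public byte lastOrder = 255; // assumed "no previous character" marker
    public int[] seqCounters = new int[NumberOfSeqCategories];
    public int totalSeqs, totalChar, freqChar, ctrlChar;

    public void Reset()
    {
        state = ProbingState.Detecting;
        lastOrder = 255;
        Array.Clear(seqCounters, 0, seqCounters.Length);
        totalSeqs = 0;
        totalChar = 0;
        freqChar = 0;
        ctrlChar = 0; // the proposed addition
    }
}
```

A regression test for the real class could follow the same idea: feed a prober some input, call `Reset()`, feed it a second input, and check that it reports the same confidence as a freshly constructed prober given only the second input.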

Conclusion

The observed behaviour was that Windows-1250 and Windows-1252 were detected significantly more often than any other charset.

Once ctrlChar is reset as well, probers can be cached and reused, resulting in significantly fewer allocations and fewer recognition bugs.

It is entirely possible that this might be the cause of a few of the issues currently outlined, like:

Could you please send a PR? (E.g. just edit the file on GitHub and propose the pull request.)