SingleByteCharSetProber.Reset() does not correctly reset

adimosh opened this issue 2 years ago

Background

After forking this repository to customize the implementation for my own needs, I tried to cache the prober instances, relying on the Reset method to make sure they are "clean" between runs. I found the bug while running tests for this scenario.

What happens

When the same probers are run multiple times, certain previously recognized charsets report a higher confidence than they should, and can therefore overtake the confidence of the correct encoding's prober.

The cause is the Reset() method of the SingleByteCharSetProber class. It resets state, lastOrder, seqCounters, totalSeqs, totalChar and freqChar to their default values.

It does not, however, reset ctrlChar. The stale count carries over into the next run, so confidence grows slowly with each use of the prober.
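A minimal sketch of the effect, not the library's code: ToyProber, its resetCtrlChar flag and its confidence formula are all made up for illustration. It only shows that a counter skipped by Reset() makes a reused instance report a different confidence than a fresh one would; the direction and size of the drift in the real prober depend on its actual formula, which is not reproduced here.

```csharp
using System;

// Hypothetical, simplified stand-in for a single-byte prober.
// This is NOT SingleByteCharSetProber; names and formula are assumptions.
class ToyProber
{
    private int totalChar;
    private int ctrlChar;
    private readonly bool resetCtrlChar;

    public ToyProber(bool resetCtrlChar)
    {
        this.resetCtrlChar = resetCtrlChar;
    }

    public void Feed(byte[] buffer)
    {
        foreach (byte b in buffer)
        {
            totalChar++;
            if (b < 0x20)
            {
                ctrlChar++; // count control bytes
            }
        }
    }

    public void Reset()
    {
        totalChar = 0;
        if (resetCtrlChar)
        {
            ctrlChar = 0; // the line the issue proposes to add
        }
    }

    // Made-up formula: share of fed bytes that were not counted as control bytes.
    public float GetConfidence()
    {
        return totalChar == 0 ? 0f : (float)(totalChar - ctrlChar) / totalChar;
    }
}

static class Demo
{
    // Feeds two inputs through one instance, calling Reset() in between,
    // and returns the confidence reported for the second input.
    public static float ConfidenceAfterReuse(bool resetCtrlChar)
    {
        var prober = new ToyProber(resetCtrlChar);
        prober.Feed(new byte[] { 0x41, 0x0A, 0x42, 0x0A }); // first run: two control bytes
        prober.Reset();
        prober.Feed(new byte[] { 0x41, 0x42, 0x43, 0x44 }); // second run: no control bytes
        return prober.GetConfidence();
    }

    static void Main()
    {
        // A fresh instance fed only the second input would report 1.0 in this toy model.
        Console.WriteLine(ConfidenceAfterReuse(false)); // stale ctrlChar skews the result
        Console.WriteLine(ConfidenceAfterReuse(true));  // matches the fresh-instance value
    }
}
```

Comparing a reused instance against a fresh one on the same input, as ConfidenceAfterReuse does here, is also a convenient shape for a regression test of Reset().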

Proposed solution

Add the line:

ctrlChar = 0;

...anywhere in the Reset method (for example, on line 204 of the /src/Core/Probers/SingleByteCharSetProber.cs file).

Conclusion

The observed behaviour was that Windows-1250 and Windows-1252 were recognized significantly more often than any other charset.

With this fix in place, probers can be cached and reused, resulting in significantly fewer allocations and fewer recognition bugs.
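The caching idea can be sketched as a small pool that resets each instance when it is handed back. Everything here is hypothetical (IResettable, ToyCounter, ReusePool and the Allocations counter are illustration names, not part of the library), and the pool is deliberately not thread-safe. It only works if Reset() really restores every field, which is the point of this issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract: anything that can be returned to a clean state.
interface IResettable
{
    void Reset();
}

// Hypothetical stand-in for a prober; it only counts the bytes it was fed.
sealed class ToyCounter : IResettable
{
    public int Count { get; private set; }

    public void Feed(byte[] buffer)
    {
        Count += buffer.Length;
    }

    public void Reset()
    {
        Count = 0;
    }
}

// Minimal, non-thread-safe pool: instances are reset when returned,
// so the next Rent() hands out a clean object without allocating.
sealed class ReusePool<T> where T : class, IResettable
{
    private readonly Stack<T> idle = new Stack<T>();
    private readonly Func<T> factory;

    public ReusePool(Func<T> factory)
    {
        this.factory = factory;
    }

    // Number of times the factory had to create a new instance.
    public int Allocations { get; private set; }

    public T Rent()
    {
        if (idle.Count > 0)
        {
            return idle.Pop();
        }
        Allocations++;
        return factory();
    }

    public void Return(T item)
    {
        item.Reset();
        idle.Push(item);
    }
}

static class PoolDemo
{
    static void Main()
    {
        var pool = new ReusePool<ToyCounter>(() => new ToyCounter());
        for (int run = 0; run < 3; run++)
        {
            ToyCounter counter = pool.Rent();
            counter.Feed(new byte[] { 1, 2, 3 });
            Console.WriteLine(counter.Count); // 3 on every run, because Reset() cleared the count
            pool.Return(counter);
        }
        Console.WriteLine(pool.Allocations); // 1: the same instance served all three runs
    }
}
```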

It is entirely possible that this is the cause of a few of the currently reported issues, such as:

#108
#38

Julian Verdurmen · Answer 1 · Fri Feb 04 2022 07:00:40 GMT+0800 (China Standard Time)

Could you please send a PR? (E.g. just edit the file on GitHub and propose the pull request.)