CharsetDetector / UTF-unknown

Character set detector built in C# for .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

Potentially wrong detection of UTF8

belav opened this issue · comments

With the following file contents

public enum MeetingLocation
{
    Café,
    Restaurant
}

When the file is saved as UTF-8, I get the following detections for encoding.

encoding        confidence
windows-1250    0.7516818
utf-8           0.505
windows-1252    0.3846154

When the file is saved as UTF-8 with BOM, I get just a single detection.

encoding        confidence
utf-8           1
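The BOM case is unambiguous because a UTF-8 BOM is the fixed byte sequence EF BB BF at the start of the file, so a detector can report utf-8 with full confidence without any statistical guessing. A small Python sketch of the byte-level behavior (the thread's code is C#, but the bytes are the same):

```python
# A UTF-8 BOM is the fixed byte sequence EF BB BF at the start
# of a file. Python's "utf-8-sig" codec writes and strips it.
data = "Café".encode("utf-8-sig")
assert data[:3] == b"\xef\xbb\xbf"         # the BOM itself
assert data.decode("utf-8-sig") == "Café"  # round-trips cleanly
```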

Hello, could you clarify what result you want to get? And why?

Let me preface this with the fact that my knowledge of file encodings is fairly limited. I also originally implemented the code to deal with encodings in csharpier around two years ago and have forgotten exactly what problem led me to using UTF-unknown.

I'm using UTF-unknown to detect file encodings so I can read in the file contents properly.
With a file that has the following content

public enum MeetingLocation
{
    Café,
    Restaurant
}

If I have the file saved as UTF-8, then UTF-unknown gives me the following detections

detection.EncodingName   detection.Encoding                                 detection.Confidence
windows-1250             System.Text.SBCSCodePageEncoding                   0.7516818
utf-8                    System.Text.UTF8Encoding+UTF8EncodingSealed        0.505
windows-1252             System.Text.SBCSCodePageEncoding                   0.3846154

If I read the file contents using System.Text.SBCSCodePageEncoding (windows-1250), I get the following content, which is invalid C# that I am unable to parse into a SyntaxTree

public enum MeetingLocation
{
    CafĂ©,
    Restaurant
}
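The corruption comes from decoding UTF-8 bytes with a single-byte code page: in UTF-8, é is the two-byte sequence C3 A9, and windows-1250 maps those two bytes to two separate characters. A Python illustration of the same mix-up:

```python
raw = "Café".encode("utf-8")  # b'Caf\xc3\xa9'
# Decoding those UTF-8 bytes as windows-1250 splits the
# two-byte é (C3 A9) into two unrelated characters:
assert raw.decode("cp1250") == "CafĂ©"
```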

I wasn't aware of the multiple detections until today and was just using detectionResult.Detected.Encoding to read the file. I was thinking this might be an issue with UTF-unknown not properly detecting this file as UTF-8, and wanted to check whether that was the case before looking into other possible solutions, such as trying to read the file with the other detected encodings when more than one is returned.
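One way to use multiple detections, sketched here in Python rather than C# (the helper name and candidate list are made up for illustration): try each candidate encoding strictly, UTF-8 first, since bytes that happen to form valid multi-byte UTF-8 sequences are rarely anything else.

```python
def decode_with_candidates(raw: bytes,
                           candidates=("utf-8", "cp1250", "cp1252")):
    """Try each candidate encoding strictly, in order.
    Strict UTF-8 goes first: single-byte text almost never
    forms valid multi-byte UTF-8 sequences by accident, so a
    clean UTF-8 decode is a strong signal."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"  # latin-1 never fails

text, used = decode_with_candidates("Café".encode("utf-8"))
# used is "utf-8"; a cp1250-encoded é (0xE9) would instead
# fail the strict UTF-8 decode and fall through to cp1250.
```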

I did try replacing the content of tests/Data/utf-8/1.txt with the enum code and running the tests, which resulted in the following error.

Charset detection failed for C:\projects\UTF-unknown\Tests\Data\utf-8\1.txt. Expected: utf-8, detected: windows-1250 (75.16818% confidence)
  Expected string length 5 but was 12. Strings differ at index 0.
  Expected: "utf-8", ignoring case
  But was:  "windows-1250"

If I add the enum before or after the existing text in 1.txt then the test passes.

The library uses a heuristic approach to detecting encodings. The less input data there is, the more likely an error will be made. You can read more about it here: https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
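To make the heuristic point concrete: with only a few bytes of non-ASCII data, the same input decodes without error under several encodings, so the detector can only rank them statistically. A Python sketch:

```python
raw = "Café".encode("utf-8")  # just five bytes: 43 61 66 C3 A9
# Both decodes below succeed without error, so a detector must
# guess from character statistics, and with this little data
# the statistics are weak:
assert raw.decode("utf-8") == "Café"
assert raw.decode("cp1252") == "CafÃ©"  # also "valid", just wrong
```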

Ah okay, that makes sense. I can make use of the multiple detections the library gives me and test each one. Thanks!