CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

File detected as Windows-1250, but is UTF-8

tobbi opened this issue

I'm using UTF.Unknown 2.3.0
The following file is detected as Windows-1250, but is UTF-8:

csv_test_correct_GZ.zip

Hello, @tobbi !

Thank you for the report.

Could you attach the file as plain text? Why did you choose zip? Are you passing the zip archive itself to the detector as input?

Sorry, my bad, it used to be a CSV file and GitHub wouldn't accept those. Here's the file with the extension changed to .txt:

csv_test_correct_GZ.txt

Thanks for clarifying.

At first glance, I think the result is normal. Why? The detection algorithm is statistical, so the more varied the input data, the more accurate the final result. Details can be found in the article "A composite approach to language/encoding detection".

But we should still try to improve the result :)
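
Because the detection is statistical, a single-byte charset can narrowly outscore UTF-8 on some inputs. One possible caller-side workaround, sketched here in Python purely for illustration, is to check whether the bytes are strictly valid UTF-8 before accepting the statistical guess. This is not part of UTF.Unknown and not necessarily how the library will address the issue; `is_valid_utf8` and `choose_encoding` are hypothetical helper names.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def choose_encoding(data: bytes, statistical_guess: str) -> str:
    """Hypothetical helper: prefer UTF-8 when the data contains non-ASCII
    bytes and is strictly valid UTF-8; otherwise keep the statistical guess."""
    has_non_ascii = any(b >= 0x80 for b in data)
    if has_non_ascii and is_valid_utf8(data):
        return "utf-8"
    return statistical_guess


# The same Polish sentence, once encoded as UTF-8 and once as Windows-1250.
utf8_sample = "Zażółć gęślą jaźń".encode("utf-8")
legacy_sample = "Zażółć gęślą jaźń".encode("cp1250")

print(choose_encoding(utf8_sample, "windows-1250"))
print(choose_encoding(legacy_sample, "windows-1250"))
```

The check relies on the fact that text in a legacy single-byte encoding rarely forms well-formed UTF-8 multi-byte sequences by accident, so a strict decode that succeeds on non-ASCII input is strong evidence for UTF-8. Pure ASCII input is left to the statistical guess, since it is valid in all of these encodings.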


Status Logs:

SBCS: Detected windows-1250 with confidence of 0.7738685

Get confidence:
-- new match found: confidence 0.01, index 0, charset windows-1251.
-- new match found: confidence 0.18598664, index 6, charset iso-8859-7.
-- new match found: confidence 0.7133932, index 15, charset iso-8859-1.
-- new match found: confidence 0.71340704, index 18, charset iso-8859-1.
-- new match found: confidence 0.76677626, index 23, charset iso-8859-1.
-- new match found: confidence 0.7738685, index 86, charset windows-1250.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.01: [windows-1251]
SBCS: 0.01 [windows-1251]

SBCS 0.01: [koi8-r]
SBCS: 0.01 [koi8-r]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.01: [x-mac-cyrillic]
SBCS: 0.01 [x-mac-cyrillic]

SBCS 0.01: [ibm866]
SBCS: 0.01 [ibm866]

SBCS 0.01: [ibm855]
SBCS: 0.01 [ibm855]

SBCS 0.18598664: [iso-8859-7]
SBCS: 0.1859866 [iso-8859-7]

SBCS 0.18598664: [windows-1253]
SBCS: 0.1859866 [windows-1253]

SBCS 0: [iso-8859-5]
SBCS: 0.00 [iso-8859-5]

SBCS 0.01: [windows-1251]
SBCS: 0.01 [windows-1251]

SBCS 0: [windows-1255]
HEB: 0 - 0 [Logical-Visual score]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0: [windows-1255]
SBCS: 0.00 [windows-1255]

SBCS 0.09991017: [tis-620]
SBCS: 0.09991017 [tis-620]

SBCS 0.09991017: [iso-8859-11]
SBCS: 0.09991017 [iso-8859-11]

SBCS 0.7133932: [iso-8859-1]
SBCS: 0.7133932 [iso-8859-1]

SBCS 0.6674997: [iso-8859-15]
SBCS: 0.6674997 [iso-8859-15]

SBCS 0.7133932: [windows-1252]
SBCS: 0.7133932 [windows-1252]

SBCS 0.71340704: [iso-8859-1]
SBCS: 0.713407 [iso-8859-1]

SBCS 0.67082536: [iso-8859-15]
SBCS: 0.6708254 [iso-8859-15]

SBCS 0.71340704: [windows-1252]
SBCS: 0.713407 [windows-1252]

SBCS 0.6861101: [iso-8859-2]
SBCS: 0.6861101 [iso-8859-2]

SBCS 0.6861101: [windows-1250]
SBCS: 0.6861101 [windows-1250]

SBCS 0.76677626: [iso-8859-1]
SBCS: 0.7667763 [iso-8859-1]

SBCS 0.76677626: [windows-1252]
SBCS: 0.7667763 [windows-1252]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.717128: [iso-8859-9]
SBCS: 0.717128 [iso-8859-9]

SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256]
SBCS: 0.00 [windows-1256]

SBCS 0.40016073: [viscii]
SBCS: 0.4001607 [viscii]

SBCS 0.44124976: [windows-1258]
SBCS: 0.4412498 [windows-1258]

SBCS 0.71854687: [iso-8859-15]
SBCS: 0.7185469 [iso-8859-15]

SBCS 0.7641578: [iso-8859-1]
SBCS: 0.7641578 [iso-8859-1]

SBCS 0.7641578: [windows-1252]
SBCS: 0.7641578 [windows-1252]

SBCS 0.71640146: [iso-8859-13]
SBCS: 0.7164015 [iso-8859-13]

SBCS 0.6377162: [iso-8859-10]
SBCS: 0.6377162 [iso-8859-10]

SBCS 0.6736411: [iso-8859-4]
SBCS: 0.6736411 [iso-8859-4]

SBCS 0.71818155: [iso-8859-13]
SBCS: 0.7181816 [iso-8859-13]

SBCS 0.6363546: [iso-8859-10]
SBCS: 0.6363546 [iso-8859-10]

SBCS 0.6753149: [iso-8859-4]
SBCS: 0.6753149 [iso-8859-4]

SBCS 0.666065: [iso-8859-1]
SBCS: 0.666065 [iso-8859-1]

SBCS 0.666065: [iso-8859-9]
SBCS: 0.666065 [iso-8859-9]

SBCS 0.62630904: [iso-8859-15]
SBCS: 0.626309 [iso-8859-15]

SBCS 0.666065: [windows-1252]
SBCS: 0.666065 [windows-1252]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.6366351: [windows-1250]
SBCS: 0.6366351 [windows-1250]

SBCS 0.6366351: [iso-8859-2]
SBCS: 0.6366351 [iso-8859-2]

SBCS 0.72143143: [x-mac-ce]
SBCS: 0.7214314 [x-mac-ce]

SBCS 0.72143143: [ibm852]
SBCS: 0.7214314 [ibm852]

SBCS 0.6434225: [windows-1250]
SBCS: 0.6434225 [windows-1250]

SBCS 0.64008415: [iso-8859-2]
SBCS: 0.6400841 [iso-8859-2]

SBCS 0.7291228: [x-mac-ce]
SBCS: 0.7291228 [x-mac-ce]

SBCS 0.7253399: [ibm852]
SBCS: 0.7253399 [ibm852]

SBCS 0.58494663: [windows-1250]
SBCS: 0.5849466 [windows-1250]

SBCS 0.5881849: [iso-8859-2]
SBCS: 0.5881849 [iso-8859-2]

SBCS 0.61615247: [iso-8859-13]
SBCS: 0.6161525 [iso-8859-13]

SBCS 0.58494663: [iso-8859-16]
SBCS: 0.5849466 [iso-8859-16]

SBCS 0.66285837: [x-mac-ce]
SBCS: 0.6628584 [x-mac-ce]

SBCS 0.65958494: [ibm852]
SBCS: 0.6595849 [ibm852]

SBCS 0.7628341: [iso-8859-1]
SBCS: 0.7628341 [iso-8859-1]

SBCS 0.71730226: [iso-8859-4]
SBCS: 0.7173023 [iso-8859-4]

SBCS 0.71730226: [iso-8859-9]
SBCS: 0.7173023 [iso-8859-9]

SBCS 0.7628341: [iso-8859-13]
SBCS: 0.7628341 [iso-8859-13]

SBCS 0.71730226: [iso-8859-15]
SBCS: 0.7173023 [iso-8859-15]

SBCS 0.7628341: [windows-1252]
SBCS: 0.7628341 [windows-1252]

SBCS 0.76252055: [iso-8859-1]
SBCS: 0.7625206 [iso-8859-1]

SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.76252055: [iso-8859-9]
SBCS: 0.7625206 [iso-8859-9]

SBCS 0.71700746: [iso-8859-15]
SBCS: 0.7170075 [iso-8859-15]

SBCS 0.76252055: [windows-1252]
SBCS: 0.7625206 [windows-1252]

SBCS 0.6695262: [windows-1250]
SBCS: 0.6695262 [windows-1250]

SBCS 0.6695262: [iso-8859-2]
SBCS: 0.6695262 [iso-8859-2]

SBCS 0.7052443: [iso-8859-13]
SBCS: 0.7052443 [iso-8859-13]

SBCS 0.6695262: [iso-8859-16]
SBCS: 0.6695262 [iso-8859-16]

SBCS 0.7587035: [x-mac-ce]
SBCS: 0.7587035 [x-mac-ce]

SBCS 0.7587035: [ibm852]
SBCS: 0.7587035 [ibm852]

SBCS 0.76380235: [windows-1252]
SBCS: 0.7638023 [windows-1252]

SBCS 0.76380235: [windows-1257]
SBCS: 0.7638023 [windows-1257]

SBCS 0.71821266: [iso-8859-4]
SBCS: 0.7182127 [iso-8859-4]

SBCS 0.76380235: [iso-8859-13]
SBCS: 0.7638023 [iso-8859-13]

SBCS 0.71821266: [iso-8859-15]
SBCS: 0.7182127 [iso-8859-15]

SBCS 0.6575037: [iso-8859-1]
SBCS: 0.6575037 [iso-8859-1]

SBCS 0.6575037: [iso-8859-9]
SBCS: 0.6575037 [iso-8859-9]

SBCS 0.61825883: [iso-8859-15]
SBCS: 0.6182588 [iso-8859-15]

SBCS 0.6575037: [windows-1252]
SBCS: 0.6575037 [windows-1252]

SBCS 0.7738685: [windows-1250]
SBCS: 0.7738685 [windows-1250]

SBCS 0.7738685: [iso-8859-2]
SBCS: 0.7738685 [iso-8859-2]

SBCS 0.7738685: [iso-8859-16]
SBCS: 0.7738685 [iso-8859-16]

SBCS 0.75962406: [ibm852]
SBCS: 0.7596241 [ibm852]

SBCS 0.66994256: [windows-1250]
SBCS: 0.6699426 [windows-1250]

SBCS 0.66994256: [iso-8859-2]
SBCS: 0.6699426 [iso-8859-2]

SBCS 0.66994256: [iso-8859-16]
SBCS: 0.6699426 [iso-8859-16]

SBCS 0.75917524: [x-mac-ce]
SBCS: 0.7591752 [x-mac-ce]

SBCS 0.75917524: [ibm852]
SBCS: 0.7591752 [ibm852]

SBCS 0.76376295: [iso-8859-1]
SBCS: 0.763763 [iso-8859-1]

SBCS 0.7181756: [iso-8859-4]
SBCS: 0.7181756 [iso-8859-4]

SBCS 0.76376295: [iso-8859-9]
SBCS: 0.763763 [iso-8859-9]

SBCS 0.7181756: [iso-8859-15]
SBCS: 0.7181756 [iso-8859-15]

SBCS 0.76376295: [windows-1252]
SBCS: 0.763763 [windows-1252]

SBCS Group found best match [windows-1250] confidence 0.7738685.

MBCS: Detected utf-8 with confidence of 0.7525

Get confidence:
-- new match found: confidence 0.7525, index 0, charset utf-8.
Get confidence done.
MBCS Group Prober --------begin status
MBCS 0.7525: [utf-8]

MBCS 0.01: [shift-jis]

MBCS 0.01: [euc-jp]

MBCS 0.01: [gb18030]

MBCS 0.01: [euc-kr]

MBCS 0.01: [cp949]

MBCS 0.01: [big5]

MBCS inactive: euc-tw (i.e. confidence is too low).
MBCS Group found best match [utf-8] confidence 0.7525.

Latin1Prober: Detected windows-1252 with confidence of 0.43269232

Latin1Prober: 0.43269232 [windows-1252]
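
To make the outcome in the logs above easier to see, the following Python sketch parses the three "Detected ... with confidence of ..." summary lines (copied verbatim from the logs) and compares them. It assumes, which this thread does not confirm, that the final answer is simply the candidate with the highest confidence; `parse_summary` is a hypothetical helper name.

```python
import re

# Summary lines copied from the status logs in this issue.
SUMMARY_LINES = [
    "SBCS: Detected windows-1250 with confidence of 0.7738685",
    "MBCS: Detected utf-8 with confidence of 0.7525",
    "Latin1Prober: Detected windows-1252 with confidence of 0.43269232",
]

PATTERN = re.compile(
    r"^(?P<prober>\w+): Detected (?P<charset>\S+) "
    r"with confidence of (?P<conf>[0-9.]+)$"
)


def parse_summary(line: str) -> tuple[str, str, float]:
    """Split one summary line into (prober, charset, confidence)."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    return match["prober"], match["charset"], float(match["conf"])


# Sort candidates from highest to lowest confidence.
results = sorted(
    (parse_summary(line) for line in SUMMARY_LINES),
    key=lambda r: r[2],
    reverse=True,
)
best, runner_up = results[0], results[1]
margin = best[2] - runner_up[2]

print(f"best: {best[1]} ({best[0]}), runner-up: {runner_up[1]} ({runner_up[0]})")
print(f"margin: {margin:.4f}")
```

Under that assumption, windows-1250 wins over utf-8 by a margin of only about 0.02, which is consistent with the reported misdetection being a near tie rather than a clear preference.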