Copyright symbol input suggests invalid utf8 sequence
drahnr opened this issue
Versions 1.7.0 and 1.7.1 at least.
0xe2 0xa9 -> 0xc2 0x80 0x93
```
[2022-09-16T12:30:45Z DEBUG hunspell] txt --{suggest}--> ["text", "ext", "tit", "tat", "tot", "tut", "TX"]
Encountered error Utf8Error { valid_up_to: 2, error_len: Some(1) } returned from Hunspell_suggest(handle, ["\"\\xc2\\xa9\""]): 0: "\xc2\x80\x93"
[2022-09-16T12:30:45Z DEBUG hunspell] © --{suggest}--> []
```
Relevant issue: drahnr/cargo-spellcheck#281
is this something that I could reproduce with standalone hunspell?
I nailed it down to some dash variants inside an extra dictionary. I did not create a test case just yet.
@caolanm minimal example (with `<cstdio>` added for `printf`):

```cpp
#include <hunspell/hunspell.hxx>
#include <cstdio>
#include <iostream>
#include <string>

int main(void)
{
    std::string word("©");
    printf(">>%s<<\n", word.c_str());
    Hunspell dict("/usr/share/hunspell/en_US.aff", "/usr/share/hunspell/en_US.dic");
    dict.add_dic("extra.dic", nullptr);
    if (!dict.spell(word))
    {
        auto v = dict.suggest(word);
        for (auto iter = v.cbegin(); iter != v.cend(); ++iter) {
            std::cout << "suggestion:" << *iter << std::endl;
        }
    } else {
        printf("GOOD\n");
    }
    return 0;
}
```
with extra.dic
(in case gh changes this, the proper characters are in https://github.com/drahnr/cargo-spellcheck/blob/4dc6fe8756505202c39f2ec9c1bea33e4b138f64/src/tests.rs#L263-L264)
```
2
—
–
```
resulting in
```
>>©<<
suggestion:�
```
Yeah, the input is UTF-8, but en_US.aff has SET ISO-8859-1 at the top, and the assumption that input is in the dictionary encoding is rather baked in (callers like LibreOffice convert to the dictionary encoding before spell checking). The © input is mangled by mkallsmall, and it's all downhill from there.
The best route might be to encourage dictionaries to use UTF-8 encoding. I should at least look into the Fedora en dictionaries, which are probably a bit stale anyway.
For Fedora at least I can bump to the latest wordlist release (https://github.com/en-wl/wordlist/releases/tag/rel-2020.12.07), where the en-US dictionary .aff is now SET UTF-8 (https://koji.fedoraproject.org/koji/taskinfo?taskID=92217900). UTF-8 input text then already matches the dictionary encoding, which gives:
```
>>©<<
suggestion:e
suggestion:s
suggestion:i
suggestion:a
suggestion:n
suggestion:r
suggestion:t
suggestion:o
suggestion:l
suggestion:c
suggestion:d
suggestion:u
suggestion:g
suggestion:m
suggestion:p
```
for this example. I don't think there's anything else that we can do.
Documenting the assumption that the encoding of the aff/dic files must match the encoding of the input strings would be very helpful.