hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/


Copyright symbol input suggests invalid utf8 sequence

drahnr opened this issue

Versions 1.7.0 and 1.7.1 at least.

0xe2 0xa9 -> 0xc2 0x80 0x93

```
[2022-09-16T12:30:45Z DEBUG hunspell] txt --{suggest}--> ["text", "ext", "tit", "tat", "tot", "tut", "TX"]
Encountered error Utf8Error { valid_up_to: 2, error_len: Some(1) } returned from Hunspell_suggest(handle, ["\"\\xc2\\xa9\""]): 0: "\xc2\x80\x93"
[2022-09-16T12:30:45Z DEBUG hunspell] © --{suggest}--> []
```

Relevant issue: drahnr/cargo-spellcheck#281

Is this something that I could reproduce with standalone hunspell?

I nailed it down to some dash variants inside an extra dictionary. I have not created a test case just yet.

@caolanm minimal example:

```cpp
#include <hunspell/hunspell.hxx>
#include <cstdio>
#include <iostream>

int main()
{
    std::string word("©");
    printf(">>%s<<\n", word.c_str());
    Hunspell dict("/usr/share/hunspell/en_US.aff", "/usr/share/hunspell/en_US.dic");
    dict.add_dic("extra.dic", nullptr);

    if (!dict.spell(word)) {
        for (const auto& suggestion : dict.suggest(word)) {
            std::cout << "suggestion:" << suggestion << std::endl;
        }
    } else {
        printf("GOOD\n");
    }
    return 0;
}
```

with extra.dic containing the following (in case GitHub mangles it, the proper characters are in https://github.com/drahnr/cargo-spellcheck/blob/4dc6fe8756505202c39f2ec9c1bea33e4b138f64/src/tests.rs#L263-L264):

```
2
—
–
```

resulting in

```
>>©<<
suggestion:�
```

Yeah, the input is UTF-8, but en_US.aff has `SET ISO-8859-1` at the top, and the assumption that input arrives in the dictionary encoding is rather baked in (callers like LibreOffice convert to the dictionary encoding before spell checking). The input © is mangled by mkallsmall and it's all downhill after that.

It might be that the best route is to encourage dictionaries to use UTF-8 encoding. I should at least look into the Fedora en dicts, which are probably a bit stale anyway.

For Fedora at least I can bump to the latest wordlist release (https://github.com/en-wl/wordlist/releases/tag/rel-2020.12.07), where the en-US dictionary .aff is now `SET UTF-8` (https://koji.fedoraproject.org/koji/taskinfo?taskID=92217900), so UTF-8 input text already matches the dictionary encoding. That gives:

```
>>©<<
suggestion:e
suggestion:s
suggestion:i
suggestion:a
suggestion:n
suggestion:r
suggestion:t
suggestion:o
suggestion:l
suggestion:c
suggestion:d
suggestion:u
suggestion:g
suggestion:m
suggestion:p
```
for this example. I don't think there's anything else that we can do.

Documenting the assumption that the encoding of the .aff/.dic files must match the encoding of the provided input strings would be very helpful.