hunspell stumbles over copyright symbol
drahnr opened this issue
Describe the bug
Encountered error Utf8Error { valid_up_to: 2, error_len: Some(1) } returned from Hunspell_suggest(handle, ["\"\\xc2\\xa9\""]): 0: "\xc2\x80\x93"
[2022-09-16T10:03:50Z DEBUG hunspell] © --{suggest}--> []
To Reproduce
Steps to reproduce the behaviour:
- A file containing ©
- Run cargo spellcheck file.rs
Expected behavior
Handle or ignore the character. Currently hunspell-rs is hacked to print an error.
Please complete the following information:
- System: Fedora
- Obtained: cargo + git
- Version: 0.12.2 / git
CC @lopopolo, this was the issue at hand: the suggestion call should accept 0xC2 0xA9, since that byte sequence is valid UTF-8 by itself, but it returns garbage suggestions instead. It's still present in 0.12.2, but only as a verbose message, and will be handled in the next release.
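For reference, the numbers in the error can be reproduced with nothing but the standard library. This is a minimal sketch, not code from hunspell-rs; the helper name `check_utf8` is made up for illustration. It decodes the input bytes for © and then the bytes quoted in the error message.

```rust
use std::str;

/// Hypothetical helper: decode `bytes` as UTF-8 and, on failure, return the
/// same two fields that show up in the `Utf8Error` debug output.
fn check_utf8(bytes: &[u8]) -> Result<&str, (usize, Option<usize>)> {
    str::from_utf8(bytes).map_err(|e| (e.valid_up_to(), e.error_len()))
}

fn main() {
    // The input: U+00A9 COPYRIGHT SIGN is 0xC2 0xA9 in UTF-8 and decodes fine.
    assert_eq!(check_utf8(b"\xc2\xa9"), Ok("©"));

    // The bytes quoted in the error message: 0xC2 0x80 is a complete two-byte
    // sequence, then 0x93 is a stray continuation byte.
    assert_eq!(check_utf8(b"\xc2\x80\x93"), Err((2, Some(1))));
}
```

So the input is well-formed and the invalid bytes are in what comes back from the suggest call, which matches valid_up_to: 2, error_len: Some(1) in the report.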
Oh awesome. That's in the generated headers of the Unicode files. Great catch @drahnr and thanks for debugging!
@lopopolo I realized you have a custom dict with a - entry; removing the single-char items from the list resolves the issue. It seems that this trips the parser, makes its way into the LUT inside hunspell, and then surfaces as some byte sequences.
I just tried this workaround and that seemed to work! I get no warning messages in cargo-spellcheck 0.12.1.
The core issue is that Hunspell uses the encoding declared in the affix file for the dictionaries as well. For en_us.aff (both builtin and Fedora 36) this was latin-1 rather than utf-8. Hunspell then treats all inputs as being in the affix file's encoding, implicitly through the prefix tree it builds.
A solution would be to use e.g. encoding_rs to re-encode the dictionaries to UTF-8 and only afterwards feed them to Hunspell, or to reject every encoding other than utf-8.
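A rough sketch of what the re-encoding step could look like, assuming the affix file declares its encoding with a SET line such as SET ISO8859-1. This is not the cargo-spellcheck implementation: all function names are hypothetical, and it only handles latin-1 using the standard library (each latin-1 byte maps directly to the Unicode code point of the same value). A real fix would more likely go through encoding_rs, as suggested above, to cover other encodings, and would re-encode the .dic file the same way.

```rust
/// Decode ISO-8859-1 (latin-1) bytes into a UTF-8 `String`.
/// Every latin-1 byte equals the Unicode code point with the same value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Return the encoding named by the first `SET` directive, if there is one.
fn declared_encoding(aff: &str) -> Option<&str> {
    aff.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match parts.next() {
            Some("SET") => parts.next(),
            _ => None,
        }
    })
}

/// Re-encode a latin-1 affix file to UTF-8 and rewrite its `SET` line so the
/// declared encoding matches the new content.
fn reencode_aff(raw: &[u8]) -> String {
    let text = latin1_to_utf8(raw);
    text.lines()
        .map(|line| {
            if line.trim_start().starts_with("SET ") {
                "SET UTF-8"
            } else {
                line
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // 0xA9 is the copyright sign in latin-1.
    let raw = b"SET ISO8859-1\nTRY esian\xa9";
    let utf8 = reencode_aff(raw);
    assert_eq!(declared_encoding(&utf8), Some("UTF-8"));
    assert!(utf8.ends_with("©"));
}
```

The alternative from the comment above, rejecting everything that is not UTF-8, would amount to checking the result of `declared_encoding` on the original file and bailing out early.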