hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/


Latest release breaks spell checking for Korean

mike-fabian opened this issue

See also: https://bugzilla.redhat.com/show_bug.cgi?id=2158548

Test file:

korean.txt

[mfabian@fedora ~]$ cat /etc/fedora-release 
Fedora release 38 (Rawhide)
[mfabian@fedora ~]$ rpm -q hunspell
hunspell-1.7.2-1.fc38.x86_64
[mfabian@fedora ~]$ hunspell -a -d ko_KR korean.txt
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.2)
# 안녕하세이 0

[mfabian@fedora ~]$

That is wrong: the # line means hunspell found no suggestions at all for the misspelled word.

Downgrading to hunspell-1.7.1-1.fc38.x86_64 fixes the problem.

[mfabian@fedora ~]$ cat /etc/fedora-release 
Fedora release 38 (Rawhide)
[mfabian@fedora ~]$ rpm -q hunspell
hunspell-1.7.1-1.fc38.x86_64
[mfabian@fedora ~]$ hunspell -a -d ko_KR korean.txt
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.1)
& 안녕하세이 2 0: 안녕하세요, 안녕하여있다

[mfabian@fedora ~]$
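
For reading the two transcripts: in the ispell-compatible pipe output that hunspell -a prints, a line starting with # reports a word that was not found and has no suggestions, while a line starting with & reports a miss followed by the suggestion count, the offset, and the suggestions. Below is a minimal Python sketch of a parser for just these two line types. The function name parse_result_line is made up for illustration and is not part of hunspell.

```python
def parse_result_line(line):
    """Parse a '#' or '&' result line from `hunspell -a` output.

    Returns (word, offset, suggestions), or None for any other line type
    (for example the version banner or an empty line).
    """
    line = line.strip()
    if line.startswith("# "):
        # "# <word> <offset>": not found, no suggestions.
        _, word, offset = line.split(" ")
        return word, int(offset), []
    if line.startswith("& "):
        # "& <word> <count> <offset>: <s1>, <s2>, ..."
        head, _, tail = line.partition(": ")
        _, word, count, offset = head.split(" ")
        suggestions = tail.split(", ")
        if len(suggestions) != int(count):
            raise ValueError("suggestion count does not match the header")
        return word, int(offset), suggestions
    return None
```

Fed the result lines above, the 1.7.2 line yields an empty suggestion list and the 1.7.1 line yields two suggestions.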

Bisected to:

commit 05e44e0 (HEAD)
Author: Caolán McNamara caolanm@redhat.com
Date: Thu Sep 1 13:46:40 2022 +0100

Check word limit (#813)

* check against hentry blen max

* don't leak in the case of an aff parse error

The issue is a word of byte length 519, 김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이들, at line 101398 of the .dic.

blen is an unsigned char, so it can hold at most 255. The word is longer than that in UTF-8 bytes, so it is now correctly detected as not insertable; loading errors out and the entire dictionary is discarded.
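
The arithmetic behind this can be sketched in Python. This is an illustration only, not hunspell's code: all function names are hypothetical, and the NFD helper rests on the assumption that the dictionary stores Hangul as decomposed jamo, which would make the stored byte length much larger than that of the precomposed syllables.

```python
import unicodedata

UCHAR_MAX = 255  # largest value an 8-bit unsigned char can hold


def utf8_len(word):
    """Number of bytes the word occupies when encoded as UTF-8."""
    return len(word.encode("utf-8"))


def nfd_utf8_len(word):
    """UTF-8 byte length after canonical decomposition (NFD).

    Assumption: the .dic stores decomposed jamo rather than
    precomposed Hangul syllables.
    """
    return utf8_len(unicodedata.normalize("NFD", word))


def fits_in_unsigned_char(byte_len):
    """True if the length can be stored in an unsigned char without loss."""
    return 0 <= byte_len <= UCHAR_MAX


def wrap_to_unsigned_char(byte_len):
    """Value an unchecked store into an 8-bit unsigned char would keep."""
    return byte_len % (UCHAR_MAX + 1)
```

A length of 519 fails fits_in_unsigned_char, and without the new check it would have been stored as 519 mod 256 = 7, which is why rejecting the entry is the correct behaviour even though it exposes the problem in the dictionary.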

The options are: leave it as is, in which case hunspell-ko has to remove the long entries to work; silently drop the entry instead of flagging an error; or make blen a bigger type.

Let's try making blen (and clen) unsigned short as the first port of call.
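
The three options can be compared with a toy loader in Python. This is a sketch under stated assumptions, not hunspell's implementation: load_entries, DictionaryLoadError, and the skip_too_long flag are hypothetical names, and USHORT_MAX assumes a 16-bit unsigned short.

```python
UCHAR_MAX = 255     # 8-bit unsigned char
USHORT_MAX = 65535  # assumes a 16-bit unsigned short


class DictionaryLoadError(Exception):
    """Raised when an entry cannot be stored and the whole load is abandoned."""


def load_entries(entries, max_blen=UCHAR_MAX, skip_too_long=False):
    """Toy model of loading dictionary words with a length-field limit.

    max_blen      -- largest byte length the length field can represent
    skip_too_long -- silently drop over-long entries instead of failing
    """
    table = []
    for word in entries:
        blen = len(word.encode("utf-8"))
        if blen > max_blen:
            if skip_too_long:
                continue  # option 2: drop the entry, keep the dictionary
            # option 1: flag an error, so the whole dictionary is discarded
            raise DictionaryLoadError(f"entry of {blen} bytes exceeds {max_blen}")
        table.append(word)
    return table
```

With the defaults, one 519-byte entry makes the whole load fail, as in 1.7.2. Passing skip_too_long=True keeps every other word, and passing max_blen=USHORT_MAX (option 3, the one chosen here) keeps the long entry as well.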

I think this warrants a 1.7.3 :)

Even with this patch applied, some parts of hunspell-ko's tests fail. See https://ci.debian.net/data/autopkgtest/unstable/amd64/h/hunspell-dict-ko/30142014/log.gz

make -C tests test DICT=/usr/share/hunspell/ko
make[1]: Entering directory '/tmp/autopkgtest-lxc.1yt2gr__/downtmp/build.BXh/src/tests'
echo | hunspell -d /usr/share/hunspell/ko | head -1
Hunspell 1.7.2 - hunspell-dict-ko 0.7.92 (requires Hunspell 1.3.1) https://spellcheck-ko.github.io/
python3 checkhunspellversion.py
Testing 001-pos-dependent-inflection.test...
Testing 002-irregular-inflection.test...
Testing 003-abbreviated-inflection.test...
Testing 004-compound-removing-rieul.test...
Testing 005-vowel-harmony.test...
Testing 006-descriptive-josa.test...
Testing 007-noun-suffix-and-josa.test...
Testing 008-dependent-josa.test...
Testing 009-auxiliary-verb.test...
009-auxiliary-verb.test:17: Y 사과인듯하네: & 사과인듯하네 15 0: 사과인 듯하네, 사과인듯하네, 사관인듯하네, 사과인듯하나, 사과인듯하니, 사기인듯하네, 사고인듯하네, 사구인듯하네, 사교인듯하네, 사과인듯하냐, 사계인듯하네, 사과인듯하게, 사과인듯하데, 다과인듯하네, 사과일듯하네
009-auxiliary-verb.test:18: Y 선생님이신듯하고: & 선생님이신듯하고 1 0: 선생님이신 듯하고
Testing 010-jamo-swap.test...
Testing 011-abbreviated-verb-suggestion.test...
Testing 012-wrong-inflection.test...
Testing 013-yeoncheol-buncheol.test...
Testing 014-sai-sios.test...
Testing 015-beginning-sound-rule.test...
Testing 016-numbers.test...
make[1]: *** [Makefile:9: test] Error 1
make: *** [Makefile:53: hosttest] Error 2
make[1]: Leaving directory '/tmp/autopkgtest-lxc.1yt2gr__/downtmp/build.BXh/src/tests'
/tmp/autopkgtest-lxc.1yt2gr__/downtmp/wrapper.sh: Killing leaked background processes: 1525 
    PID TTY      STAT   TIME COMMAND
   1525 ?        R      0:01 hunspell -i UTF-8 -d /usr/share/hunspell/ko
autopkgtest [14:40:59]: test command1: -----------------------]
command1             FAIL non-zero exit status 2
autopkgtest [14:40:59]: test command1:  - - - - - - - - - - results - - - - - - - - - -
autopkgtest [14:40:59]: test command1:  - - - - - - - - - - stderr - - - - - - - - - -
009-auxiliary-verb.test:17: Y 사과인듯하네: & 사과인듯하네 15 0: 사과인 듯하네, 사과인듯하네, 사관인듯하네, 사과인듯하나, 사과인듯하니, 사기인듯하네, 사고인듯하네, 사구인듯하네, 사교인듯하네, 사과인듯하냐, 사계인듯하네, 사과인듯하게, 사과인듯하데, 다과인듯하네, 사과일듯하네
009-auxiliary-verb.test:18: Y 선생님이신듯하고: & 선생님이신듯하고 1 0: 선생님이신 듯하고
make[1]: *** [Makefile:9: test] Error 1
make: *** [Makefile:53: hosttest] Error 2

That long string, 김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이들, is just a kind of Easter egg. It's okay to remove it.

References:
spellcheck-ko/hunspell-dict-ko#50 (comment)
https://github.com/spellcheck-ko/hunspell-dict-ko/blob/master/dict-ko-builtins.yaml#L208
https://forum.wordreference.com/threads/%EA%B9%80%EC%88%98%ED%95%9C%EB%AC%B4-%EA%B1%B0%EB%B6%81%EC%9D%B4%EC%99%80%EF%BB%BF-%EB%91%90%EB%A3%A8%EB%AF%B8.2115344/

Edit: added some more references.