hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/

Seemingly very similar suggestions are not found in French

grothesque opened this issue

This is with Hunspell 1.7.1 as found in Debian bookworm. There is no particular configuration. The dictionary is the recommended one from the package hunspell-fr-classical.

I rely heavily on hunspell for correcting my correspondence in French. (Thanks!) As a non-native speaker of that language, I have particular difficulties with getting the accents right. I notice that quite often suggestions that would seem very similar are not found by hunspell.

In my experience, getting one accent wrong often means that no other mistake is allowed, or Hunspell will not find the correct suggestion. To me, this happens all the time...

Here are some examples:

$ echo telecharger wikipedia batimont | hunspell -d fr_FR
Hunspell 1.7.1
& telecharger 3 0: recharger, chanterelle, charlater
& wikipedia 1 12: stipendia
& batimont 1 22: intimation
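For readers unfamiliar with the output above: each `&` line follows the ispell-style pipe format, giving the misspelled word, the number of suggestions, the character offset of the word in the input line, and then the suggestions after the colon. A minimal Python sketch for parsing such a line, handy when comparing results across dictionary versions, could look like this. The names `Miss` and `parse_miss` are made up for this example and are not part of Hunspell.

```python
from typing import List, NamedTuple


class Miss(NamedTuple):
    """One '&' line from Hunspell's pipe output (hypothetical helper type)."""
    word: str
    offset: int
    suggestions: List[str]


def parse_miss(line: str) -> Miss:
    """Parse a line such as '& wikipedia 1 12: stipendia'."""
    head, _, tail = line.strip().partition(": ")
    marker, word, count, offset = head.split()
    if marker != "&":
        raise ValueError("not a misspelling-with-suggestions line: " + line)
    suggestions = tail.split(", ") if tail else []
    if len(suggestions) != int(count):
        raise ValueError("suggestion count does not match: " + line)
    return Miss(word, int(offset), suggestions)


miss = parse_miss("& telecharger 3 0: recharger, chanterelle, charlater")
print(miss.word, miss.offset, miss.suggestions)
```

The sketch only handles `&` lines; the other markers the pipe format can emit are deliberately left out.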

Discussion:

  • telecharger -> télécharger: I would expect hunspell to find this one since it differs only by two accents. Instead it proposes words that are quite different!
  • wikipedia -> Wikipédia: Here what's missing is one accent and the capitalization of one letter.
  • batimont -> bâtiment: If hunspell considers pronunciation, these should be very similar.

Do you have this line in your affix file?

REP e é

I guess you are using 6.4 version of hunspell-fr-classical package. Can you try using 7.0?

Do you have this line in your affix file?

REP e é

I have no personal affix files, but the files /usr/share/hunspell/fr*.aff contain it:

$ grep '^REP e é' /usr/share/hunspell/fr*.aff
/usr/share/hunspell/fr.aff:REP e é
/usr/share/hunspell/fr_BE.aff:REP e é
/usr/share/hunspell/fr_CA.aff:REP e é
/usr/share/hunspell/fr_CH.aff:REP e é
/usr/share/hunspell/fr_FR.aff:REP e é
/usr/share/hunspell/fr_LU.aff:REP e é
/usr/share/hunspell/fr_MC.aff:REP e é

I guess you are using 6.4 version of hunspell-fr-classical package. Can you try using 7.0?

This is with Debian bookworm, so it's already 7.0:

$ apt policy hunspell-fr-classical 
hunspell-fr-classical:
  Installed: 1:7.0-1
  Candidate: 1:7.0-1
  Version table:
 *** 1:7.0-1 500
        500 http://deb.debian.org/debian bookworm/main amd64 Packages
        100 /var/lib/dpkg/status

Hunspell does well when the input is one change away from the correct spelling, but quite badly when it is two (or more) changes away.

It easily finds the correction for "télecharger" or "telécharger", but not for "telecharger".
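To make the "one change versus two" observation concrete, here is a small Python sketch that counts character-level edits (plain Levenshtein distance) between the misspellings in this thread and the intended words. It is only an illustration: Hunspell's suggestion search is not a plain Levenshtein search, and the function name `edit_distance` is made up for this example.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    # Normalize so that an accented letter is a single code point on both sides.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


pairs = [
    ("télecharger", "télécharger"),  # 1
    ("telécharger", "télécharger"),  # 1
    ("telecharger", "télécharger"),  # 2
    ("wikipedia", "Wikipédia"),      # 2
    ("batimont", "bâtiment"),        # 2
]
for typo, target in pairs:
    print(typo, "->", target, edit_distance(typo, target))
```

All three examples from the original report sit exactly two edits away from the expected word, which is consistent with the observation above.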

It seems the fr_FR dictionary doesn't have phonetic rules.

For the Walloon dictionary I maintain, I have 188 phonetic rules.

Here are some possibilities for French phonetic rules
(the "phonetic" symbol could be anything; I used "X" for the [S] sound, as in "chat", "château"):

PHONE QU(EIÈÉÊÎ)- K
PHONE QU(AOUÅ)- KW
PHONE Q K
PHONE X(ABCDEÈÉÊÎFGIÎJKLMNOPRSTUVWYZ) KS
PHONE CH X
PHONE CE$ S
PHONE CES$ S
PHONE C(EÉÈÊÎ)- S
PHONE C$ _
PHONE C K
PHONE Ç S
PHONE AI E
PHONE E$ _
PHONE E E
PHONE É E
PHONE È E
PHONE Ê E
PHONE S$ _
PHONE AN ON
PHONE ON ON

The syntax is described here:
http://aspell.net/man-html/Phonetic-Code.html
(Hunspell simply includes the phonet code from Aspell.)

After adding this to the fr_FR.aff file:

PHONE 7
PHONE Â A
PHONE AI E
PHONE AU O
PHONE EAU O
PHONE É E
PHONE EN Q
PHONE ON Q

I get good results for missing accents:

$ echo telecharger wikipedia | hunspell -d fr_FR
Hunspell 1.7.0
& telecharger 6 0: télécharger, recharger, rechargeable, contrecharge, préchargement, africaniser
& wikipedia 3 12: Wikipédia, vidéoclip, illuminer

(For "batimont", even with some rules that should give "bâtiment" and "batimont" the same "phonetic" representation, it still doesn't work; I don't understand why.)
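As a rough illustration of what the seven PHONE rules above are meant to achieve, the following Python sketch applies them as simple left-to-right substitutions and compares the resulting keys. The matching strategy (uppercase the word, try longer patterns first, copy unmatched letters through) is an assumption made for this example; it is not the phonet algorithm that Hunspell actually runs, and `phonetic_key` is a made-up name.

```python
import unicodedata

# Patterns and codes copied from the PHONE lines above, longest pattern first.
# How they are matched below is a simplification, not Hunspell's real logic.
RULES = [
    ("EAU", "O"),
    ("AI", "E"),
    ("AU", "O"),
    ("EN", "Q"),
    ("ON", "Q"),
    ("Â", "A"),
    ("É", "E"),
]


def phonetic_key(word: str) -> str:
    """Return a crude phonetic key under the simplified matching above."""
    text = unicodedata.normalize("NFC", word).upper()
    out = []
    i = 0
    while i < len(text):
        for pattern, code in RULES:
            if text.startswith(pattern, i):
                out.append(code)
                i += len(pattern)
                break
        else:
            # No rule starts here: keep the letter unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


for typo, target in [("telecharger", "télécharger"),
                     ("wikipedia", "Wikipédia"),
                     ("batimont", "bâtiment")]:
    print(typo, phonetic_key(typo), phonetic_key(target))
```

Under this simplified model all three pairs collapse to the same key, including "batimont" and "bâtiment" (both become `BATIMQT`). So the sketch matches the expectation stated above, but it does not explain why Hunspell still fails to suggest "bâtiment".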