hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/

unmunch + hunspell, is anything wrong here?

olea opened this issue

Hi:

I help out a little with the hunspell-es team (RLA-ES). I ran this test just for fun, but the result looks weird to me:

$ git clone https://git.libreoffice.org/dictionaries/
$ DICC=$(pwd)/dictionaries/es
$ unmunch "${DICC}/es.dic" "${DICC}/es.aff" > es.unmunched
$ wc -l es.unmunched
1284912 es.unmunched
$ hunspell -d "${DICC}/es" -l es.unmunched | wc -l
520877

Honestly, I would have expected the spellcheck to return 0 lines. What am I doing wrong? Have I misunderstood what the unmunch output is? Or is there a significant problem in the Spanish dictionary?

I really don't know how to interpret these results or what action, if any, should be taken.

Thanks a lot.

I'm not a Hunspell developer, but as far as I understand it, what you're seeing is a problem with Unmunch, not with your dictionary. I see the same thing with my own dictionaries. I think Unmunch tries to expand each entry into the word forms the dictionary accepts, but it gets this wrong in both directions. It outputs some forms that Hunspell itself rejects, which is why Hunspell flags part of Unmunch's output. It also misses forms that the dictionary does contain: for example, I just ran Unmunch on my dictionary and it failed to produce many words that I know the dictionary contains.
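For what it's worth, here is a rough shell sketch for putting numbers on that mismatch. It assumes you have already saved the flagged words to a file with hunspell -l (the file name es.rejected is only a placeholder). The helper names rejected_ratio and accepted_words are made up for this example and are not part of Hunspell. For scale, the counts reported above (520877 flagged out of 1284912 lines) work out to roughly 40%.

```shell
# Assumed inputs (hypothetical file names):
#   $1 = word list produced by unmunch, one word per line
#   $2 = words flagged by: hunspell -d "${DICC}/es" -l es.unmunched > es.rejected

# Print "<unique words> <unique flagged words> <flagged percentage>".
rejected_ratio() {
    total=$(sort -u "$1" | wc -l)
    rejected=$(sort -u "$2" | wc -l)
    awk -v t="$total" -v r="$rejected" 'BEGIN {
        if (t == 0) { print "no words"; exit 1 }
        printf "%d %d %.1f%%\n", t, r, 100 * r / t
    }'
}

# Print the words from $1 that do not appear in the flagged list $2.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
accepted_words() {
    grep -Fxvf "$2" "$1"
}
```

Something like rejected_ratio es.unmunched es.rejected then gives a quick summary, and accepted_words es.unmunched es.rejected > es.accepted keeps only the forms Hunspell does accept. This only filters out the invalid forms; it cannot recover valid words that Unmunch never generated.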

If you want to generate all the words in your dictionary, you may be interested in a program I've written that does this: https://github.com/fin-w/LibreOffice-Geiriadur-Cymraeg-Welsh-Dictionary/blob/main/wordforms
It's a complete rewrite of Hunspell's Wordforms script, with the same functionality, (usually) faster run times when generating the affixed variations of a single word, and the ability to generate every word variation in the dictionary, as Unmunch is supposed to do. A warning, though: it's only a proof of concept, and generating every word in a dictionary takes a long time (particularly with the Spanish dictionary). Instead of Unmunch, you can run wordforms -g es.aff es.dic es.unmunched with my script, which should generate all the words in your dictionary and write them to es.unmunched.