hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/

Include abbreviations in the list of bad words

jerin248486 opened this issue

Hi,
I am working with hunspell in R to spell check some job titles. The issue I am facing is that some job titles are written as 'MGR' instead of 'Manager'. Hunspell treats MGR as a correct word because it is an accepted abbreviation in the dictionary, but I don't want to accept any abbreviations as they are. Is there any way to add those abbreviations to the bad word list as well, for example by disabling the abbreviations in the dictionary?

At least for the command-line tools, "man 5 hunspell" describes the personal dictionary format, including the "*" prefix that marks a word as forbidden. So for me:

$ echo "manager mgr" | hunspell -l

shows nothing while,

$ echo "manager mgr" | hunspell -l -p ~/.hunspell_default
mgr

where ~/.hunspell_default is

$ cat ~/.hunspell_default
*mgr

I am sorry that I conveyed my issue wrongly. I need all commonly used abbreviations to be removed from the dictionary, not just 'mgr', because I am not sure which abbreviations might be present across my whole dataset. 'Mgr' is just one that I noticed! Once again, I am sorry for posting the question without clarity.

Checking my /usr/share/hunspell/en_US.dic, I see that mgr is listed just like any other word. As far as I can see, there is no metadata or specific rule that identifies mgr as an abbreviation, so AFAICT there isn't a built-in or obvious automatable way to remove all abbreviations from the dictionary. One would have to go up a level to where the dictionary comes from (for Fedora, for example, it's built from SCOWL at http://wordlist.aspell.net/) and dig around there to see whether the input words are categorized as abbreviations and, if so, exclude them when building a dictionary.
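
If a list of unwanted abbreviations can be compiled somehow (by hand, or from whatever categories the upstream word list provides), the "*" personal dictionary format shown earlier extends to many entries at once. The following is only a rough sketch: the function name make_forbidden_dict, the file names abbreviations.txt and forbidden.dic, and the three sample abbreviations are all made up for illustration, and the list itself still has to come from somewhere.

```shell
#!/bin/sh
# Sketch: turn a plain list of abbreviations (one per line) into a hunspell
# personal dictionary in which every entry is forbidden via the "*" prefix.
# The function name and all file names here are hypothetical.
make_forbidden_dict() {
    # $1: input list, one abbreviation per line
    # $2: output personal dictionary
    # Drop blank lines, then prefix each remaining line with "*".
    sed -e '/^[[:space:]]*$/d' -e 's/^/*/' "$1" > "$2"
}

# Sample input (assumed); a real list would have to be compiled separately.
printf '%s\n' mgr asst supt > abbreviations.txt

make_forbidden_dict abbreviations.txt forbidden.dic
cat forbidden.dic

# If hunspell is installed, check some text against the generated dictionary.
if command -v hunspell >/dev/null 2>&1; then
    echo "manager mgr" | hunspell -l -p forbidden.dic || true
fi
```

Since the R package was mentioned, the same list could presumably be applied there as well, but how forbidden words are passed in R is not covered by this sketch.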