hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/

How does Hunspell handle agglutinative languages like Turkish?

lancejpollard opened this issue

The Turkish Hunspell .aff file has more than 50,000 affix entries, all of which say N (No) in the SFX header line. Many of the suffixes are also very long. For example:

SFX 3 N 1
SFX 3 0 cilerdensin .
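To make sure I am reading those two lines the right way, here is a minimal Python sketch that splits them into fields. It assumes the layout described in the hunspell(5) man page: a header line `SFX flag cross_product count`, followed by rule lines `SFX flag strip add [condition]`. The function name `parse_affix_line` and the dictionary keys are my own, not part of Hunspell, and the header detection is a simple heuristic.

```python
def parse_affix_line(line):
    """Split one PFX/SFX line of a Hunspell .aff file into named fields.

    Hypothetical helper for illustration only; it ignores morphological
    fields and other directives.
    """
    fields = line.split()
    kind, flag = fields[0], fields[1]

    # Header line: SFX <flag> <Y|N> <number of rules>
    if len(fields) == 4 and fields[2] in ("Y", "N") and fields[3].isdigit():
        return {
            "type": "header",
            "kind": kind,
            "flag": flag,
            # Per the man page, Y/N is the cross-product setting: whether
            # this affix may be combined with prefixes.
            "cross_product": fields[2] == "Y",
            "count": int(fields[3]),
        }

    # Rule line: SFX <flag> <strip> <add>[/<continuation flags>] [<condition>]
    strip = "" if fields[2] == "0" else fields[2]   # "0" means strip nothing
    add, _, continuation = fields[3].partition("/")
    add = "" if add == "0" else add                 # "0" means add nothing
    condition = fields[4] if len(fields) > 4 else "."
    return {
        "type": "rule",
        "kind": kind,
        "flag": flag,
        "strip": strip,
        "add": add,
        "continuation": continuation,
        "condition": condition,
    }


header = parse_affix_line("SFX 3 N 1")
rule = parse_affix_line("SFX 3 0 cilerdensin .")
print(header)
print(rule)
```

If that reading of the man page is right, the N in the header controls combination with prefixes, and chaining onto further suffixes would instead be expressed through continuation flags after a `/` in the rule line, which this entry does not have.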

According to this blog post, Turkish has around 700 suffixes (by my last count). They can be combined in many ways, and a single word sometimes carries more than 10 suffixes concatenated onto the base word.

I would think that in principle you would store some sort of Directed Acyclic Graph of suffixes, so that possible/theoretical words which have never been encountered before can be computed dynamically. Instead, it appears that the Hunspell Turkish dictionary precompiles the possible suffix chains and lists each one as its own SFX ... N entry (no chaining). Am I reading that correctly?
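To illustrate what I mean by a graph of suffixes versus precompiled chains, here is a toy Python sketch. The stems, the four suffixes (-ler, -im, -de, -den), and the ordering rules are a drastically simplified assumption of mine: they ignore vowel harmony and most of Turkish morphology, and nothing here reflects how Hunspell is actually implemented.

```python
from functools import lru_cache

# Hypothetical toy data: two stems and four suffixes.
STEMS = {"ev", "el"}

# Each state lists the suffixes that may follow it and the state they lead to.
# The graph is acyclic: plural -> possessive -> case, with no way back.
TRANSITIONS = {
    "stem": [("ler", "plural"), ("im", "possessive"), ("de", "case"), ("den", "case")],
    "plural": [("im", "possessive"), ("de", "case"), ("den", "case")],
    "possessive": [("de", "case"), ("den", "case")],
    "case": [],
}


def accepts(word):
    """Return True if word is a known stem followed by a valid suffix chain."""

    @lru_cache(maxsize=None)
    def walk(pos, state):
        if pos == len(word):
            return True  # consumed the whole word along valid edges
        return any(
            word.startswith(suffix, pos) and walk(pos + len(suffix), nxt)
            for suffix, nxt in TRANSITIONS[state]
        )

    return any(word.startswith(stem) and walk(len(stem), "stem") for stem in STEMS)


def flatten(state="stem"):
    """List every suffix chain reachable from state, as one string each.

    This is the 'precompiled' view: one flat entry per possible chain.
    """
    chains = [""]  # stopping at this state is always allowed
    for suffix, nxt in TRANSITIONS[state]:
        chains.extend(suffix + rest for rest in flatten(nxt))
    return chains


print(accepts("evlerimde"))  # ev + ler + im + de
print(accepts("evdeler"))    # case suffix followed by plural is not in the graph
print(len(flatten()))        # number of flat entries, including the bare stem
```

Even in this toy, four suffixes expand to a dozen flat chains, so I can see how roughly 700 suffixes could blow up into tens of thousands of precompiled entries.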

In newer versions of Hunspell, is there a more idiomatic way of solving this with fewer suffix entries?

I feel like I read somewhere that Hunspell can only support 2 prefixes or 2 suffixes, or 1 of each together. Is a limit like that an issue here, and is it the reason the Turkish dictionary is organized this way?

Thank you so much for your help!