How does Hunspell handle agglutinative languages like Turkish?
lancejpollard opened this issue
The Turkish Hunspell .aff file has over 50,000 affixes, all of which are marked N (No). Some of the suffixes are also very long:
SFX 3 N 1
SFX 3 0 cilerdensin .
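To make my reading of those two lines concrete, here is a small Python sketch. The field names follow my understanding of the hunspell(5) man page (header: flag, cross-product, rule count; rule: flag, strip, add, condition). The class and function names are made up for illustration and are not part of Hunspell. If this reading is right, the N field is the cross-product flag (whether the suffix may be combined with a prefix), which may be a different thing from suffix chaining.

```python
from dataclasses import dataclass


@dataclass
class SfxHeader:
    flag: str            # name of the affix class, here "3"
    cross_product: bool  # Y/N field: may this class combine with a prefix?
    count: int           # number of rule lines that follow


@dataclass
class SfxRule:
    flag: str
    strip: str      # characters removed from the stem ("" when the field is 0)
    add: str        # text appended to the stem
    condition: str  # "." means any stem ending matches


def parse_sfx_header(line: str) -> SfxHeader:
    # Assumed layout: SFX <flag> <Y|N> <count>
    _, flag, cross, count = line.split()
    return SfxHeader(flag, cross == "Y", int(count))


def parse_sfx_rule(line: str) -> SfxRule:
    # Assumed layout: SFX <flag> <strip> <add> [<condition>]
    parts = line.split()
    strip = "" if parts[2] == "0" else parts[2]
    condition = parts[4] if len(parts) > 4 else "."
    return SfxRule(parts[1], strip, parts[3], condition)


header = parse_sfx_header("SFX 3 N 1")
rule = parse_sfx_rule("SFX 3 0 cilerdensin .")
```

Under that reading, the header declares one rule for flag 3, and the rule appends the whole string cilerdensin to any stem without stripping anything.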
According to this blog post, Turkish has around 700 suffixes (as of the last time I counted). They can be combined in arbitrary ways, sometimes with 10 or more suffixes concatenated onto the base word. In principle, I would expect some sort of Directed Acyclic Graph to be stored, so that possible/theoretical words which have never been encountered before can be computed dynamically. Instead, the Hunspell Turkish dictionary appears to precompile the possible suffix chains and list each one as SFX ... N (no chaining). Am I reading that correctly?
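To illustrate what I mean by the graph approach, here is a minimal Python sketch. It is not how Hunspell works internally. The states, the handful of suffixes, the single stem, and the names SUFFIX_GRAPH, STEMS, and accepts are all toy assumptions, and it ignores vowel harmony and every other phonological rule.

```python
from functools import lru_cache

# Toy suffix graph: state -> list of (suffix text, next state).
# Edges only ever move "forward" (stem -> plural -> poss -> case),
# so the graph is acyclic.
SUFFIX_GRAPH = {
    "stem":   [("ler", "plural"), ("im", "poss"), ("de", "case"), ("den", "case")],
    "plural": [("im", "poss"), ("de", "case"), ("den", "case")],
    "poss":   [("de", "case"), ("den", "case")],
    "case":   [],
}

# Hypothetical one-word stem list.
STEMS = {"ev"}


@lru_cache(maxsize=None)
def _walk(state: str, rest: str) -> bool:
    """Return True if `rest` can be consumed by following edges from `state`."""
    if not rest:
        return True  # every state is treated as a valid stopping point
    return any(
        rest.startswith(suffix) and _walk(next_state, rest[len(suffix):])
        for suffix, next_state in SUFFIX_GRAPH[state]
    )


def accepts(word: str) -> bool:
    """Check whether `word` is a known stem followed by a valid suffix path."""
    return any(
        word.startswith(stem) and _walk("stem", word[len(stem):])
        for stem in STEMS
    )


print(accepts("evlerimden"))  # ev + ler + im + den
print(accepts("evdenler"))    # a case suffix cannot be followed by the plural
```

With a graph like this, four suffix entries cover every ordering the edges allow, rather than one precompiled entry per full suffix chain.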
In newer versions of Hunspell, is there a more idiomatic way of solving this with fewer suffixes?
I feel like I read somewhere that Hunspell can only support 2 prefixes or 2 suffixes, or 1 of each together. Is a limit like that the issue here, and the reason the Turkish dictionary is organized this way?
Thank you so much for your help!