libindic / indic-trans

The project aims to add a state-of-the-art transliteration module for cross-transliteration among all Indian languages, including English.

About Kannada and Hindi Script to Roman models

loretoparisi opened this issue · comments

Hello, I have a question about Kannada. Given these Roman words:

sundara sundaraa sunder
chandiraa chandira chandir
mana man

I get this output in Kannada script:

ಸುಂದರಾ ಸುಂದರಾ ಸುಂದರ
ಚಂದೀರಾ ಚಂದೀರಾ ಚಂದಿರ
ಮನಾ ಮನ

So the model seems to behave the same (possibly wrong?) way for several words, for example

mana -> ಮನಾ, when it should actually be mana -> ಮನ.

The same happens for sundara (sunder) and chandira (chandir) and the related transliterations.
I'm using the kan-eng / eng-kan model in this case.

[UPDATE]
This also happens for kavan/kavana with the kan-eng model:

ಕವನ -> kavan
ಕವಾನಾ -> kavana

while it should be ಕವನ -> kavana

We actually found that with beamsearch the Roman kavana is among the results, whereas viterbi gives kavan as the best option. In most other models (such as hin-eng, mar-eng, etc.) the word picked by viterbi was the correct one.
Any way to disambiguate these cases?
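
One possible workaround, sketched here under assumptions: take the k-best list from beamsearch and re-rank it against a list of spellings you prefer. The helper `pick_preferred`, the candidate list, and the lexicon below are hypothetical and hard-coded for illustration; they are not part of indic-trans, and the candidates only mirror the kavan/kavana case described above.

```python
def pick_preferred(candidates, preferred_spellings):
    """Return the first candidate found in preferred_spellings.

    `candidates` is assumed to be ordered best-first, as a k-best
    decoder would return them. If no candidate is in the lexicon,
    fall back to the decoder's top choice.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    for candidate in candidates:
        if candidate in preferred_spellings:
            return candidate
    return candidates[0]


# Illustrative k-best list for ಕವನ (best first), not real model output.
k_best = ["kavan", "kavana"]

# Hypothetical user-maintained lexicon of preferred Roman spellings.
lexicon = {"kavana", "sundara", "mana"}

print(pick_preferred(k_best, lexicon))  # prints "kavana"
```

This keeps the viterbi-style top choice whenever the lexicon has nothing to say, so it only changes the output for words you have explicitly listed.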

I have also found similar issues in the hin-eng model for Hindi.

The model uses v instead of w and vice versa; here either spelling should be fine. It also uses c instead of k, which should not be correct. Similarly, the model uses or instead of aur, where the latter should be fine.

These are not issues with the transliteration model, but rather with the Romanization of Indic scripts. See #36.

@irshadbhat thank you very much. Would you be so kind as to explain how the romanization fails in this case? I assume you mean that the indic-trans transliteration phase worked, but the next step (i.e. the romanization) failed. Is that correct?

You totally misunderstood. My point is that when we Romanize a script, there is no fixed spelling for a word. There is a lot of spelling variation involved: some prefer one spelling and some another. For example, for the word बहुत some write bohat, some bohut and some bahut. So, when you say, "the model is using or instead of aur, where the latter should be fine", that is because you prefer aur, but someone else might prefer or. I have read a lot of Romanized text and I know for a fact that or is more frequently used than aur. I understand you want the model to predict better, but when we humans are confused (or not concerned with sticking to one spelling variant), how can a machine do it perfectly?
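
To make the point about spelling variation concrete, here is a minimal sketch of how one might compare Romanized output while treating known variants as the same word. The variant groups come from the examples in this thread; the names `VARIANTS`, `build_lookup` and `same_word` are hypothetical, and the choice of which spelling is "canonical" is arbitrary.

```python
# Variant groups taken from the examples in this thread.
# The dictionary key is an arbitrary canonical label, not a "correct" spelling.
VARIANTS = {
    "bahut": {"bahut", "bohat", "bohut"},
    "aur": {"aur", "or"},
}


def build_lookup(variant_groups):
    """Map every known spelling to its canonical label."""
    lookup = {}
    for canonical, spellings in variant_groups.items():
        for spelling in spellings:
            lookup[spelling] = canonical
    return lookup


def same_word(a, b, lookup):
    """Return True if a and b are the same word up to known spelling variants."""
    canonical_a = lookup.get(a.lower(), a.lower())
    canonical_b = lookup.get(b.lower(), b.lower())
    return canonical_a == canonical_b


LOOKUP = build_lookup(VARIANTS)

print(same_word("or", "aur", LOOKUP))       # prints True
print(same_word("bohat", "bohut", LOOKUP))  # prints True
print(same_word("bahut", "aur", LOOKUP))    # prints False
```

With a comparison like this, an output of or where aur was expected would not be counted as a transliteration error.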

@irshadbhat you are perfectly right, thank you very much, it makes sense!