About Kannada and Hindi Script to Roman models

Question

loretoparisi opened this issue 6 years ago

Hello, I have a question about Kannada. Given these romanized words:

sundara sundaraa sunder
chandiraa chandira chandir
mana man

I get this Kannada script output:

ಸುಂದರಾ ಸುಂದರಾ ಸುಂದರ
ಚಂದೀರಾ ಚಂದೀರಾ ಚಂದಿರ
ಮನಾ ಮನ

So the model seems to produce the same (possibly wrong) output for different spellings of some words, for example

mana -> ಮನಾ, when it should actually be mana -> ಮನ.

The same happens for sundara (sunder) and chandira (chandir) and their transliterations.
I'm using the kan-eng / eng-kan models in this case.

Loreto Parisi · Answer 1 · Fri Feb 01 2019 23:07:22 GMT+0800 (China Standard Time)

[UPDATE]
This also happens for kavan/kavana with the kan-eng model:

ಕವನ -> kavan
ಕವಾನಾ -> kavana

while it should be ಕವನ -> kavana

We actually found that with beamsearch the romanization kavana is among the results, while viterbi gives kavan as the best option. In most of the other models (like hin-eng, mar-eng, etc.), the word picked by viterbi was the correct one.
Any way to disambiguate these cases?
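
One possible caller-side approach, sketched here under assumptions: take the k-best list from beamsearch and re-rank it against a table of preferred spellings. The function name `pick_candidate`, the `preferred` table, and the example candidate list are all hypothetical and are not part of indic-trans; the candidates would have to come from whatever beamsearch output you already have.

```python
from typing import Dict, Sequence


def pick_candidate(candidates: Sequence[str], preferred: Dict[str, float]) -> str:
    """Pick the candidate with the highest preference weight.

    `candidates` is assumed to be ordered best-first by the decoder.
    `preferred` is a hypothetical, user-supplied table mapping a
    romanized spelling to a weight (for example a corpus frequency).
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    # Unknown spellings get weight 0.0. max() returns the first maximal
    # item, so ties fall back to the decoder's own ordering.
    return max(candidates, key=lambda c: preferred.get(c, 0.0))


# Hypothetical k-best list for ಕವನ, best-first as the decoder returned it.
k_best = ["kavan", "kavana"]

# With a preference for "kavana" the second candidate wins.
print(pick_candidate(k_best, {"kavana": 1.0}))

# With no preference information the decoder's top choice is kept.
print(pick_candidate(k_best, {}))
```

This does not change the model itself; it only lets the caller encode which spelling convention they want when several candidates are plausible.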

Loreto Parisi · Answer 2 · Thu Feb 14 2019 00:53:31 GMT+0800 (China Standard Time)

I have also found similar issues in the hin-eng model for Hindi.

The model uses v instead of w and vice versa, which should be fine either way. It also uses c instead of k, which should not be correct. Similarly, it uses or instead of aur, where the latter should be fine.

Irshad Ahmad · Answer 3 · Thu Feb 14 2019 03:49:12 GMT+0800 (China Standard Time)

These are not issues with the transliteration model, but rather with the Romanization of Indic scripts. See #36.

Loreto Parisi · Answer 4 · Thu Feb 14 2019 17:25:30 GMT+0800 (China Standard Time)

@irshadbhat thank you very much. Would you be so kind as to explain how the romanization fails in this case? I assume you mean that the indic-trans transliteration phase worked, but the next step (i.e. the romanization) failed. Is that correct?

Irshad Ahmad · Answer 5 · Thu Feb 14 2019 22:26:26 GMT+0800 (China Standard Time)

You totally misunderstood. My point is that when we Romanize a script, there is no fixed spelling for a word; a lot of spelling variation is involved. Some prefer one spelling and some another. For example, for the word बहुत some write bohat, some bohut and some bahut. So, when you say, "the model is using or instead of aur, where the latter should be fine", that is because you prefer aur, but someone else might prefer or. I have read a lot of Romanized text and I know for a fact that or is used more frequently than aur. I understand you want the model to predict better, but when we humans are confused (or not concerned with sticking to one spelling variant), how can a machine do it perfectly?
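
If a consistent spelling is needed downstream, one option is to normalize variants after transliteration instead of expecting the model to pick one. The following is only a minimal sketch: the helper names `build_variant_map` and `normalize` are hypothetical, and which spelling counts as canonical (here bahut and aur) is an arbitrary choice made for illustration, not a claim about which form is correct.

```python
from typing import Dict, Iterable


def build_variant_map(groups: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Invert {canonical: [variants]} into {variant: canonical}."""
    variant_to_canonical: Dict[str, str] = {}
    for canonical, variants in groups.items():
        variant_to_canonical[canonical] = canonical
        for variant in variants:
            variant_to_canonical[variant] = canonical
    return variant_to_canonical


def normalize(text: str, variant_map: Dict[str, str]) -> str:
    """Replace each whitespace-separated token by its canonical spelling.

    Tokens that are not in the map are left unchanged.
    """
    return " ".join(variant_map.get(tok.lower(), tok) for tok in text.split())


# Variant groups taken from the examples in this thread; the canonical
# forms are an assumption and could just as well be swapped.
variant_map = build_variant_map({
    "bahut": ["bohat", "bohut"],
    "aur": ["or"],
})

print(normalize("bohat accha or bohut", variant_map))
```

Note that a token-level map like this is blind to context: mapping or to aur would also rewrite the English word "or" in mixed-language text, so it only suits input known to be Romanized Hindi.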

Loreto Parisi · Answer 6 · Fri Feb 15 2019 01:20:32 GMT+0800 (China Standard Time)

@irshadbhat you are perfectly right, thank you very much, it makes sense!