About Kannada and Hindi Script to Roman models
loretoparisi opened this issue · comments
Hello, I have a question about Kannada. Given these roman words
sundara sundaraa sunder
chandiraa chandira chandir
mana man
I get this script
ಸುಂದರಾ ಸುಂದರಾ ಸುಂದರ
ಚಂದೀರಾ ಚಂದೀರಾ ಚಂದಿರ
ಮನಾ ಮನ
So it seems that the model treats some words in the same (maybe wrong?) way: for example mana -> ಮನಾ, when it should actually be mana -> ಮನ.
The same happens for sundara (sunder) and chandira (chandir) and the related transliterations.
I'm using the kan-eng / eng-kan model in this case.
[UPDATE]
This also happens for kavan/kavana with the kan-eng model:
ಕವನ -> kavan
ಕವಾನಾ -> kavana
while it should be ಕವನ -> kavana
We have actually found that with beamsearch the roman kavana is among the results, while viterbi gives kavan as the best option. In most other models (like hin-eng, mar-eng, etc.) the word picked by viterbi was the correct one.
Any way to disambiguate these cases?
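One possible way to disambiguate, sketched here under assumptions: take the k-best list from beamsearch and pick the first candidate that appears in a list of preferred spellings, falling back to the top-scored candidate otherwise. The function name `pick_candidate`, the candidate list, and the lexicon below are all hypothetical and only illustrate the idea; they are not part of the library.

```python
def pick_candidate(candidates, lexicon):
    """Return the first candidate found in `lexicon`.

    `candidates` is assumed to be ordered best-first, as a k-best
    decoder would return them. If no candidate is in the lexicon,
    fall back to the top-scored candidate.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    for cand in candidates:
        if cand in lexicon:
            return cand
    return candidates[0]


# Hypothetical k-best output for ಕವನ, best-scored first (illustrative only).
candidates = ["kavan", "kavana"]

# Hypothetical set of spellings the application prefers.
preferred = {"kavana", "sundara", "mana"}

print(pick_candidate(candidates, preferred))
```

With an empty lexicon the function returns the first candidate, so the behaviour stays the same as taking the single best result.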
I have also found similar issues in the hin-eng model for Hindi.
It uses v instead of w and vice versa, which should be fine in both cases; it also uses c instead of k, which should not be correct. In the same way, this model is using or instead of aur, where the latter should be fine.
These are not issues with the transliteration model, but rather with the Romanization of Indic scripts. See #36.
@irshadbhat thank you very much. Would you be so kind as to explain how the romanization fails in this case? I assume you mean that the indic-trans transliteration phase worked, but the next step (i.e. the romanization) failed. Is that correct?
You totally misunderstood. My point is that when we Romanize a script, there is no fixed spelling for a word. There is a lot of spelling variation involved. Some prefer one spelling and some another. For example, for the word बहुत some write bohat, some bohut and some bahut. So, when you say, "the model is using or instead of aur, where the latter should be fine", that is because you prefer aur but someone else might prefer or. I have read a lot of Romanized text and I know for a fact that or is more frequently used than aur. I understand you want the model to predict better, but when we humans are confused (or not concerned with sticking to one spelling variation), how can a machine do it perfectly?
@irshadbhat you are perfectly right, thank you very much, it makes sense!