surprising scores on short strings
blackmad opened this issue · comments
Hi! We've been using tinyld to identify the language of short search queries, and we were a little surprised that strings that seem clearly English to us get such low accuracy scores. Is it a known limitation that tinyld struggles with short text?
"search sprint 1" gives us
Merge Results [
{ lang: 'ga', accuracy: 0.08333333333333333 },
{ lang: 'et', accuracy: 0.044066666666666664 },
{ lang: 'ro', accuracy: 0.03285 },
{ lang: 'es', accuracy: 0.030449999999999994 },
{ lang: 'en', accuracy: 0.014425000000000002 }
]
With only=en, we get an accuracy of 0.117 for English on that string. Two more examples:
"new hire onboarding", only=en -> 0.058
"codebase modularization", only=en -> 0
Yes, it's a normal problem. To avoid repeating myself, I created a FAQ and answered this question here.
I have a few ideas that could improve accuracy on shorter strings, but nothing magical.
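In the meantime, one caller-side way to cope with weak scores on short queries is to treat a low top score as "unknown" and fall back to a default language. The snippet below is only a sketch of that idea: `pickLanguage`, the `minAccuracy` threshold of 0.2, and the `'en'` fallback are all hypothetical choices for illustration, not part of tinyld. It only assumes a result list shaped like the `{ lang, accuracy }` output quoted above.

```javascript
// Hypothetical helper (not part of tinyld): pick a language from a list of
// { lang, accuracy } candidates, falling back when the signal is too weak.
// The default threshold and fallback are arbitrary values for illustration.
function pickLanguage(results, { minAccuracy = 0.2, fallback = 'en' } = {}) {
  if (results.length === 0) {
    return { lang: fallback, trusted: false }
  }
  // Sort a copy so the caller's array is left untouched.
  const [best] = [...results].sort((a, b) => b.accuracy - a.accuracy)
  if (best.accuracy < minAccuracy) {
    return { lang: fallback, trusted: false }
  }
  return { lang: best.lang, trusted: true }
}

// Scores reported earlier in this thread for "search sprint 1".
const shortQuery = [
  { lang: 'ga', accuracy: 0.08333333333333333 },
  { lang: 'et', accuracy: 0.044066666666666664 },
  { lang: 'ro', accuracy: 0.03285 },
  { lang: 'es', accuracy: 0.030449999999999994 },
  { lang: 'en', accuracy: 0.014425000000000002 }
]

console.log(pickLanguage(shortQuery))
// prints { lang: 'en', trusted: false }
```

The `trusted` flag lets the calling code tell a real detection apart from a fallback, so a search backend could, for example, skip language-specific handling when the detector had too little text to work with. The right threshold would have to be tuned on real queries.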