komodojp / tinyld

Simple and Performant Language detection library for NodeJS

Home Page: https://komodojp.github.io/tinyld/

surprising scores on short strings

blackmad opened this issue

Hi! We've been using tinyld to identify the language of short search queries, and we've been a little surprised that strings which seem pretty clearly English to us get such low accuracy scores. Is it a known limitation that tinyld struggles with short text?

"search sprint 1" gives us
Merge Results [
{ lang: 'ga', accuracy: 0.08333333333333333 },
{ lang: 'et', accuracy: 0.044066666666666664 },
{ lang: 'ro', accuracy: 0.03285 },
{ lang: 'es', accuracy: 0.030449999999999994 },
{ lang: 'en', accuracy: 0.014425000000000002 }
]

With only=en, we get an accuracy of 0.117 for English on that string.

new hire onboarding, only=en -> 0.058
codebase modularization, only=en -> 0

Yes, it's a normal problem. To avoid repeating myself, I created a FAQ and answered it there.

I have a few ideas that could help with accuracy on shorter strings, but nothing magical.
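
In the meantime, one caller-side way to cope with scores like these is to treat a low best score as "unknown" rather than trusting the top candidate. The snippet below is only a rough sketch of that idea, not something tinyld provides: `pickLanguage`, the `0.5` threshold, and the `'und'` fallback are all hypothetical choices made for illustration. It works on any array in the `{ lang, accuracy }` shape shown above, and uses the scores reported for "search sprint 1" as input.

```javascript
// Hypothetical helper (not part of tinyld): choose a language from a list of
// { lang, accuracy } candidates, or return a fallback when the best score is too low.
function pickLanguage(results, minAccuracy = 0.5, fallback = 'und') {
  if (!Array.isArray(results) || results.length === 0) return fallback;
  // Find the candidate with the highest accuracy without assuming the list is sorted.
  const best = results.reduce((a, b) => (b.accuracy > a.accuracy ? b : a));
  return best.accuracy >= minAccuracy ? best.lang : fallback;
}

// Scores copied from the "search sprint 1" report above.
const shortQuery = [
  { lang: 'ga', accuracy: 0.08333333333333333 },
  { lang: 'et', accuracy: 0.044066666666666664 },
  { lang: 'ro', accuracy: 0.03285 },
  { lang: 'es', accuracy: 0.030449999999999994 },
  { lang: 'en', accuracy: 0.014425000000000002 },
];

// The best score (0.083 for 'ga') is below the assumed 0.5 threshold,
// so the query is treated as undetermined instead of Irish.
console.log(pickLanguage(shortQuery)); // 'und'
```

The threshold is arbitrary and would need tuning against real queries; this only avoids acting on a weak guess, it does not make the detection itself any more accurate.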