surprising scores on short strings

Question

blackmad opened this issue 2 years ago

Hi! We've been using tinyld to identify the language of short search queries, and we've been a little surprised that strings that seem pretty clearly English to us get very low accuracy scores. Is it a known limitation that tinyld struggles with short text?

"search sprint 1" gives us
Merge Results [
{ lang: 'ga', accuracy: 0.08333333333333333 },
{ lang: 'et', accuracy: 0.044066666666666664 },
{ lang: 'ro', accuracy: 0.03285 },
{ lang: 'es', accuracy: 0.030449999999999994 },
{ lang: 'en', accuracy: 0.014425000000000002 }
]

With only=en, we get an accuracy of 0.117 for English on that string.

"new hire onboarding", only=en -> 0.058
"codebase modularization", only=en -> 0
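
When scores on short strings are this low, one option for callers is to treat the result as "unknown" instead of trusting the top entry. The following is only a minimal sketch of that idea, not part of tinyld: `pickConfidentLang`, its thresholds (0.5 and 0.1), and the `"unknown"` fallback are all hypothetical choices made for illustration. It assumes only the `{ lang, accuracy }` list shape printed in the report above.

```typescript
// Shape of one entry in the result list printed above.
interface LangScore {
  lang: string;
  accuracy: number;
}

// Hypothetical helper (not a tinyld API): return the top language only when
// its score clears `minAccuracy` and beats the runner-up by `minMargin`.
// Both default thresholds are arbitrary assumptions and would need tuning.
function pickConfidentLang(
  results: LangScore[],
  minAccuracy = 0.5,
  minMargin = 0.1,
  fallback = "unknown",
): string {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...results].sort((a, b) => b.accuracy - a.accuracy);
  const best = sorted[0];
  if (best === undefined) return fallback;
  const runnerUp = sorted[1];
  const margin = best.accuracy - (runnerUp ? runnerUp.accuracy : 0);
  return best.accuracy >= minAccuracy && margin >= minMargin
    ? best.lang
    : fallback;
}

// Scores copied from the "search sprint 1" report above.
const searchSprint: LangScore[] = [
  { lang: "ga", accuracy: 0.08333333333333333 },
  { lang: "et", accuracy: 0.044066666666666664 },
  { lang: "ro", accuracy: 0.03285 },
  { lang: "es", accuracy: 0.030449999999999994 },
  { lang: "en", accuracy: 0.014425000000000002 },
];

console.log(pickConfidentLang(searchSprint)); // prints "unknown"
```

With these thresholds the "search sprint 1" scores fall back to "unknown", since the top score (0.083) is far below 0.5. An application could then apply its own default, such as the user's locale.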

Kevin Destrem · Answer 1 · Thu Nov 10 2022 13:15:36 GMT+0800 (China Standard Time)

Yes, it's a normal problem. To avoid repeating myself, I created a FAQ and answered it here.

I have a few ideas that could help with accuracy on shorter strings, but nothing magical.

David Blackman · Answer 2 · Fri Nov 11 2022 01:23:58 GMT+0800 (China Standard Time)

Thanks for the link, and apologies that we hadn't found it already.
