Unable to identify Starbuck text in language detection

Question

PipulPant1 opened this issue 2 years ago · comments

PIPUL PANT commented 2 years ago

console.log(detectAll('Starbucks')) or detect('Starbucks')

This returns an empty array. Can you please resolve this issue?

Kevin Destrem · Answer 1 · Wed Nov 02 2022 16:27:13 GMT+0800 (China Standard Time)

Indeed. Here is the result in verbose mode:

tinyld --verbose starbuck

Analize chunks [ 'starbuck' ]
[Pass 2] DetectPotentialGrams starbuck [
  's',     't',     'a',     'r',     'b',
  'u',     'c',     'k',     ' s',    'st',
  'ta',    'ar',    'rb',    'bu',    'uc',
  'ck',    'k ',    ' st',   'sta',   'tar',
  'arb',   'rbu',   'buc',   'uck',   'ck ',
  ' sta',  'star',  'tarb',  'arbu',  'rbuc',
  'buck',  'uck ',  ' star', 'starb', 'tarbu',
  'arbuc', 'rbuck', 'buck '
]
Result starbuck [
  { lang: 'af', accuracy: 0, score: 0 },
  { lang: 'am', accuracy: 0, score: 0 },
  { lang: 'ber', accuracy: 0, score: 0 },
  { lang: 'rn', accuracy: 0, score: 0 },
  { lang: 'ja', accuracy: 0, score: 0 },
  { lang: 'zh', accuracy: 0, score: 0 },
  { lang: 'ko', accuracy: 0, score: 0 },
  { lang: 'my', accuracy: 0, score: 0 }
]
Merge Results []

There is no bug here. It's just a really short text with nothing specific to any language, so the library doesn't guess and just answers "I don't know".

It gives the same result for a lot of brand names like mcdonald, microsoft, nike.
And when it does detect something, the match is often based on a single gram and is quite inaccurate:

apple -> sv
auchan -> de

This kind of word is usually detected from its context, i.e. the other words around it:

this is a starbuck -> en
let's go to starbuck -> en
allons au starbuck -> fr
starbucksに行こう -> ja

There is not much to do here except adding more language grams and a bigger database. Even then, with so few characters and such generic grams, the result will always be fairly inaccurate.
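To make the "so few characters" point concrete, here is a minimal sketch of the gram extraction. It is not tinyld's source code, just a reconstruction (the function name `extractGrams` and the padding rule are my assumptions) that yields the same list as the DetectPotentialGrams output for starbuck: every substring of 1 to 5 characters of the space-padded word, skipping grams that are only whitespace.

```javascript
// Illustrative reconstruction, NOT tinyld's actual implementation.
// Pads the word with one space on each side, then collects every
// substring of length 1..maxSize, ignoring whitespace-only grams.
function extractGrams(word, maxSize = 5) {
  const padded = ` ${word.toLowerCase()} `;
  const grams = [];
  for (let size = 1; size <= maxSize; size++) {
    for (let i = 0; i + size <= padded.length; i++) {
      const gram = padded.slice(i, i + size);
      if (gram.trim() !== '') grams.push(gram);
    }
  }
  return grams;
}

const grams = extractGrams('starbuck');
console.log(grams.length); // 38
console.log(grams);
```

An 8-letter word yields only 38 grams, and most of them ('s', 'ta', 'ar', ...) show up in many Latin-script languages, so there is very little signal for a statistical detector to score.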

Kevin Destrem · Answer 2 · Wed Nov 02 2022 17:31:46 GMT+0800 (China Standard Time)

Just for testing, I built and benchmarked a version with a bigger text database (2.5MB).
So it's really something for server-side usage, with a bigger memory footprint.

It slightly increases the overall accuracy to ~99.3% (+0.9%) and improves accuracy on inputs with a small number of characters.
The length needed to reach 95% accuracy drops from 24 to 16 characters.

But even with that brute-force method, it doesn't change anything for single-word detection.
It just gives an order of magnitude of what this kind of approach can do.

> ./bin/tinyld-large.js starbuck
[
  { lang: 'lv', accuracy: 0.125 },
  { lang: 'lt', accuracy: 0.0921 },
  { lang: 'en', accuracy: 0.039474999999999996 },
  { lang: 'de', accuracy: 0.0329 }
]

> ./bin/tinyld-large.js mcdonalds
[ { lang: 'is', accuracy: 1 } ]

> ./bin/tinyld-large.js apple
[
  { lang: 'sv', accuracy: 0.2 },
  { lang: 'no', accuracy: 0.13334000000000001 },
  { lang: 'en', accuracy: 0.07778 }
]
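Given these numbers, one practical option is to refuse to trust a result when the input is too short, whatever accuracy is reported. The helper below is only a sketch: `trustedLanguage` is a hypothetical name, not part of tinyld, the 24-character default comes from the benchmark figure quoted above, and the 0.5 accuracy cut-off is an arbitrary assumption. It takes results in the `{ lang, accuracy }` shape shown in the outputs above.

```javascript
// Hypothetical helper, not part of tinyld. `results` is an array of
// { lang, accuracy } objects sorted from most to least likely.
function trustedLanguage(text, results, options = {}) {
  // minChars: length gate taken from the benchmark figure (24 characters).
  // minAccuracy: arbitrary cut-off chosen for this sketch.
  const { minChars = 24, minAccuracy = 0.5 } = options;

  // Too short: statistical detection is unreliable, so give no answer.
  if (text.trim().length < minChars) return '';

  const best = results[0];
  if (!best || best.accuracy < minAccuracy) return '';
  return best.lang;
}

// Results copied from the runs above.
const starbuckResults = [
  { lang: 'lv', accuracy: 0.125 },
  { lang: 'lt', accuracy: 0.0921 },
  { lang: 'en', accuracy: 0.039474999999999996 },
  { lang: 'de', accuracy: 0.0329 },
];
const mcdonaldsResults = [{ lang: 'is', accuracy: 1 }];

console.log(JSON.stringify(trustedLanguage('starbuck', starbuckResults)));   // ""
console.log(JSON.stringify(trustedLanguage('mcdonalds', mcdonaldsResults))); // ""
```

The mcdonalds case is the reason for the length gate: an accuracy of 1 on a 9-character input would pass any accuracy threshold and still be wrong.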

PIPUL PANT · Answer 3 · Wed Nov 02 2022 18:59:32 GMT+0800 (China Standard Time)

@kefniark I have tried other texts as well, but it is not working as expected.
Platos Desayuno should return es, but it returns tr. I am using this package to detect the language of food names, and it fails in multiple cases. Can you advise what kind of strings should be avoided for detection?

Other strings are listed below. The result should be es, but detection fails for most of them:

Guarniciones de ensaladas
Panadería
Licores
Cerveza
Café
Bebidas
Vino por copa
Vino por botella
Cocteles

Kevin Destrem · Answer 4 · Wed Nov 02 2022 22:01:10 GMT+0800 (China Standard Time)

This library relies on statistical analysis, which works well but requires a certain amount of characters to build up statistics and recognize patterns. And tbh, that's the case for about 90% of language detection libraries.

Based on your list, and the fact that most of your names are under 16 characters, don't waste your time: it's not the kind of algorithm you are looking for. Maybe AI can do slightly better, but I would still expect a high error rate, as such models are mostly trained to detect the language of whole documents.

If you are using it for something specific like "food names", the simplest and most accurate approach is probably to build some dictionary matching: not trying to guess a language, just doing straight word matching, e.g. "vino" -> es
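A rough sketch of what such dictionary matching could look like is below. Everything in it is an assumption made for illustration: the `FOOD_WORDS` list is a tiny made-up sample (a real one would be built from the menus you actually need to classify), and `detectByDictionary` is a hypothetical helper, not something tinyld provides.

```javascript
// Hypothetical word list for illustration only.
const FOOD_WORDS = new Map([
  ['vino', 'es'], ['cerveza', 'es'], ['bebidas', 'es'], ['licores', 'es'],
  ['panadería', 'es'], ['desayuno', 'es'], ['platos', 'es'],
  ['wine', 'en'], ['beer', 'en'], ['drinks', 'en'], ['bakery', 'en'],
  ['breakfast', 'en'],
]);

// Returns the language with the most matching words, or '' when no word
// of the text is in the dictionary. On a tie, the language seen first wins.
function detectByDictionary(text, dictionary = FOOD_WORDS) {
  // Normalize accents to composed form, lowercase, split on non-letters.
  const words = text
    .normalize('NFC')
    .toLowerCase()
    .split(/[^\p{L}]+/u)
    .filter(Boolean);

  const votes = new Map();
  for (const word of words) {
    const lang = dictionary.get(word);
    if (lang) votes.set(lang, (votes.get(lang) || 0) + 1);
  }

  let best = '';
  let bestVotes = 0;
  for (const [lang, count] of votes) {
    if (count > bestVotes) {
      best = lang;
      bestVotes = count;
    }
  }
  return best;
}

console.log(detectByDictionary('Vino por copa'));   // es
console.log(detectByDictionary('Platos Desayuno')); // es
console.log(JSON.stringify(detectByDictionary('Starbucks'))); // ""
```

Unknown words simply produce no answer, which is the same "I don't know" behaviour as the library, but a known word like vino is matched regardless of how short the string is.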

Kevin Destrem · Answer 5 · Thu Nov 10 2022 13:27:31 GMT+0800 (China Standard Time)

Because I got another issue (#19) on the same topic, "Detection of short texts", I decided to create a FAQ.

Here is the answer, a summary of what was already discussed in this issue.