komodojp / tinyld

Simple and Performant Language detection library for NodeJS

Home Page: https://komodojp.github.io/tinyld/

Unable to identify Starbuck text in language detection

PipulPant1 opened this issue · comments

console.log(detectAll('Starbucks')) or detect('Starbucks')

This returns an empty array. Can you please look into this issue?

Indeed, here is the result in verbose mode:

tinyld  --verbose starbuck

Analize chunks [ 'starbuck' ]
[Pass 2] DetectPotentialGrams starbuck [
  's',     't',     'a',     'r',     'b',
  'u',     'c',     'k',     ' s',    'st',
  'ta',    'ar',    'rb',    'bu',    'uc',
  'ck',    'k ',    ' st',   'sta',   'tar',
  'arb',   'rbu',   'buc',   'uck',   'ck ',
  ' sta',  'star',  'tarb',  'arbu',  'rbuc',
  'buck',  'uck ',  ' star', 'starb', 'tarbu',
  'arbuc', 'rbuck', 'buck '
]
Result starbuck [
  { lang: 'af', accuracy: 0, score: 0 },
  { lang: 'am', accuracy: 0, score: 0 },
  { lang: 'ber', accuracy: 0, score: 0 },
  { lang: 'rn', accuracy: 0, score: 0 },
  { lang: 'ja', accuracy: 0, score: 0 },
  { lang: 'zh', accuracy: 0, score: 0 },
  { lang: 'ko', accuracy: 0, score: 0 },
  { lang: 'my', accuracy: 0, score: 0 }
]
Merge Results []

There is no bug here. It's just a really short text with nothing specific to any language, so the library doesn't guess and just answers "I don't know".
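
To make the verbose output above easier to read, here is a minimal sketch of how such a gram list can be produced. It is reconstructed from the log only, not taken from tinyld's source: the function name `extractGrams` and the `maxSize` parameter are assumptions, and it has only been checked against the single-word case shown above.

```javascript
// Sketch reconstructed from the verbose log, NOT tinyld's actual implementation.
// `extractGrams` and `maxSize` are hypothetical names used for illustration.
function extractGrams(word, maxSize = 5) {
  // One space on each side marks the word boundaries (' s', 'k ', ' st', ...).
  const padded = ` ${word} `;
  const grams = [];
  for (let size = 1; size <= maxSize; size++) {
    // In the log, single characters come from the bare word,
    // while grams of size 2 and more include the boundary spaces.
    const source = size === 1 ? word : padded;
    for (let i = 0; i + size <= source.length; i++) {
      grams.push(source.slice(i, i + size));
    }
  }
  return grams;
}

const grams = extractGrams('starbuck');
console.log(grams.length); // 38, the same count as in the log
console.log(grams.slice(-3)); // [ 'arbuc', 'rbuck', 'buck ' ]
```

Every one of these 38 grams scored 0 in the run above, which is why no language could be picked.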

It gives the same result for a lot of brand names like mcdonald, microsoft, nike.
And when it does detect something, it's often based on a single gram and is quite inaccurate:

  • apple -> sv
  • auchan -> de

This kind of word is usually detected based on the context and the other words around it:

  • this is a starbuck -> en
  • let's go to starbuck -> en
  • allons au starbuck -> fr
  • starbucksに行こう -> ja

There is not much to do here except adding more language grams and a bigger database, but even then, with so few characters and such generic grams, the result will always be fairly inaccurate.

Just for testing, I built and benchmarked a version with a bigger text database (2.5MB).
So it is really meant for server-side usage, with a bigger memory footprint.

It slightly increases the overall accuracy to ~99.3% (+0.9%) and improves accuracy on short inputs.
The input length needed to reach 95% accuracy drops from 24 to 16 characters.

But even with that brute-force method, nothing changes for single-word detection.
It just shows the order of magnitude of what this kind of approach can do.

> ./bin/tinyld-large.js starbuck
[
  { lang: 'lv', accuracy: 0.125 },
  { lang: 'lt', accuracy: 0.0921 },
  { lang: 'en', accuracy: 0.039474999999999996 },
  { lang: 'de', accuracy: 0.0329 }
]

> ./bin/tinyld-large.js mcdonalds
[ { lang: 'is', accuracy: 1 } ]

> ./bin/tinyld-large.js apple
[
  { lang: 'sv', accuracy: 0.2 },
  { lang: 'no', accuracy: 0.13334000000000001 },
  { lang: 'en', accuracy: 0.07778 }
]
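
Given the 24 and 16 character figures above, one option on the caller's side is to refuse to guess below a length threshold rather than trust a low-confidence result. The sketch below is an assumption-laden illustration, not part of tinyld: `detectIfLongEnough`, `minChars` and `fakeDetect` are hypothetical names, and `detectFn` stands for any function that maps a text to a language code (for example the library's `detect`).

```javascript
// Hypothetical wrapper, not a tinyld API.
// `detectFn` is any function mapping a text to a language code.
function detectIfLongEnough(text, detectFn, minChars = 16) {
  const trimmed = text.trim();
  // Too short to trust a statistical guess: report "unknown" instead.
  if (trimmed.length < minChars) return null;
  const lang = detectFn(trimmed);
  // Treat an empty or missing answer as "unknown" as well.
  return lang || null;
}

// Stand-in detector for demonstration only; swap in a real one.
const fakeDetect = () => 'en';

console.log(detectIfLongEnough('starbuck', fakeDetect)); // null
console.log(detectIfLongEnough("let's go to starbuck", fakeDetect)); // en
```

The threshold is a trade-off: 16 matches the larger database measured above, 24 the default one.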

@kefniark I have tried other texts as well, but it is not working as expected.
For "Platos Desayuno" the result should be es, but tr is returned. I am using this package to detect the language of food names, and it fails in multiple cases. Can you advise which kinds of strings should be avoided?

Other strings are listed below. The result should be es, but detection fails for most of them:

  1. Guarniciones de
  2. ensaladas
  3. Panadería
  4. Licores
  5. Cerveza
  6. Café
  7. Bebidas
  8. Vino por copa
  9. Vino por botella
  10. Cocteles

This library relies on statistical analysis, which works well but requires a certain number of characters to build up statistics and recognize patterns. And tbh, that's the case for 90% of language detection libraries.

Based on your list and the fact that most of your names are under 16 characters, don't waste your time: it's not the kind of algorithm you are looking for. Maybe AI can do slightly better, but I still expect a high error rate, as such models are mostly trained to detect the language of documents.

If you are using it for something specific like "food names", the simplest and most accurate approach is probably dictionary matching: not trying to guess a language, just doing straight word matching, e.g. "vino" -> es.
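
A minimal sketch of that dictionary matching idea, assuming a hand-maintained word list per language. Everything here is hypothetical and independent of tinyld: the `FOOD_WORDS` list (filled with words from the menu entries in this thread), `normalize` and `matchLanguage` are names made up for the example.

```javascript
// Hypothetical word list built from the menu entries in this thread.
// Extend it per language and per use case.
const FOOD_WORDS = {
  es: ['vino', 'cerveza', 'bebidas', 'ensaladas', 'café', 'panadería',
       'licores', 'cocteles', 'desayuno', 'guarniciones'],
};

// Lowercase and strip accents so 'Panadería' and 'panaderia' match.
function normalize(text) {
  return text.toLowerCase().normalize('NFD').replace(/\p{M}/gu, '');
}

// Returns the language with the most known words, or null when nothing matches.
function matchLanguage(name, dictionary = FOOD_WORDS) {
  const words = normalize(name).split(/[^\p{L}]+/u).filter(Boolean);
  let best = null;
  let bestHits = 0;
  for (const [lang, vocab] of Object.entries(dictionary)) {
    const known = new Set(vocab.map(normalize));
    const hits = words.filter((w) => known.has(w)).length;
    if (hits > bestHits) {
      best = lang;
      bestHits = hits;
    }
  }
  return best;
}

console.log(matchLanguage('Vino por copa')); // es
console.log(matchLanguage('Panadería')); // es
console.log(matchLanguage('Starbucks')); // null
```

Words shared between languages (for instance "café", which is also French) would need extra handling once more than one language is listed, such as ranking by several matched words or leaving ambiguous words out.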

Because I got another issue (#19) about the same topic, "Detection of short texts", I decided to create a FAQ.

Here is the answer, a summary of what was already discussed in this issue