pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

I believe withMinimumRelativeDistance()'s threshold should be dynamic depending on the top languages that are detected

cedivad opened this issue

withMinimumRelativeDistance() is a great way to filter out false positives, but because the threshold is a single fixed number, some languages can't be detected at all!

Take Nynorsk and Bokmal for example. Their relative distance is very small (0.02 in the example below), so you can't use them together with a minimum-distance threshold without risking that neither is ever detected.

I'm sure there are many language pairs that are this close, but this is the one that triggered some errors on my end.

Example: Da er dEn skriftrett leseuttale av dansk ble brukt som prekespråk i kirkene, og senere i konfirmasjonsundervisningen og skolene. I skolene vedvarte dette til Stortinget i 1878 vedtok at undervisningen skulle gis på

The text above has these distances:

Nynorsk: 1.00
Bokmal: 0.98
Danish: 0.89
Swedish: 0.84
Estonian: 0.73
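
To make the problem concrete, here is a minimal Java sketch of a fixed-threshold check applied to the values above. It does not call Lingua: the class FixedThresholdDemo, the pick method and the 0.10 threshold are made up for illustration, and it assumes the filter simply compares the gap between the top two values with the threshold.

```java
import java.util.List;
import java.util.Map;

final class FixedThresholdDemo {
    // Hypothetical helper: returns the top language, or "UNKNOWN" when the gap
    // between the two best values is below the fixed threshold.
    static String pick(List<Map.Entry<String, Double>> ranked, double minimumRelativeDistance) {
        if (ranked.isEmpty()) {
            return "UNKNOWN";
        }
        if (ranked.size() == 1) {
            return ranked.get(0).getKey();
        }
        double gap = ranked.get(0).getValue() - ranked.get(1).getValue();
        return gap < minimumRelativeDistance ? "UNKNOWN" : ranked.get(0).getKey();
    }

    public static void main(String[] args) {
        // Values copied from the example above, already sorted best-first.
        List<Map.Entry<String, Double>> ranked = List.of(
                Map.entry("Nynorsk", 1.00),
                Map.entry("Bokmal", 0.98),
                Map.entry("Danish", 0.89));

        System.out.println(pick(ranked, 0.10)); // prints UNKNOWN: the gap of about 0.02 is below 0.10
        System.out.println(pick(ranked, 0.01)); // prints Nynorsk: the gap of about 0.02 is above 0.01
    }
}
```

With any threshold above roughly 0.02, the Nynorsk result is thrown away even though it is the best match.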

Here are some distances I calculated. I will use them outside of Lingua, so I won't bother you with any code, but I'm still wondering which approach makes more sense:

  1. If the first and second detected languages match a pair in the list below, just return the first as good.
  2. Scale the relativeDistance by the number at the start of the row for the detected language; don't bother with the second language.
  3. Both (?)

Distance	First language	Second language
0.6	Czech		Hungarian
0.14	Danish		Bokmal
0.29	German		Dutch
0	Greek
0.23	English		French
0.13	Spanish		Portuguese
0.33	Estonian	Finnish
0.35	Finnish		Estonian
0.34	French		Swedish
1	Hebrew		English
0.99	Hindi		Estonian
0.37	Croatian	Turkish
0.6	Hungarian	Czech
0.25	Italian		Spanish
0	Japanese
0	Korean
0.25	Dutch		German
0.52	Polish		Finnish
0.2	Portuguese	Spanish
0.37	Russian		Serbian
0.29	Serbian		Russian
0.29	Swedish		Bokmal
0	Thai
0.85	Turkish		Swedish
0.8	Vietnamese	Swedish
0	Chinese
0.37	Indonesian	Finnish
0.08	Nynorsk		Bokmal
0.29	Bokmal		German
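
For illustration, the three options could be sketched in plain Java roughly as follows. This is only a sketch under assumptions, not Lingua code: the class DynamicThresholdSketch, its method names and the base threshold of 0.2 are hypothetical, only a few rows from the list above are included, and "scale" is read here as multiplying the base threshold by the number in the detected language's row.

```java
import java.util.Map;

final class DynamicThresholdSketch {
    // A few rows copied from the list above: first language -> number in its row.
    static final Map<String, Double> ROW_DISTANCE = Map.of(
            "Nynorsk", 0.08,
            "Danish", 0.14,
            "Spanish", 0.13,
            "English", 0.23,
            "German", 0.29);

    // The same rows: first language -> second language in its row.
    static final Map<String, String> ROW_SECOND = Map.of(
            "Nynorsk", "Bokmal",
            "Danish", "Bokmal",
            "Spanish", "Portuguese",
            "English", "French",
            "German", "Dutch");

    // Option 1: the detected top two languages match a known close pair.
    static boolean isKnownPair(String first, String second) {
        return second.equals(ROW_SECOND.get(first));
    }

    // Option 2: scale the base threshold by the row value of the top language.
    // Languages missing from the map keep the unscaled threshold (an assumption).
    static double scaledThreshold(String first, double baseThreshold) {
        return baseThreshold * ROW_DISTANCE.getOrDefault(first, 1.0);
    }

    // Option 3 ("both"): accept if either rule says the top language is good.
    static boolean accept(String first, double firstValue,
                          String second, double secondValue,
                          double baseThreshold) {
        double distance = firstValue - secondValue;
        return isKnownPair(first, second)
                || distance >= scaledThreshold(first, baseThreshold);
    }

    public static void main(String[] args) {
        // The Nynorsk/Bokmal example from this issue: accepted.
        System.out.println(accept("Nynorsk", 1.00, "Bokmal", 0.98, 0.2)); // prints true
        // A made-up unusual pair with a tiny gap: still rejected.
        System.out.println(accept("English", 1.00, "German", 0.99, 0.2)); // prints false
    }
}
```

In this reading, option 1 only rescues the pairs that are already known to be close, while option 2 loosens the threshold for every result in a close language, whatever the runner-up is.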