I believe withMinimumRelativeDistance()'s threshold should be dynamic depending on the top languages that are detected
cedivad opened this issue
withMinimumRelativeDistance() is a great way to filter out false positives, but because its threshold is a single absolute number, it can make some languages impossible to detect at all!
Take Nynorsk and Bokmal for example. Because the relative distance between them is very small (0.02 in the example below), you can't detect them with a minimum-distance threshold in place without risking that neither is ever returned.
I'm sure there are many language pairs that are this close, but this is the one that triggered some errors on my end.
Example: Da er dEn skriftrett leseuttale av dansk ble brukt som prekespråk i kirkene, og senere i konfirmasjonsundervisningen og skolene. I skolene vedvarte dette til Stortinget i 1878 vedtok at undervisningen skulle gis på
- has distances of:
Nynorsk: 1.00
Bokmal: 0.98
Danish: 0.89
Swedish: 0.84
Estonian: 0.73
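To make the problem concrete, here is a minimal Python sketch of what a fixed threshold does to the values above. It assumes the threshold is compared against the gap between the top two values; `detect_with_fixed_threshold` is a hypothetical helper written for illustration, not part of Lingua.

```python
# Values copied from the example above.
distances = {
    "Nynorsk": 1.00,
    "Bokmal": 0.98,
    "Danish": 0.89,
    "Swedish": 0.84,
    "Estonian": 0.73,
}


def detect_with_fixed_threshold(values, minimum_relative_distance):
    """Return the top language, or None if the top two are too close.

    Assumption: the threshold is applied to (best value - runner-up value).
    Expects at least two entries in `values`.
    """
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_value), (_, runner_up_value) = ranked[0], ranked[1]
    gap = best_value - runner_up_value
    if gap < minimum_relative_distance:
        return None  # rejected as "unknown"
    return best


# The gap between Nynorsk and Bokmal is about 0.02, so a threshold of 0.1 rejects the text.
print(detect_with_fixed_threshold(distances, 0.1))
# A much smaller threshold lets Nynorsk through, but also filters far fewer false positives.
print(detect_with_fixed_threshold(distances, 0.01))
```

Under that assumption, any threshold large enough to be useful for distant languages also rejects every Nynorsk/Bokmal text.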
Here are some distances I calculated. I will use them outside of Lingua so I won't bother you with any code, but I'm still wondering which of these makes more sense:
- If the first and second detected languages match a pair in the list below, just accept the first one as good.
- Scale the relativeDistance by the number in the row for the first detected language, and don't bother with the second language.
- Both (?)
(distance, first detected language, second detected language)
0.6 Czech Hungarian
0.14 Danish Bokmal
0.29 German Dutch
0 Greek
0.23 English French
0.13 Spanish Portuguese
0.33 Estonian Finnish
0.35 Finnish Estonian
0.34 French Swedish
1 Hebrew English
0.99 Hindi Estonian
0.37 Croatian Turkish
0.6 Hungarian Czech
0.25 Italian Spanish
0 Japanese
0 Korean
0.25 Dutch German
0.52 Polish Finnish
0.2 Portuguese Spanish
0.37 Russian Serbian
0.29 Serbian Russian
0.29 Swedish Bokmal
0 Thai
0.85 Turkish Swedish
0.8 Vietnamese Swedish
0 Chinese
0.37 Indonesian Finnish
0.08 Nynorsk Bokmal
0.29 Bokmal German
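For discussion, the first two options could be sketched roughly like this in Python. Everything here is hypothetical and lives outside Lingua: the `TYPICAL` dictionary holds only a few rows copied from the table above, and "scale" is read as multiplying the base threshold by the per-language number, which is just one possible interpretation.

```python
# Subset of the table above: first language -> (distance, second language).
# Rows without a second language are stored with None.
TYPICAL = {
    "Nynorsk": (0.08, "Bokmal"),
    "Danish": (0.14, "Bokmal"),
    "Spanish": (0.13, "Portuguese"),
    "Portuguese": (0.2, "Spanish"),
    "German": (0.29, "Dutch"),
    "Greek": (0.0, None),
}


def top_two(values):
    """Return the two highest-scoring (language, value) pairs; needs >= 2 entries."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0], ranked[1]


def accept_known_pair(values, threshold):
    """Option 1: if the top two match a known close pair, accept the first."""
    (first, first_value), (second, second_value) = top_two(values)
    if first in TYPICAL and TYPICAL[first][1] == second:
        return first
    # Otherwise fall back to the plain absolute threshold.
    return first if first_value - second_value >= threshold else None


def accept_scaled(values, threshold):
    """Option 2: scale the threshold by the first language's table value."""
    (first, first_value), (_, second_value) = top_two(values)
    # Languages missing from the table keep the unscaled threshold (factor 1.0).
    factor = TYPICAL.get(first, (1.0, None))[0]
    return first if first_value - second_value >= threshold * factor else None


example = {"Nynorsk": 1.00, "Bokmal": 0.98, "Danish": 0.89}
print(accept_known_pair(example, 0.1))
print(accept_scaled(example, 0.1))
```

The two options behave differently on very close calls: with German at 0.50 and Dutch at 0.49, the pair check accepts German unconditionally, while the scaled threshold (0.1 * 0.29) still rejects a gap of 0.01. Doing both would mean deciding which rule wins in that case.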