komodojp / tinyld

Simple and Performant Language detection library for NodeJS

Home Page: https://komodojp.github.io/tinyld/

English sentence seeing different results in all 3 versions

thewilkybarkid opened this issue

We're using the heavy version (v1.3.4), and I've just spotted that "A population perspective on international students in Australian universities" is detected as fr rather than en.

FR 4.02%
EN 3.65%
LA 2.37%
FI 2.13%
LV 2.06%

Looking at the Playground, it would be recognised as lv using the normal version:

LV 2.06%
FR 1.66%
FI 1.48%
ET 1.45%
EN 0.89%

And only correct using the light version:

EN 3.33%
FR 2.29%
NL 1.72%
FI 1.45%
IT 1.45%

I don't know much about Tatoeba. When we see incorrect detection, would it make sense to add the sentence there and hope that it triggers a tweak in this library? (A few other issues are open like this; could there be some guidance about what to do?)

Found a couple more:

"DRAFT: Developing and implementing the semantic interoperability recommendations of the EOSC Interoperability Framework" is confidently la rather than en in the heavy and normal versions; this looks to be triggered by 'EOSC'. I might be able to strip out acronyms/initialisms on our side, which makes it en in all 3 versions.

"Sardegna grassland mapping for livestock management: a practical Intra-Annual NDVI contrasts approach" is confidently lt in heavy, fr in normal and en only in light. Removing the initialism ('NDVI') makes it fr in both heavy and normal, and en in light.
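
For reference, the acronym stripping mentioned above could look something like the sketch below. This is only a rough idea of what we might do on our side: `stripInitialisms` is a hypothetical helper, not part of tinyld, and it assumes an initialism is any token of two or more ASCII uppercase letters or digits that starts with a letter.

```typescript
// Hypothetical pre-processing helper (not part of tinyld).
// Assumption: an "initialism" is a whitespace-separated token that, once
// surrounding punctuation is ignored, consists of an ASCII uppercase letter
// followed by one or more ASCII uppercase letters or digits.
function stripInitialisms(text: string): string {
  return text
    .split(/\s+/)
    .filter((token) => {
      // Ignore leading/trailing punctuation when judging the token,
      // so "('NDVI')" or "EOSC," are still treated as initialisms.
      const core = token.replace(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/g, "");
      // Keep the token unless its core looks like an initialism.
      return !/^[A-Z][A-Z0-9]+$/.test(core);
    })
    .join(" ")
    .trim();
}
```

The cleaned string would then be passed to `detect` instead of the raw title. Two caveats: any all-caps word is dropped too (for example the "DRAFT:" prefix), and as the NDVI example shows, removing the initialism doesn't fix every case.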