English sentence seeing different results in all 3 versions

Question

thewilkybarkid opened this issue 5 months ago

We're using the heavy version (v1.3.4), and I've just spotted that "A population perspective on international students in Australian universities" is detected as fr rather than en:

FR 4.02%
EN 3.65%
LA 2.37%
FI 2.13%
LV 2.06%

Looking at the Playground, the normal version would recognise it as lv:

LV 2.06%
FR 1.66%
FI 1.48%
ET 1.45%
EN 0.89%

It's only detected correctly by the light version:

EN 3.33%
FR 2.29%
NL 1.72%
FI 1.45%
IT 1.45%

I don't know much about Tatoeba. When we see incorrect detection, would it make sense to add the sentence there and hope that it triggers a tweak in this library? (A few other issues are open like this; could there be some guidance about what to do?)

Chris Wilkinson · Answer 1 · Wed Feb 28 2024 22:30:10 GMT+0800 (China Standard Time)

Found a couple more:

"DRAFT: Developing and implementing the semantic interoperability recommendations of the EOSC Interoperability Framework" is confidently detected as la rather than en in the heavy and normal versions; this looks to be triggered by 'EOSC'. I might be able to strip out acronyms/initialisms on our side, after which it is detected as en in all 3 versions.

"Sardegna grassland mapping for livestock management: a practical Intra-Annual NDVI contrasts approach" is confidently detected as lt in the heavy version, fr in the normal and en only in the light. Removing the initialism ('NDVI') gives fr in the heavy, fr in the normal and en in the light.
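
For anyone wanting to try the same workaround, here is a rough sketch of the acronym/initialism stripping mentioned above. It is only an assumption about how it might be done on the calling side: `stripInitialisms` is a hypothetical helper, not part of the library, and it simply drops any standalone run of two or more ASCII capitals before the text is handed to whichever detector version is in use.

```typescript
// Hypothetical pre-processing helper (not part of the detection library).
// Assumption: an "initialism" is a standalone token of 2+ ASCII capital letters.
function stripInitialisms(text: string): string {
  return text
    // Replace all-caps tokens such as "EOSC" or "NDVI" with a space.
    .replace(/\b[A-Z]{2,}\b/g, " ")
    // Collapse the whitespace left behind and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}

const title =
  "Sardegna grassland mapping for livestock management: " +
  "a practical Intra-Annual NDVI contrasts approach";

// The cleaned string is what would be passed to the detector.
console.log(stripInitialisms(title));
```

Note that this also removes other all-caps words (for example a leading "DRAFT"), and, as the 'NDVI' title shows, stripping alone does not always fix the result: the heavy and normal versions still report fr there.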