komodojp / tinyld

Simple and Performant Language detection library for NodeJS

Home Page: https://komodojp.github.io/tinyld/

Differentiate similar languages

kefniark opened this issue

Description

Some pairs of languages are always at the top of the detection errors:

  • pt -> es : 13.9375% (error: 1355)
  • en -> nl : 15.8565% (error: 884)
  • pt -> it : 5.431% (error: 528)
  • ru -> uk : 2.4686% (error: 240)

All of them make sense: Dutch and English are really close, and the same goes for Portuguese and Spanish.

The idea is to reduce the error rate by giving extra weight to grams that occur in only one language of the pair.

I started investigating the idea of pre-building small n-gram dictionaries to identify grams unique to one language within a family.

  • Only build dictionaries for language groups with a high error rate.
  • In the algorithm, this would be a 4th step, at the end of the process.
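To make the dictionary idea concrete, here is a minimal sketch of how such a per-group dictionary could be pre-built. It is not tinyld's actual code: `extractGrams`, `buildUniqueGrams`, the trigram size and the toy corpus are all assumptions made for illustration.

```typescript
// Hypothetical helpers for illustration only; not part of the tinyld API.
type Lang = string;

// Split text into overlapping character n-grams (trigrams by default).
function extractGrams(text: string, n = 3): string[] {
  const clean = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams: string[] = [];
  for (let i = 0; i + n <= clean.length; i++) {
    grams.push(clean.slice(i, i + n));
  }
  return grams;
}

// For each language of a group, keep only the grams that appear
// in no other language of the same group.
function buildUniqueGrams(
  corpus: Record<Lang, string[]>,
  n = 3
): Record<Lang, Set<string>> {
  const langs = Object.keys(corpus);

  // First pass: collect every gram seen per language.
  const seen: Record<Lang, Set<string>> = {};
  langs.forEach((lang) => {
    const grams = new Set<string>();
    corpus[lang].forEach((sentence) => {
      extractGrams(sentence, n).forEach((g) => grams.add(g));
    });
    seen[lang] = grams;
  });

  // Second pass: drop grams shared with any other language of the group.
  const unique: Record<Lang, Set<string>> = {};
  langs.forEach((lang) => {
    const others = langs.filter((l) => l !== lang);
    unique[lang] = new Set(
      Array.from(seen[lang]).filter((g) => !others.some((o) => seen[o].has(g)))
    );
  });
  return unique;
}

// Toy corpus (assumption): a real dictionary would need far more text.
const esPt = buildUniqueGrams({
  es: ["el niño come una manzana"],
  pt: ["o menino come uma maçã"],
});
```

With this toy input, grams such as "com" (from "come", present in both sentences) are dropped, while grams that occur on one side only are kept for that language.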

Example

  • Build dictionaries for groups like Spanish-Portuguese or English-Dutch-German, and identify the grams unique to each language within those families.
  • Then at query time, if a chunk has both Spanish and Portuguese among the possible results, check whether it contains any of those "unique" grams and weight the final % accordingly.
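The query-time step could look roughly like the sketch below. Everything in it is an assumption for illustration: the `reweight` function, the hand-written `uniqueGrams` dictionary, the `boost` factor and the renormalization are not taken from tinyld.

```typescript
// Hypothetical sketch; names and the boost factor are assumptions, not tinyld internals.
type Scores = Record<string, number>;

// Hand-written toy dictionary of grams assumed to be unique within the es/pt pair.
const uniqueGrams: Record<string, Set<string>> = {
  es: new Set(["niñ", "ño ", " el"]),
  pt: new Set(["ção", "ão ", "maç"]),
};

// Overlapping character trigrams of the lowercased text.
function trigrams(text: string): string[] {
  const clean = text.toLowerCase();
  const out: string[] = [];
  for (let i = 0; i + 3 <= clean.length; i++) out.push(clean.slice(i, i + 3));
  return out;
}

// Boost each candidate of the group by the number of its unique grams
// found in the text, then renormalize so that the scores sum to 1.
function reweight(text: string, scores: Scores, group: string[], boost = 0.1): Scores {
  const present = group.filter((lang) => lang in scores);
  // Only act when at least two languages of the group are competing.
  if (present.length < 2) return scores;

  const grams = trigrams(text);
  const adjusted: Scores = { ...scores };
  present.forEach((lang) => {
    const dict = uniqueGrams[lang];
    const hits = dict ? grams.filter((g) => dict.has(g)).length : 0;
    adjusted[lang] = scores[lang] * (1 + boost * hits);
  });

  const keys = Object.keys(adjusted);
  const total = keys.reduce((sum, l) => sum + adjusted[l], 0);
  keys.forEach((l) => {
    adjusted[l] = adjusted[l] / total;
  });
  return adjusted;
}

// A tie between pt and es is broken by the Portuguese-only grams "ção" and "ão ".
const result = reweight("a situação do menino", { pt: 0.5, es: 0.5 }, ["es", "pt"]);
```

The multiplicative boost keeps the original detector's ranking unless unique grams are actually present, which is one way to limit the accuracy drops on pairs where the dictionary is weak.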

I tried it and it worked slightly for some pairs of languages, but not for others, and it even caused an accuracy drop for some.
Overall the result was far from useful: only +0.25% accuracy for a lot of dedicated code and data. I decided to give up on it and focus on other areas for the moment.