Differentiate Similar language

Question

kefniark opened this issue 3 years ago

Kevin Destrem commented 3 years ago

Description

Some pairs of languages are always at the top of the detection errors:

pt -> es : 13.9375% (error: 1355)
en -> nl : 15.8565% (error: 884)
pt -> it : 5.431% (error: 528)
ru -> uk : 2.4686% (error: 240)

All of them make sense: Dutch and English are really close, and the same goes for Portuguese and Spanish.

The idea is to find a way to reduce the error rate by putting some extra weight on n-grams that appear in only one language of the pair.

Kevin Destrem · Answer 1 · Thu Dec 16 2021 13:43:03 GMT+0800 (China Standard Time)

Started to investigate the idea of pre-building small n-gram dictionaries to identify grams unique to one language within a family.

Only make dictionaries for language groups with a high error rate.
In the algorithm, this would be a 4th step, at the end of the process.

Example

Make dictionaries for groups like Spanish-Portuguese and English-Dutch-German, and identify the grams unique to each language in those families.
Then, at query time, if a chunk has both Spanish and Portuguese among the possible results, check whether it contains any of those "unique" grams and weight the final % accordingly.
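
To make the idea concrete, here is a rough sketch in Python. It is not the code that was actually tried in this issue; it only illustrates the two parts described above. The function names (`char_ngrams`, `build_unique_grams`, `reweight`), the use of character trigrams, the `boost` factor and the tiny sample corpora are all assumptions made for the illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_unique_grams(corpora, n=3):
    """For each language of a family, keep the grams seen in no other member.

    `corpora` maps a language code to a list of sample sentences
    (hypothetical input format).
    """
    grams = {
        lang: set().union(*(char_ngrams(s, n) for s in samples))
        for lang, samples in corpora.items()
    }
    unique = {}
    for lang, own in grams.items():
        others = set().union(*(g for other, g in grams.items() if other != lang))
        unique[lang] = own - others
    return unique


def reweight(scores, text, unique, boost=0.1, n=3):
    """Extra final step: boost family members by their unique-gram hits.

    Only applies when at least two languages of the family are candidates.
    `boost` is an arbitrary illustrative value, not a tuned one.
    """
    family = [lang for lang in unique if lang in scores]
    if len(family) < 2:
        return dict(scores)
    grams = char_ngrams(text, n)
    adjusted = dict(scores)
    for lang in family:
        hits = len(grams & unique[lang])
        adjusted[lang] = scores[lang] * (1 + boost * hits)
    total = sum(adjusted.values())
    return {lang: s / total for lang, s in adjusted.items()}


# Toy usage: a tie between Spanish and Portuguese gets broken by unique grams.
corpora = {
    "es": ["el niño está aquí", "muchas gracias"],
    "pt": ["o menino está aqui", "muito obrigado"],
}
unique = build_unique_grams(corpora)
result = reweight({"es": 0.5, "pt": 0.5}, "muito obrigado", unique)
```

A real dictionary would be built from a much larger corpus, and grams that are merely rare in the other language (rather than completely absent) would probably need a frequency threshold instead of a strict set difference.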

Kevin Destrem · Answer 2 · Thu Jan 06 2022 09:08:36 GMT+0800 (China Standard Time)

Tried it: it worked slightly for some pairs of languages, but not for others, and it even caused an accuracy drop for some.
Overall the result was far from useful: only +0.25% accuracy for a lot of dedicated code and data. I decided to give up on this and focus on other areas for the moment.